1
Nassour N, Akhbari B, Ranganathan N, Tawakol A, Rosovsky RP, Guss D, DiGiovanni CW, Ashkani-Esfahani S. Correlation Between Statin Use and Symptomatic Venous Thromboembolism Incidence in Patients With Ankle Fracture: A Machine Learning Approach. Foot Ankle Spec 2024; 17:604-612. [PMID: 37905534 DOI: 10.1177/19386400231207692] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2023]
Abstract
BACKGROUND Identifying factors that correlate with the incidence of venous thromboembolism (VTE) has the potential to improve VTE prevention and positively influence decision-making regarding prophylaxis. In this study, we aimed to investigate the correlation between statin consumption and the incidence of VTE in patients who sustained an ankle fracture. METHODS In this retrospective case-control study, cases were patients who developed VTE and controls were patients who did not, at a ratio of 1:4. Patients' demographics, history of hyperlipidemia, and reported statin use were obtained. A random forest classifier (RFC) model was used to predict whether statin consumers were at risk of VTE after ankle fracture, regardless of VTE prophylaxis administration, based on statin consumption, body mass index (BMI), age, and biological sex. RESULTS Of the 1175 patients with ankle fractures, 238 had confirmed VTE (case group), and 937 had no symptomatic VTE (control group; ratio 1:4). Fifty (21%) cases and 407 (43%) controls were on a statin. Statin users had a significantly lower incidence of VTE after ankle fracture (odds ratio [OR] = 0.35; 95% CI: 0.25, 0.49; P < .001). Our model showed an area under the receiver operating characteristic curve (AUROC) of 78%, a sensitivity of 73%, and a specificity of 83% in predicting the risk of VTE. Apart from statin use (model importance = 0.1), the predictors of VTE were age (model importance = 0.72), BMI (model importance = 0.24), and biological sex (model importance = 0.02). CONCLUSION Statins were significantly associated with a lower rate of VTE in our population of patients who sustained an ankle fracture. LEVELS OF EVIDENCE 3.
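The reported odds ratio can be checked against the counts given in the abstract (50 of 238 cases and 407 of 937 controls on a statin). The sketch below assumes an unadjusted 2×2 table and a Woolf (log-based) 95% confidence interval; the abstract does not say which interval method the authors used, and the function name `odds_ratio_woolf` is purely illustrative.

```python
import math


def odds_ratio_woolf(exposed_cases, total_cases, exposed_controls, total_controls, z=1.96):
    """Unadjusted odds ratio with a Woolf (log-scale) confidence interval.

    Assumption: a plain 2x2 table with no adjustment for covariates.
    """
    a = exposed_cases                      # cases on a statin
    b = total_cases - exposed_cases        # cases not on a statin
    c = exposed_controls                   # controls on a statin
    d = total_controls - exposed_controls  # controls not on a statin

    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the Woolf approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    or_, lo, hi = odds_ratio_woolf(50, 238, 407, 937)
    print(f"OR = {or_:.2f}, 95% CI: {lo:.2f}, {hi:.2f}")
```

With the abstract's counts this gives OR = 0.35 with a 95% CI of 0.25 to 0.49, which agrees with the reported values.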
Affiliation(s)
- Nour Nassour
  - Foot & Ankle Research and Innovation Laboratory, Department of Orthopaedic Surgery, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts
- Bardiya Akhbari
  - Foot & Ankle Research and Innovation Laboratory, Department of Orthopaedic Surgery, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts
- Noopur Ranganathan
  - Foot & Ankle Research and Innovation Laboratory, Department of Orthopaedic Surgery, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts
- Ahmed Tawakol
  - Division of Cardiology, Department of Medicine, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts
- Rachel P Rosovsky
  - Division of Hematology, Department of Medicine, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts
- Daniel Guss
  - Foot & Ankle Research and Innovation Laboratory, Department of Orthopaedic Surgery, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts
  - Foot and Ankle Division, Department of Orthopaedic Surgery, Massachusetts General Hospital, Newton Wellesley Hospital, Harvard Medical School, Boston, Massachusetts
- Christopher W DiGiovanni
  - Foot & Ankle Research and Innovation Laboratory, Department of Orthopaedic Surgery, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts
  - Foot and Ankle Division, Department of Orthopaedic Surgery, Massachusetts General Hospital, Newton Wellesley Hospital, Harvard Medical School, Boston, Massachusetts
- Soheil Ashkani-Esfahani
  - Foot & Ankle Research and Innovation Laboratory, Department of Orthopaedic Surgery, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts
  - Foot and Ankle Division, Department of Orthopaedic Surgery, Massachusetts General Hospital, Newton Wellesley Hospital, Harvard Medical School, Boston, Massachusetts
2
Wang M, Li G, Dong L, Hou Z, Zhang J, Li D. Severity Identification of Graves Orbitopathy via Random Forest Algorithm. Horm Metab Res 2024; 56:706-711. [PMID: 38588699 DOI: 10.1055/a-2287-3734] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/10/2024]
Abstract
This study aims to establish a random forest model for detecting the severity of Graves Orbitopathy (GO) and identify significant classification factors. This is a hospital-based study of 199 patients with GO whose data were collected between December 2019 and February 2022. Clinical information was collected from medical records. The severity of GO can be categorized as mild, moderate-to-severe, and sight-threatening GO based on guidelines of the European Group on Graves' orbitopathy. A random forest model was constructed according to the risk factors of GO and the main ocular symptoms of patients to differentiate mild GO from severe GO, and was then compared with logistic regression analysis, Support Vector Machine (SVM), and Naive Bayes. A random forest model with 15 variables was constructed. Blurred vision, disease course, thyroid-stimulating hormone receptor antibodies, and age ranked high in both mean decrease in Gini and mean decrease in accuracy. The accuracy, positive predictive value, negative predictive value, and F1 score of the random forest model were 0.83, 0.82, 0.86, and 0.82, respectively. Compared to the three other models, our random forest model showed a more reliable performance based on AUC (0.85 vs. 0.83 vs. 0.80 vs. 0.76) and accuracy (0.83 vs. 0.78 vs. 0.77 vs. 0.70). In conclusion, this study shows the potential for applying a random forest model as a complementary tool to differentiate GO severity.
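The four metrics quoted above (accuracy, positive predictive value, negative predictive value, F1 score) all derive from one binary confusion matrix. The following is a minimal sketch of those definitions; the counts in the usage example are hypothetical and are not taken from the study, and `classification_metrics` is an illustrative name.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return common binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    ppv = tp / (tp + fp)            # positive predictive value (precision)
    npv = tn / (tn + fn)            # negative predictive value
    sensitivity = tp / (tp + fn)    # recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "accuracy": accuracy,
        "ppv": ppv,
        "npv": npv,
        "sensitivity": sensitivity,
        "f1": f1,
    }


if __name__ == "__main__":
    # Hypothetical counts chosen only to illustrate the formulas.
    metrics = classification_metrics(tp=40, fp=10, fn=10, tn=40)
    for name, value in metrics.items():
        print(f"{name}: {value:.2f}")
```

Because F1 ignores true negatives, it can differ noticeably from accuracy when the classes are imbalanced.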
Affiliation(s)
- Minghui Wang
  - Beijing Tongren Eye Center, Beijing Tongren Hospital, Beijing Ophthalmology and Visual Science Key Lab, Beijing, China
  - Department of Ophthalmology, Faculty of Medicine and University Hospital Cologne, University of Cologne, Cologne, Germany
- Gongfei Li
  - Department of Neurology and Stroke, University of Tübingen, Tübingen, Germany
- Li Dong
  - Beijing Tongren Eye Center, Beijing Tongren Hospital, Beijing Ophthalmology and Visual Science Key Lab, Beijing, China
- Zhijia Hou
  - Beijing Tongren Eye Center, Beijing Tongren Hospital, Beijing Ophthalmology and Visual Science Key Lab, Beijing, China
- Ju Zhang
  - Beijing Tongren Eye Center, Beijing Tongren Hospital, Beijing Ophthalmology and Visual Science Key Lab, Beijing, China
- Dongmei Li
  - Beijing Tongren Eye Center, Beijing Tongren Hospital, Beijing Ophthalmology and Visual Science Key Lab, Beijing, China
3
Tang J, Mou M, Zheng X, Yan J, Pan Z, Zhang J, Li B, Yang Q, Wang Y, Zhang Y, Gao J, Li S, Yang H, Zhu F. Strategy for Identifying a Robust Metabolomic Signature Reveals the Altered Lipid Metabolism in Pituitary Adenoma. Anal Chem 2024; 96:4745-4755. [PMID: 38417094 DOI: 10.1021/acs.analchem.3c03796] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/01/2024]
Abstract
Despite the well-established connection between systematic metabolic abnormalities and the pathophysiology of pituitary adenoma (PA), current metabolomic studies have reported an extremely limited number of metabolites associated with PA. Moreover, there was very little consistency in the identified metabolite signatures, resulting in a lack of robust metabolic biomarkers for the diagnosis and treatment of PA. Herein, we performed a global untargeted plasma metabolomic profiling on PA and identified a highly robust metabolomic signature based on a strategy. Specifically, this strategy is unique in (1) integrating repeated random sampling and a consensus evaluation-based feature selection algorithm and (2) evaluating the consistency of metabolomic signatures among different sample groups. This strategy demonstrated superior robustness and stronger discriminative ability compared with that of other feature selection methods including Student's t-test, partial least-squares-discriminant analysis, support vector machine recursive feature elimination, and random forest recursive feature elimination. More importantly, a highly robust metabolomic signature comprising 45 PA-specific differential metabolites was identified. Moreover, metabolite set enrichment analysis of these potential metabolic biomarkers revealed altered lipid metabolism in PA. In conclusion, our findings contribute to a better understanding of the metabolic changes in PA and may have implications for the development of diagnostic and therapeutic approaches targeting lipid metabolism in PA. We believe that the proposed strategy serves as a valuable tool for screening robust, discriminating metabolic features in the field of metabolomics.
Affiliation(s)
- Jing Tang
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
  - Department of Bioinformatics, Chongqing Medical University, Chongqing 400016, China
- Minjie Mou
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
- Xin Zheng
  - Multidisciplinary Center for Pituitary Adenoma of Chongqing, Department of Neurosurgery, Xinqiao Hospital, Army Medical University, Chongqing 400037, China
- Jin Yan
  - Multidisciplinary Center for Pituitary Adenoma of Chongqing, Department of Neurosurgery, Xinqiao Hospital, Army Medical University, Chongqing 400037, China
- Ziqi Pan
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
- Jinsong Zhang
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
- Bo Li
  - School of Pharmaceutical Sciences, Chongqing University, Chongqing 401331, China
- Qingxia Yang
  - School of Pharmaceutical Sciences, Chongqing University, Chongqing 401331, China
- Yunxia Wang
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
- Ying Zhang
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
- Jianqing Gao
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
- Song Li
  - Multidisciplinary Center for Pituitary Adenoma of Chongqing, Department of Neurosurgery, Xinqiao Hospital, Army Medical University, Chongqing 400037, China
- Hui Yang
  - Multidisciplinary Center for Pituitary Adenoma of Chongqing, Department of Neurosurgery, Xinqiao Hospital, Army Medical University, Chongqing 400037, China
- Feng Zhu
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou 330110, China
4
Tschodu D, Lippoldt J, Gottheil P, Wegscheider AS, Käs JA, Niendorf A. Re-evaluation of publicly available gene-expression databases using machine-learning yields a maximum prognostic power in breast cancer. Sci Rep 2023; 13:16402. [PMID: 37798300 PMCID: PMC10556090 DOI: 10.1038/s41598-023-41090-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/17/2023] [Accepted: 08/22/2023] [Indexed: 10/07/2023] Open
Abstract
Gene expression signatures refer to patterns of gene activities and are used to classify different types of cancer, determine prognosis, and guide treatment decisions. Advancements in high-throughput technology and machine learning have led to improvements in predicting a patient's prognosis for different cancer phenotypes. However, computational methods for analyzing signatures have not been used to evaluate their prognostic power. Contention remains on the utility of gene expression signatures for prognosis. The prevalent approaches include random signatures, expert knowledge, and machine learning to construct an improved signature. We unify these approaches to evaluate their prognostic power. Re-evaluation of publicly available gene-expression data from 8 databases with 9 machine-learning models revealed previously unreported results. Gene-expression signatures are confirmed to be useful in predicting a patient's prognosis. Convergent evidence from [Formula: see text] 10,000 signatures implicates a maximum prognostic power. By calculating the concordance index, which measures how well patients with different prognoses can be discriminated, we show that a signature can correctly discriminate patients' prognoses no more than 80% of the time. Additionally, we show that more than 50% of the potentially available information is still missing at this value. We surmise that an accurate prognosis must incorporate molecular, clinical, histological, and other complementary factors.
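The concordance index mentioned above can be illustrated with a small sketch. This assumes Harrell's formulation for right-censored survival data, which the abstract does not confirm as the authors' exact variant; the toy data and the name `concordance_index` are illustrative only.

```python
def concordance_index(times, events, risks):
    """Harrell-style concordance index for right-censored data.

    times  : observed follow-up times
    events : 1 if the event was observed, 0 if censored
    risks  : predicted risk scores (higher = worse prognosis)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had an observed event
            # strictly before subject j's follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0   # higher risk failed earlier
                elif risks[i] == risks[j]:
                    concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    # Toy data (not from the study): four patients, one censored.
    times = [2, 4, 6, 8]
    events = [1, 1, 0, 1]
    risks = [0.9, 0.7, 0.8, 0.1]
    print(concordance_index(times, events, risks))
```

In this toy example four of the five comparable pairs are ordered correctly, so the index is 0.8; a value of 0.5 corresponds to random ordering and 1.0 to perfect discrimination.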
Affiliation(s)
- Dimitrij Tschodu
  - Peter Debye Institute for Soft Matter Physics, Leipzig University, 04103, Leipzig, Germany
- Jürgen Lippoldt
  - Peter Debye Institute for Soft Matter Physics, Leipzig University, 04103, Leipzig, Germany
- Pablo Gottheil
  - Peter Debye Institute for Soft Matter Physics, Leipzig University, 04103, Leipzig, Germany
- Anne-Sophie Wegscheider
  - Institute for Histology, Cytology and Molecular Diagnostics, MVZ Prof. Dr. med. A. Niendorf Pathologie Hamburg-West GmbH, 22767, Hamburg, Germany
- Josef A Käs
  - Peter Debye Institute for Soft Matter Physics, Leipzig University, 04103, Leipzig, Germany
- Axel Niendorf
  - Institute for Histology, Cytology and Molecular Diagnostics, MVZ Prof. Dr. med. A. Niendorf Pathologie Hamburg-West GmbH, 22767, Hamburg, Germany
5
Alromema N, Syed AH, Khan T. A Hybrid Machine Learning Approach to Screen Optimal Predictors for the Classification of Primary Breast Tumors from Gene Expression Microarray Data. Diagnostics (Basel) 2023; 13:diagnostics13040708. [PMID: 36832196 PMCID: PMC9955903 DOI: 10.3390/diagnostics13040708] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 01/05/2023] [Revised: 01/30/2023] [Accepted: 02/07/2023] [Indexed: 02/16/2023] Open
Abstract
The high dimensionality and sparsity of the microarray gene expression data make it challenging to analyze and screen the optimal subset of genes as predictors of breast cancer (BC). The authors in the present study propose a novel hybrid Feature Selection (FS) sequential framework involving minimum Redundancy-Maximum Relevance (mRMR), a two-tailed unpaired t-test, and meta-heuristics to screen the optimal set of gene biomarkers as predictors for BC. The proposed framework identified a set of three optimal gene biomarkers, namely, MAPK 1, APOBEC3B, and ENAH. In addition, the state-of-the-art supervised Machine Learning (ML) algorithms, namely Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Neural Net (NN), Naïve Bayes (NB), Decision Tree (DT), eXtreme Gradient Boosting (XGBoost), and Logistic Regression (LR), were used to test the predictive capability of the selected gene biomarkers and select the most effective breast cancer diagnostic model, that is, the one with the highest values of the performance metrics. Our study found that the XGBoost-based model was the superior performer with an accuracy of 0.976 ± 0.027, an F1-Score of 0.974 ± 0.030, and an AUC value of 0.961 ± 0.035 when tested on an independent test dataset. The classification system based on the screened gene biomarkers efficiently detects primary breast tumors from normal breast samples.
Affiliation(s)
- Nashwan Alromema
  - Department of Computer Science, Faculty of Computing and Information Technology Rabigh (FCITR), King Abdulaziz University, Jeddah 22254, Saudi Arabia
- Asif Hassan Syed
  - Department of Computer Science, Faculty of Computing and Information Technology Rabigh (FCITR), King Abdulaziz University, Jeddah 22254, Saudi Arabia
- Tabrej Khan
  - Department of Information Systems, Faculty of Computing and Information Technology Rabigh (FCITR), King Abdulaziz University, Jeddah 22254, Saudi Arabia
6
Giesemann J, Delgadillo J, Schwartz B, Bennemann B, Lutz W. Predicting dropout from psychological treatment using different machine learning algorithms, resampling methods, and sample sizes. Psychother Res 2023:1-13. [PMID: 36669124 DOI: 10.1080/10503307.2022.2161432] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 01/21/2023] Open
Abstract
OBJECTIVE The occurrence of dropout from psychological interventions is associated with poor treatment outcome and high health, societal and economic costs. Recently, machine learning (ML) algorithms have been tested in psychotherapy outcome research. Dropout predictions are usually limited by imbalanced datasets and the size of the sample. This paper aims to improve dropout prediction by comparing ML algorithms, sample sizes and resampling methods. METHOD Twenty ML algorithms were examined in twelve subsamples (drawn from a sample of N = 49,602) using four resampling methods in comparison to the absence of resampling and to each other. Prediction accuracy was evaluated in an independent holdout dataset using the F1-Measure. RESULTS Resampling methods improved the performance of ML algorithms and down-sampling can be recommended, as it was the fastest method and as accurate as the other methods. For the highest mean F1-Score of .51 a minimum sample size of N = 300 was necessary. No specific algorithm or algorithm group can be recommended. CONCLUSION Resampling methods could improve the accuracy of predicting dropout in psychological interventions. Down-sampling is recommended as it is the least computationally taxing method. The training sample should contain at least 300 cases.
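Down-sampling, the resampling method recommended above, can be sketched as random removal of majority-class cases until every class matches the minority-class size. The sketch below is a simplified illustration under that assumption, not the authors' pipeline; `downsample` and the toy data are hypothetical. In line with the abstract's independent holdout evaluation, only the training data would be resampled.

```python
import random
from collections import defaultdict


def downsample(features, labels, seed=0):
    """Randomly down-sample every class to the size of the smallest class."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible

    # Group row indices by class label.
    indices_by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        indices_by_class[label].append(idx)

    n_min = min(len(idxs) for idxs in indices_by_class.values())

    # Keep a random subset of n_min rows from each class.
    kept = []
    for idxs in indices_by_class.values():
        kept.extend(rng.sample(idxs, n_min))
    kept.sort()  # preserve the original row order

    return [features[i] for i in kept], [labels[i] for i in kept]


if __name__ == "__main__":
    # Toy training set: 10 dropouts (1) versus 90 completers (0).
    X = list(range(100))
    y = [1] * 10 + [0] * 90
    X_bal, y_bal = downsample(X, y)
    print(len(y_bal), y_bal.count(1), y_bal.count(0))
```

The approach is cheap because it discards data rather than generating synthetic cases, which fits the abstract's description of it as the least computationally taxing option.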
Affiliation(s)
- Julia Giesemann
  - Clinical Psychology and Psychotherapy, Department of Psychology, University of Trier, Trier, Germany
- Jaime Delgadillo
  - Clinical and Applied Psychology Unit, Department of Psychology, University of Sheffield, Sheffield, UK
- Brian Schwartz
  - Clinical Psychology and Psychotherapy, Department of Psychology, University of Trier, Trier, Germany
- Björn Bennemann
  - Clinical Psychology and Psychotherapy, Department of Psychology, University of Trier, Trier, Germany
- Wolfgang Lutz
  - Clinical Psychology and Psychotherapy, Department of Psychology, University of Trier, Trier, Germany
7
Yang K, Quddus M, Antoniou C. Developing a new real-time traffic safety management framework for urban expressways utilizing reinforcement learning tree. Accident Analysis and Prevention 2022; 178:106848. [PMID: 36174250 DOI: 10.1016/j.aap.2022.106848] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2022] [Revised: 08/21/2022] [Accepted: 09/15/2022] [Indexed: 06/16/2023]
Abstract
One of the main objectives of an urban traffic control system is to reduce the crash frequency and the loss caused by these crashes on urban expressways. Real-time crash risk prediction (RTCRP) is an essential technique to identify crash precursors so as to take proactive measures to smooth traffic fluctuations. In addition, automatic incident detection (AID) is another important approach for detecting an incident promptly so as to design countermeasures that reduce any negative impacts on traffic dynamics. With the introduction of disruptive technologies in transport, highly disaggregated large datasets have started to emerge for modelling, while existing modelling techniques utilized in RTCRP and AID may not be able to accurately predict traffic crashes in real time. Therefore, this paper proposes a state-of-the-art reinforcement learning tree (RLT) approach to develop an RTCRP model and an automatic crash detection (ACD) model similar to AID, and further utilizes it to build a real-time traffic safety management framework for urban expressways with the input of online traffic data streaming. Recorded traffic flow data and historical crash data were extracted and integrated to develop and implement both RTCRP models and ACD models. The prediction results were compared with the frequently used logistic regression (LR), support vector machine (SVM) and deep neural network (DNN), and a sensitivity analysis for variable effects was conducted. The results confirm that RLT outperforms LR, SVM and DNN in developing RTCRP and ACD models. At the cost of a 10% false-alarm rate, about 96% of the crashes were predicted or detected correctly by the proposed framework. The results also indicate that: i) collecting more data helps to improve the predictive performance, and a minimum sample size of approximately 20 observations per variable is reasonable for training RLT models; and ii) obtaining more factors is beneficial to improve the predictive performance. With the RLT approach, it was demonstrated that selected important variables also have the capability to provide reasonable predictive performance.
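The trade-off quoted above (a detection rate reported at a fixed 10% false-alarm rate) corresponds to picking an operating point on a classifier's score distribution. The sketch below assumes any model that outputs a crash-risk score per observation; it is not the RLT method itself, and `detection_rate_at_far` and the toy scores are hypothetical.

```python
def detection_rate_at_far(scores, labels, max_far=0.10):
    """Find the lowest score threshold whose false-alarm rate stays within max_far.

    scores : predicted risk scores (higher = more likely a crash)
    labels : 1 for crash, 0 for non-crash
    Returns (threshold, detection_rate, false_alarm_rate), or None if even the
    strictest threshold exceeds max_far.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    best = None
    # Lowering the threshold can only raise both rates, so scan from the
    # strictest threshold down and stop once the false-alarm budget is exceeded.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
        if fp / n_neg > max_far:
            break
        best = (threshold, tp / n_pos, fp / n_neg)
    return best


if __name__ == "__main__":
    # Toy scores (not from the study): 10 non-crash and 5 crash observations.
    neg = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
    pos = [0.85, 0.92, 0.93, 0.97, 0.99]
    scores = neg + pos
    labels = [0] * len(neg) + [1] * len(pos)
    print(detection_rate_at_far(scores, labels, max_far=0.10))
```

On the toy data the chosen threshold is 0.92, which detects four of the five crashes at a 10% false-alarm rate.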
Affiliation(s)
- Kui Yang
  - TUM School of Engineering and Design, Technical University of Munich, Arcisstraße 21, 80333 Munich, Germany
- Mohammed Quddus
  - Department of Civil and Environmental Engineering, Imperial College London, Exhibition Road, London SW7 2AZ, United Kingdom
- Constantinos Antoniou
  - TUM School of Engineering and Design, Technical University of Munich, Arcisstraße 21, 80333 Munich, Germany
8
A Systematic Review of Applications of Machine Learning and Other Soft Computing Techniques for the Diagnosis of Tropical Diseases. Trop Med Infect Dis 2022; 7:tropicalmed7120398. [PMID: 36548653 PMCID: PMC9787706 DOI: 10.3390/tropicalmed7120398] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/08/2022] [Revised: 11/17/2022] [Accepted: 11/21/2022] [Indexed: 11/29/2022] Open
Abstract
This systematic literature review aims to identify soft computing techniques currently utilized in diagnosing tropical febrile diseases and explore the data characteristics and features used for diagnoses, algorithm accuracy, and the limitations of current studies. The goal of this study is therefore to determine the extent to which soft computing techniques have positively impacted the quality of physician care and their effectiveness in tropical disease diagnosis. The study used PRISMA guidelines to define paper selection and inclusion/exclusion criteria. Most articles used ensemble techniques for classification, prediction, analysis, diagnosis, etc., rather than single machine learning techniques, followed by neural networks. The results identified dengue fever as the most studied disease, followed by malaria and tuberculosis. Accuracy was the most common metric used to evaluate the predictive capability of a classification model. The information presented within these studies benefits frontline healthcare workers who could depend on soft computing techniques for accurate diagnoses of tropical diseases. Although our research shows an increasing interest in using machine learning techniques for diagnosing tropical diseases, more studies are still needed. Hence, recommendations and directions for future research are proposed.
9
Li G, Liu X, Wang M, Yu T, Ren J, Wang Q. Predicting the functional outcomes of anti-LGI1 encephalitis using a random forest model. Acta Neurol Scand 2022; 146:137-143. [PMID: 35373330 DOI: 10.1111/ane.13619] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/06/2022] [Revised: 03/18/2022] [Accepted: 03/25/2022] [Indexed: 11/27/2022]
Abstract
OBJECTIVES To establish a model in order to predict the functional outcomes of patients with anti-leucine-rich glioma-inactivated 1 (LGI1) encephalitis and identify significant predictive factors using a random forest algorithm. METHODS Seventy-nine patients with confirmed LGI1 antibodies were retrospectively reviewed between January 2015 and July 2020. Clinical information was obtained from medical records and functional outcomes were followed up in interviews with patients or their relatives. Neurological functional outcome was assessed using a modified Rankin Scale (mRS), the cutoff of which was 2. The prognostic model was established using the random forest algorithm, which was subsequently compared with logistic regression analysis, Naive Bayes and Support vector machine (SVM) metrics based on the area under the curve (AUC) and the accuracy. RESULTS A total of 79 patients were included in the final analysis. After a median follow-up of 24 months (range, 8-60 months), 20 patients (25%) experienced poor functional outcomes. A random forest model consisting of 16 variables used to predict the poor functional outcomes of anti-LGI1 encephalitis was successfully constructed with an accuracy of 83% and an F1 score of 60%. In addition, the random forest algorithm demonstrated a more precise predictive performance for poor functional outcomes in patients with anti-LGI1 encephalitis compared with three other models (AUC, 0.90 vs 0.80 vs 0.70 vs 0.64). CONCLUSIONS The random forest model can predict poor functional outcomes of patients with anti-LGI1 encephalitis. This model was more accurate and reliable than the logistic regression, Naive Bayes, and SVM algorithm.
Affiliation(s)
- Gongfei Li
  - Department of Neurology Beijing Tiantan Hospital Capital Medical University Beijing China
- Xiao Liu
  - Department of Neurology Beijing Tiantan Hospital Capital Medical University Beijing China
- Minghui Wang
  - Beijing Tongren Eye Center, Beijing Tongren Hospital Capital Medical University Beijing China
- Tingting Yu
  - Department of Neurology Beijing Tiantan Hospital Capital Medical University Beijing China
- Jiechuan Ren
  - Department of Neurology Beijing Tiantan Hospital Capital Medical University Beijing China
  - China National Clinical Research Center for Neurological Diseases Beijing China
- Qun Wang
  - Department of Neurology Beijing Tiantan Hospital Capital Medical University Beijing China
  - China National Clinical Research Center for Neurological Diseases Beijing China
  - Beijing Institute for Brain Disorders Beijing China
10
Ba R, Geffard E, Douillard V, Simon F, Mesnard L, Vince N, Gourraud PA, Limou S. Surfing the Big Data Wave: Omics Data Challenges in Transplantation. Transplantation 2022; 106:e114-e125. [PMID: 34889882 DOI: 10.1097/tp.0000000000003992] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Indexed: 11/26/2022]
Abstract
In both research and care, patients, caregivers, and researchers are facing a leap forward in the quantity of data that are available for analysis and interpretation, marking the daunting "big data era." In the biomedical field, this quantitative shift refers mostly to the -omics that permit measuring and analyzing biological features of the same type as a whole. Omics studies have greatly impacted transplantation research and highlighted their potential to better understand transplant outcomes. Some studies have emphasized the contribution of omics in developing personalized therapies to avoid graft loss. However, integrating omics data remains challenging in terms of analytical processes. These data come from multiple sources. Consequently, they may contain biases and systematic errors that can be mistaken for relevant biological information. Normalization and batch-effect correction methods have been developed to tackle issues related to data quality and homogeneity. In addition, imputation methods handle data missingness. Importantly, the transplantation field represents a unique analytical context as the biological statistical unit is the donor-recipient pair, which brings additional complexity to the omics analyses. Strategies such as combined risk scores between 2 genomes taking into account genetic ancestry are emerging to better understand graft mechanisms and refine biological interpretations. The future of omics will be based on integrative biology, considering the analysis of the system as a whole and no longer the study of a single characteristic. In this review, we summarize advances in omics studies in transplantation and address the most challenging analytical issues regarding these approaches.
Collapse
Affiliation(s)
- Rokhaya Ba
- Université de Nantes, Centre Hospitalier Universitaire Nantes, Institute of Health and Medical Research, Centre de Recherche en Transplantation et Immunologie, UMR 1064, Institut de Transplantation Urologie-Néphrologie, Nantes, France
- Département Informatique et Mathématiques, Ecole Centrale de Nantes, Nantes, France
| | - Estelle Geffard
- Université de Nantes, Centre Hospitalier Universitaire Nantes, Institute of Health and Medical Research, Centre de Recherche en Transplantation et Immunologie, UMR 1064, Institut de Transplantation Urologie-Néphrologie, Nantes, France
| | - Venceslas Douillard
- Université de Nantes, Centre Hospitalier Universitaire Nantes, Institute of Health and Medical Research, Centre de Recherche en Transplantation et Immunologie, UMR 1064, Institut de Transplantation Urologie-Néphrologie, Nantes, France
| | - Françoise Simon
- Université de Nantes, Centre Hospitalier Universitaire Nantes, Institute of Health and Medical Research, Centre de Recherche en Transplantation et Immunologie, UMR 1064, Institut de Transplantation Urologie-Néphrologie, Nantes, France
- Mount Sinai School of Medicine, New York, NY
| | - Laurent Mesnard
- Urgences Néphrologiques et Transplantation Rénale, Hôpital Tenon, Assistance Publique-Hôpitaux de Paris, Paris, France
- Sorbonne Université, Paris, France
| | - Nicolas Vince
- Université de Nantes, Centre Hospitalier Universitaire Nantes, Institute of Health and Medical Research, Centre de Recherche en Transplantation et Immunologie, UMR 1064, Institut de Transplantation Urologie-Néphrologie, Nantes, France
| | - Pierre-Antoine Gourraud
- Université de Nantes, Centre Hospitalier Universitaire Nantes, Institute of Health and Medical Research, Centre de Recherche en Transplantation et Immunologie, UMR 1064, Institut de Transplantation Urologie-Néphrologie, Nantes, France
- Sophie Limou
- Université de Nantes, Centre Hospitalier Universitaire Nantes, Institute of Health and Medical Research, Centre de Recherche en Transplantation et Immunologie, UMR 1064, Institut de Transplantation Urologie-Néphrologie, Nantes, France
- Département Informatique et Mathématiques, Ecole Centrale de Nantes, Nantes, France
|
11
|
A Comparison of Three Airborne Laser Scanner Types for Species Identification of Individual Trees. SENSORS 2021; 22:s22010035. [PMID: 35009577 PMCID: PMC8747214 DOI: 10.3390/s22010035] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 10/27/2021] [Revised: 12/07/2021] [Accepted: 12/20/2021] [Indexed: 11/16/2022]
Abstract
Species identification is a critical factor for obtaining accurate forest inventories. This paper compares the same method of tree species identification (at the individual crown level) across three different types of airborne laser scanning systems (ALS): two linear lidar systems (monospectral and multispectral) and one single-photon lidar (SPL) system to ascertain whether current individual tree crown (ITC) species classification methods are applicable across all sensors. SPL is a new type of sensor that promises comparable point densities from higher flight altitudes, thereby increasing lidar coverage. Initial results indicate that the methods are indeed applicable across all of the three sensor types with broadly similar overall accuracies (Hardwood/Softwood, 83-90%; 12 species, 46-54%; 4 species, 68-79%), with SPL being slightly lower in all cases. The additional intensity features that are provided by multispectral ALS appear to be more beneficial to overall accuracy than the higher point density of SPL. We also demonstrate the potential contribution of lidar time-series data in improving classification accuracy (Hardwood/Softwood, 91%; 12 species, 58%; 4 species, 84%). Possible causes for lower SPL accuracy are (a) differences in the nature of the intensity features and (b) differences in first and second return distributions between the two linear systems and SPL. We also show that segmentation (and field-identified training crowns deriving from segmentation) that is performed on an initial dataset can be used on subsequent datasets with similar overall accuracy. To our knowledge, this is the first study to compare these three types of ALS systems for species identification at the individual tree level.
|
12
|
Manjang K, Yli-Harja O, Dehmer M, Emmert-Streib F. Limitations of Explainability for Established Prognostic Biomarkers of Prostate Cancer. Front Genet 2021; 12:649429. [PMID: 34367234 PMCID: PMC8340016 DOI: 10.3389/fgene.2021.649429] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/04/2021] [Accepted: 06/01/2021] [Indexed: 11/28/2022] Open
Abstract
High-throughput technologies do not only provide novel means for basic biological research but also for clinical applications in hospitals. For instance, the usage of gene expression profiles as prognostic biomarkers for predicting cancer progression has found widespread interest. Aside from predicting the progression of patients, it is generally believed that such prognostic biomarkers also provide valuable information about disease mechanisms and the underlying molecular processes that are causal for a disorder. However, the latter assumption has been challenged. In this paper, we study this problem for prostate cancer. Specifically, we investigate a large number of previously published prognostic signatures of prostate cancer based on gene expression profiles and show that none of these can provide unique information about the underlying disease etiology of prostate cancer. Hence, our analysis reveals that none of the studied signatures has a sensible biological meaning. Overall, this shows that all studied prognostic signatures are merely black-box models allowing sensible predictions of prostate cancer outcome but are not capable of providing causal explanations to enhance the understanding of prostate cancer.
Affiliation(s)
- Kalifa Manjang
- Predictive Society and Data Analytics Lab, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland
- Olli Yli-Harja
- Computational Systems Biology, Tampere University, Tampere, Finland
- Institute for Systems Biology, Seattle, WA, United States
- Faculty of Medicine and Health Technology, Institute of Biosciences and Medical Technology, Tampere University, Tampere, Finland
- Matthias Dehmer
- Department of Computer Science, Swiss Distance University of Applied Sciences, Brig, Switzerland
- Department of Mechatronics and Biomedical Computer Science, University for Health Sciences, Medical Informatics and Technology (UMIT), Hall, Austria
- College of Artificial Intelligence, Nankai University, Tianjin, China
- Frank Emmert-Streib
- Predictive Society and Data Analytics Lab, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland
- Faculty of Medicine and Health Technology, Institute of Biosciences and Medical Technology, Tampere University, Tampere, Finland
|
13
|
Cheng LH, Hsu TC, Lin C. Integrating ensemble systems biology feature selection and bimodal deep neural network for breast cancer prognosis prediction. Sci Rep 2021; 11:14914. [PMID: 34290286 PMCID: PMC8295302 DOI: 10.1038/s41598-021-92864-y] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 02/04/2021] [Accepted: 06/07/2021] [Indexed: 02/06/2023] Open
Abstract
Breast cancer is a heterogeneous disease. To guide proper treatment decisions for each patient, robust prognostic biomarkers, which allow reliable prognosis prediction, are necessary. Gene feature selection based on microarray data is an approach to discover potential biomarkers systematically. However, standard pure-statistical feature selection approaches often fail to incorporate prior biological knowledge and select genes that lack biological insights. Besides, due to the high dimensionality and low sample size properties of microarray data, selecting robust gene features is an intrinsically challenging problem. We hence combined systems biology feature selection with ensemble learning in this study, aiming to select genes with biological insights and robust prognostic predictive power. Moreover, to capture breast cancer's complex molecular processes, we adopted a multi-gene approach to predict the prognosis status using deep learning classifiers. We found that all ensemble approaches could improve feature selection robustness, wherein the hybrid ensemble approach led to the most robust result. Among all prognosis prediction models, the bimodal deep neural network (DNN) achieved the highest test performance, further verified by survival analysis. In summary, this study demonstrated the potential of combining ensemble learning and bimodal DNN in guiding precision medicine.
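One common way to quantify how robust a feature selection procedure is (an assumption here, since the abstract does not name its robustness metric) is the mean pairwise Jaccard index between the gene sets selected on different resamples. The sketch below uses hypothetical gene labels `g1` to `g4`.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(selections):
    """Average Jaccard index over all pairs of selected feature sets."""
    pairs = list(combinations(selections, 2))
    if not pairs:
        raise ValueError("need at least two selections to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical gene sets selected on three resamples of the same data.
runs = [{"g1", "g2", "g3"}, {"g1", "g2", "g4"}, {"g1", "g2", "g3"}]
stability = mean_pairwise_jaccard(runs)
```

A value near 1 means the resamples select almost the same genes; a value near 0 means the selection changes substantially from one resample to the next.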
Affiliation(s)
- Li-Hsin Cheng
- Department of Electrical Engineering, National Tsing Hua University, Hsinchu, 30013 Taiwan
- Te-Cheng Hsu
- Department of Electrical Engineering, National Tsing Hua University, Hsinchu, 30013 Taiwan
- Che Lin
- Department of Electrical Engineering and Graduate Institute of Communication Engineering, National Taiwan University, Taipei, 10617 Taiwan
|
14
|
Virginio F, Domingues V, da Silva LCG, Andrade L, Braghetto KR, Suesdek L. WingBank: A Wing Image Database of Mosquitoes. Front Ecol Evol 2021. [DOI: 10.3389/fevo.2021.660941] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022] Open
Abstract
Mosquito-borne diseases affect millions of people and cause thousands of deaths yearly. Vaccines have so far been insufficient to mitigate them, which makes mosquito control the most viable approach. But vector control depends on correct species identification and geographical assignment, and the taxonomic characters of mosquitoes are often inconspicuous to non-taxonomists, restricted to a single life stage, or even damaged. Thus, geometric morphometry, a low-cost and precise technique that has proven to be efficient for identifying subtle morphological dissimilarities, may contribute to the resolution of these types of problems. We have been applying this technique for more than 10 years and have accumulated thousands of wing images with their metadata. Therefore, the aim of this work was to develop a prototype of a platform for the storage of biological data related to wing morphometry, by means of a relational database and a web system named “WingBank.” In order to build the WingBank prototype, a multidisciplinary team gathered requirements, modeled and designed the relational database, and implemented a web platform. WingBank was designed to enforce data completeness, to ease data query, to leverage meta-studies, and to support applications of automatic identification of mosquitoes. Currently, the WingBank database contains data referring to 77 species belonging to 15 genera of Culicidae. Of the 13,287 wing records currently cataloged in the database, 2,138 have already been made available for use by third parties. As far as we know, this is the largest database of Culicidae wings in the world.
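To make the idea of enforcing data completeness in a relational database concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table and column names (`species`, `wing_record`, `collection_site`, `is_public`) are hypothetical and are not WingBank's actual schema; the point is only that `NOT NULL` and foreign-key constraints let the database reject incomplete records.

```python
import sqlite3

# In-memory database; the schema below is an illustrative assumption.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE species (
    id      INTEGER PRIMARY KEY,
    genus   TEXT NOT NULL,
    epithet TEXT NOT NULL,
    UNIQUE (genus, epithet)
);
CREATE TABLE wing_record (
    id              INTEGER PRIMARY KEY,
    species_id      INTEGER NOT NULL REFERENCES species(id),
    image_path      TEXT NOT NULL,
    collection_site TEXT NOT NULL,
    is_public       INTEGER NOT NULL DEFAULT 0 CHECK (is_public IN (0, 1))
);
""")

conn.execute("INSERT INTO species (genus, epithet) VALUES (?, ?)",
             ("Aedes", "aegypti"))
conn.execute(
    "INSERT INTO wing_record (species_id, image_path, collection_site, is_public) "
    "VALUES (1, 'wings/0001.png', 'site-A', 1)"
)
conn.execute(
    "INSERT INTO wing_record (species_id, image_path, collection_site) "
    "VALUES (1, 'wings/0002.png', 'site-A')"
)

# A record without its required metadata is refused by the database itself.
try:
    conn.execute(
        "INSERT INTO wing_record (species_id, image_path) "
        "VALUES (1, 'wings/0003.png')"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Count the records flagged as available to third parties.
public_count = conn.execute(
    "SELECT COUNT(*) FROM wing_record WHERE is_public = 1"
).fetchone()[0]
```

Placing the completeness rules in the schema, rather than only in the web layer, means every client of the database is held to the same requirements.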
|
15
|
Improving prediction for medical institution with limited patient data: Leveraging hospital-specific data based on multicenter collaborative research network. Artif Intell Med 2021; 113:102024. [PMID: 33685587 DOI: 10.1016/j.artmed.2021.102024] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 08/27/2020] [Revised: 11/25/2020] [Accepted: 01/18/2021] [Indexed: 12/18/2022]
Abstract
BACKGROUND AND OBJECTIVE Clinical decision support assisted by prediction models usually faces the challenges of limited clinical data and a lack of labels when the model is developed with data from a single medical institution. Accordingly, research on multicenter clinical collaborative networks, which can provide external medical data, has received increasing attention. With the increasing availability of machine learning techniques such as transfer learning, leveraging large-scale patient data from multiple hospitals to build data-driven predictive models with clinical application potential provides an alternative solution to address the problem of limited patient data. METHODS A multicenter hybrid semi-supervised transfer learning model (MHSTL) is proposed in this study on the basis of a unified common data model to ensure standardized representation of multicenter data. Then the hospital-specific features, along with the co-occurrence features across domains, are aligned through a representation learning architecture that is built based on deep neural networks and the newly proposed neural decision forest model. In this process, limited patient data from the target hospital, both labeled and unlabeled, are incorporated during the feature adaptation process, thereby contributing to better model performance. Without patient-level data sharing, the proposed model learning strategy, which overcomes feature misalignment and distribution divergence, enables the multi-source transfer learning process in the case of insufficient and unlabeled patient data at the target hospital. RESULTS The effectiveness of the proposed transfer learning model was evaluated on a collaborative research network of colorectal cancer patients in the US and China. The results demonstrate that the proposed model can achieve much better performance for predicting target risk with limited resources on patient data than baseline models. Better discrimination and calibration ability are also observed when sufficient labeled data are not available in the target hospital for prognosis prediction tasks. Further exploratory experiments show that the proposed approach exhibits good model generalizability regardless of the data heterogeneity. With the help of the SHapley Additive exPlanations for model interpretation, the effectiveness of incorporating hospital-specific features in the transfer learning model is shown. CONCLUSIONS In this study, the proposed method can develop prediction models from multiple source hospitals and exhibit good performance by leveraging cross-domain hospital-specific feature information, therefore enhancing the model prediction when applied to a single medical institution with limited patient data.
|
16
|
Manjang K, Tripathi S, Yli-Harja O, Dehmer M, Glazko G, Emmert-Streib F. Prognostic gene expression signatures of breast cancer are lacking a sensible biological meaning. Sci Rep 2021; 11:156. [PMID: 33420139 PMCID: PMC7794581 DOI: 10.1038/s41598-020-79375-y] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 07/16/2020] [Accepted: 12/03/2020] [Indexed: 12/28/2022] Open
Abstract
The identification of prognostic biomarkers for predicting cancer progression is an important problem for two reasons. First, such biomarkers find practical application in a clinical context for the treatment of patients. Second, interrogation of the biomarkers themselves is assumed to lead to novel insights of disease mechanisms and the underlying molecular processes that cause the pathological behavior. For breast cancer, many signatures based on gene expression values have been reported to be associated with overall survival. Consequently, such signatures have been used for suggesting biological explanations of breast cancer and drug mechanisms. In this paper, we demonstrate for a large number of breast cancer signatures that such an implication is not justified. Our approach eliminates systematically all traces of biological meaning of signature genes and shows that among the remaining genes, surrogate gene sets can be formed with indistinguishable prognostic prediction capabilities and opposite biological meaning. Hence, our results demonstrate that none of the studied signatures has a sensible biological interpretation or meaning with respect to disease etiology. Overall, this shows that prognostic signatures are black-box models with sensible predictions of breast cancer outcome but no value for revealing causal connections. Furthermore, we show that the number of such surrogate gene sets is not small but very large.
Affiliation(s)
- Kalifa Manjang
- Predictive Society and Data Analytics Lab, Tampere University, Tampere, Korkeakoulunkatu 10, 33720, Tampere, Finland
- Shailesh Tripathi
- Predictive Society and Data Analytics Lab, Tampere University, Tampere, Korkeakoulunkatu 10, 33720, Tampere, Finland
- Olli Yli-Harja
- Computational Systems Biology, Tampere University, Tampere, Korkeakoulunkatu 10, 33720, Tampere, Finland
- Institute for Systems Biology, Seattle, WA, USA
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, USA
- Matthias Dehmer
- Steyr School of Management, University of Applied Sciences Upper Austria, 4400 Steyr Campus, Wels, Austria
- College of Artificial Intelligence, Nankai University, Tianjin, 300350, China
- Department of Biomedical Computer Science and Mechatronics, UMIT-The Health and Life Science University, 6060 Hall in Tyrol, Innsbruck, Austria
- Galina Glazko
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, USA
- Frank Emmert-Streib
- Predictive Society and Data Analytics Lab, Tampere University, Tampere, Korkeakoulunkatu 10, 33720, Tampere, Finland
- Institute of Biosciences and Medical Technology, Tampere University, Tampere, Korkeakoulunkatu 10, 33720, Tampere, Finland
|
17
|
Lin PI, Moni MA, Gau SSF, Eapen V. Identifying Subgroups of Patients With Autism by Gene Expression Profiles Using Machine Learning Algorithms. Front Psychiatry 2021; 12:637022. [PMID: 34054599 PMCID: PMC8149626 DOI: 10.3389/fpsyt.2021.637022] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 12/02/2020] [Accepted: 04/13/2021] [Indexed: 12/22/2022] Open
Abstract
Objectives: The identification of subgroups of autism spectrum disorder (ASD) may partially remedy the problems of clinical heterogeneity to facilitate the improvement of clinical management. The current study aims to use machine learning algorithms to analyze microarray data to identify clusters with relatively homogeneous clinical features. Methods: The whole-genome gene expression microarray data were used to predict communication quotient (SCQ) scores against all probes to select differential expression regions (DERs). Gene set enrichment analysis was performed for DERs with a fold-change >2 to identify hub pathways that play a role in the severity of social communication deficits inherent to ASD. We then used two machine learning methods, random forest classification (RF) and support vector machine (SVM), to identify two clusters using DERs. Finally, we evaluated how accurately the clusters predicted language impairment. Results: A total of 191 DERs were initially identified, and 54 of them with a fold-change >2 were selected for the pathway analysis. Cholesterol biosynthesis and metabolism pathways appear to act as hubs that connect other trait-associated pathways to influence the severity of social communication deficits inherent to ASD. Both RF and SVM algorithms can yield a classification accuracy level >90% when all 191 DERs were analyzed. The ASD subtypes defined by the presence of language impairment, a strong indicator for prognosis, can be predicted by transcriptomic profiles associated with social communication deficits and cholesterol biosynthesis and metabolism. Conclusion: The results suggest that both RF and SVM are acceptable options for machine learning algorithms to identify ASD subgroups characterized by clinical homogeneity related to prognosis.
Affiliation(s)
- Ping-I Lin
- School of Psychiatry, The University of New South Wales, Sydney, NSW, Australia
- South Western Sydney Local Health District, Liverpool, NSW, Australia
- Mohammad Ali Moni
- School of Psychiatry, The University of New South Wales, Sydney, NSW, Australia
- Susan Shur-Fen Gau
- Department of Psychiatry, National Taiwan University Hospital and College of Medicine, Taipei, Taiwan
- Valsamma Eapen
- School of Psychiatry, The University of New South Wales, Sydney, NSW, Australia
- South Western Sydney Local Health District, Liverpool, NSW, Australia
|
18
|
Poppenberg KE, Tutino VM, Li L, Waqas M, June A, Chaves L, Jiang K, Jarvis JN, Sun Y, Snyder KV, Levy EI, Siddiqui AH, Kolega J, Meng H. Classification models using circulating neutrophil transcripts can detect unruptured intracranial aneurysm. J Transl Med 2020; 18:392. [PMID: 33059716 PMCID: PMC7565814 DOI: 10.1186/s12967-020-02550-2] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 03/09/2020] [Accepted: 09/27/2020] [Indexed: 12/14/2022] Open
Abstract
Background Intracranial aneurysms (IAs) are dangerous because of their potential to rupture. We previously found significant RNA expression differences in circulating neutrophils between patients with and without unruptured IAs and trained machine learning models to predict presence of IA using 40 neutrophil transcriptomes. Here, we aim to develop a predictive model for unruptured IA using neutrophil transcriptomes from a larger population and more robust machine learning methods. Methods Neutrophil RNA extracted from the blood of 134 patients (55 with IA, 79 IA-free controls) was subjected to next-generation RNA sequencing. In a randomly-selected training cohort (n = 94), the Least Absolute Shrinkage and Selection Operator (LASSO) selected transcripts, from which we constructed prediction models via 4 well-established supervised machine-learning algorithms (K-Nearest Neighbors, Random Forest, and Support Vector Machines with Gaussian and cubic kernels). We tested the models in the remaining samples (n = 40) and assessed model performance by receiver-operating-characteristic (ROC) curves. Real-time quantitative polymerase chain reaction (RT-qPCR) of 9 IA-associated genes was used to verify gene expression in a subset of 49 neutrophil RNA samples. We also examined the potential influence of demographics and comorbidities on model prediction. Results Feature selection using LASSO in the training cohort identified 37 IA-associated transcripts. Models trained using these transcripts had a maximum accuracy of 90% in the testing cohort. The testing performance across all methods had an average area under ROC curve (AUC) = 0.97, an improvement over our previous models. The Random Forest model performed best across both training and testing cohorts. RT-qPCR confirmed expression differences in 7 of 9 genes tested. Gene ontology and IPA network analyses performed on the 37 model genes reflected dysregulated inflammation, cell signaling, and apoptosis processes. 
In our data, demographics and comorbidities did not affect model performance. Conclusions We improved upon our previous IA prediction models based on circulating neutrophil transcriptomes by increasing sample size and by implementing LASSO and more robust machine learning methods. Future studies are needed to validate these models in larger cohorts and further investigate effect of covariates.
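For readers unfamiliar with the area under the ROC curve used to assess these models, the sketch below computes it from its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as one half. The labels and scores are made-up toy values, not data from the study, and `roc_auc` is an illustrative helper rather than the authors' code.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 class labels (1 = positive, e.g. IA present).
    scores: iterable of model scores, higher meaning more likely positive.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example: one positive (0.4) scores below one negative (0.5).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(labels, scores)
```

Here 8 of the 9 positive-negative pairs are ordered correctly, so the toy AUC is 8/9; an AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation.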
Affiliation(s)
- Kerry E Poppenberg
- Canon Stroke and Vascular Research Center, Clinical and Translational Research Center, 875 Ellicott Street, Buffalo, NY, 14214, USA
- Department of Neurosurgery, Jacobs School of Medicine and Biomedical Sciences, Buffalo, USA
- Vincent M Tutino
- Canon Stroke and Vascular Research Center, Clinical and Translational Research Center, 875 Ellicott Street, Buffalo, NY, 14214, USA
- Department of Biomedical Engineering, University of Buffalo, Buffalo, USA
- Department of Neurosurgery, Jacobs School of Medicine and Biomedical Sciences, Buffalo, USA
- Department of Pathology and Anatomical Sciences, Jacobs School of Medicine and Biomedical Sciences, Buffalo, USA
- Lu Li
- Department of Computer Science and Engineering, University of Buffalo, Buffalo, USA
- Muhammad Waqas
- Department of Neurosurgery, Jacobs School of Medicine and Biomedical Sciences, Buffalo, USA
- Department of Neurology, Jacobs School of Medicine and Biomedical Sciences, Buffalo, USA
- Armond June
- Department of Pathology and Anatomical Sciences, Jacobs School of Medicine and Biomedical Sciences, Buffalo, USA
- Lee Chaves
- Department of Internal Medicine, Jacobs School of Medicine and Biomedical Sciences, Buffalo, USA
- Kaiyu Jiang
- Genetics, Genomics, and Bioinformatics Program, Jacobs School of Medicine and Biomedical Sciences, Buffalo, USA
- James N Jarvis
- Genetics, Genomics, and Bioinformatics Program, Jacobs School of Medicine and Biomedical Sciences, Buffalo, USA
- Department of Pediatrics, Jacobs School of Medicine and Biomedical Sciences, Buffalo, USA
- Yijun Sun
- Genetics, Genomics, and Bioinformatics Program, Jacobs School of Medicine and Biomedical Sciences, Buffalo, USA
- Department of Microbiology and Immunology, Jacobs School of Medicine and Biomedical Sciences, Buffalo, USA
- Kenneth V Snyder
- Canon Stroke and Vascular Research Center, Clinical and Translational Research Center, 875 Ellicott Street, Buffalo, NY, 14214, USA
- Department of Neurosurgery, Jacobs School of Medicine and Biomedical Sciences, Buffalo, USA
- Department of Radiology, Jacobs School of Medicine and Biomedical Sciences, Buffalo, USA
- Department of Neurology, Jacobs School of Medicine and Biomedical Sciences, Buffalo, USA
- Elad I Levy
- Canon Stroke and Vascular Research Center, Clinical and Translational Research Center, 875 Ellicott Street, Buffalo, NY, 14214, USA
- Department of Neurosurgery, Jacobs School of Medicine and Biomedical Sciences, Buffalo, USA
- Department of Radiology, Jacobs School of Medicine and Biomedical Sciences, Buffalo, USA
- Adnan H Siddiqui
- Canon Stroke and Vascular Research Center, Clinical and Translational Research Center, 875 Ellicott Street, Buffalo, NY, 14214, USA
- Department of Neurosurgery, Jacobs School of Medicine and Biomedical Sciences, Buffalo, USA
- Department of Radiology, Jacobs School of Medicine and Biomedical Sciences, Buffalo, USA
- John Kolega
- Canon Stroke and Vascular Research Center, Clinical and Translational Research Center, 875 Ellicott Street, Buffalo, NY, 14214, USA
- Department of Pathology and Anatomical Sciences, Jacobs School of Medicine and Biomedical Sciences, Buffalo, USA
- Hui Meng
- Canon Stroke and Vascular Research Center, Clinical and Translational Research Center, 875 Ellicott Street, Buffalo, NY, 14214, USA
- Department of Biomedical Engineering, University of Buffalo, Buffalo, USA
- Department of Neurosurgery, Jacobs School of Medicine and Biomedical Sciences, Buffalo, USA
- Department of Mechanical & Aerospace Engineering, University At Buffalo, Buffalo, NY, USA
|
19
|
Narayana PA, Coronado I, Sujit SJ, Wolinsky JS, Lublin FD, Gabr RE. Deep-Learning-Based Neural Tissue Segmentation of MRI in Multiple Sclerosis: Effect of Training Set Size. J Magn Reson Imaging 2020; 51:1487-1496. [PMID: 31625650 PMCID: PMC7165037 DOI: 10.1002/jmri.26959] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 08/14/2019] [Revised: 09/19/2019] [Accepted: 09/19/2019] [Indexed: 12/22/2022] Open
Abstract
BACKGROUND The dependence of deep-learning (DL)-based segmentation accuracy of brain MRI on the training size is not known. PURPOSE To determine the required training size for a desired accuracy in brain MRI segmentation in multiple sclerosis (MS) using DL. STUDY TYPE Retrospective analysis of MRI data acquired as part of a multicenter clinical trial. STUDY POPULATION In all, 1008 patients with clinically definite MS. FIELD STRENGTH/SEQUENCE MRIs were acquired at 1.5T and 3T scanners manufactured by GE, Philips, and Siemens with dual turbo spin echo, FLAIR, and T1-weighted turbo spin echo sequences. ASSESSMENT Segmentation results using an automated analysis pipeline and validated by two neuroimaging experts served as the ground truth. A DL model, based on a fully convolutional neural network, was trained separately using 16 different training sizes. The segmentation accuracy as a function of the training size was determined. These data were fitted to the learning curve for estimating the required training size for desired accuracy. STATISTICAL TESTS The performance of the network was evaluated by calculating the Dice similarity coefficient (DSC), and lesion true-positive and false-positive rates. RESULTS The DSC for lesions showed much stronger dependency on the sample size than gray matter (GM), white matter (WM), and cerebrospinal fluid (CSF). When the training size was increased from 10 to 800 the DSC values varied from 0.00 to 0.86 ± 0.016 for T2 lesions, 0.87 ± 009 to 0.94 ± 0.004 for GM, 0.86 ± 0.08 to 0.94 ± 0.005 for WM, and 0.91 ± 0.009 to 0.96 ± 0.003 for CSF. DATA CONCLUSION Excellent segmentation was achieved with a training size as small as 10 image volumes for GM, WM, and CSF. In contrast, a training size of at least 50 image volumes was necessary for adequate lesion segmentation. LEVEL OF EVIDENCE 1 Technical Efficacy Stage: 1 J. Magn. Reson. Imaging 2020;51:1487-1496.
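The Dice similarity coefficient reported above is defined as DSC = 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a ground-truth mask B. A minimal sketch, representing each mask as a set of voxel indices (toy indices, not study data) and treating two empty masks as a perfect match by convention (an assumption), could look like this:

```python
def dice(pred_voxels, truth_voxels):
    """Dice similarity coefficient between two sets of voxel indices."""
    pred, truth = set(pred_voxels), set(truth_voxels)
    denom = len(pred) + len(truth)
    if denom == 0:
        return 1.0  # convention assumed here: two empty masks agree fully
    return 2 * len(pred & truth) / denom


# Toy masks: 4 predicted voxels, 4 true voxels, 2 in common.
pred = {1, 2, 3, 4}
truth = {3, 4, 5, 6}
dsc = dice(pred, truth)
```

With no overlap at all the coefficient is 0, which is the situation a lesion DSC of 0.00 describes; identical masks give 1.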
Affiliation(s)
- Ponnada A. Narayana
- Department of Diagnostic and Interventional Imaging, McGovern Medical School, University of Texas Health Science Center, Houston, Texas, USA
- Ivan Coronado
- Department of Diagnostic and Interventional Imaging, McGovern Medical School, University of Texas Health Science Center, Houston, Texas, USA
- Sheeba J. Sujit
- Department of Diagnostic and Interventional Imaging, McGovern Medical School, University of Texas Health Science Center, Houston, Texas, USA
- Jerry S. Wolinsky
- Department of Neurology, McGovern Medical School, University of Texas Health Science Center, Houston, Texas, USA
- Fred D. Lublin
- Department of Neurology, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Refaat E. Gabr
- Department of Diagnostic and Interventional Imaging, McGovern Medical School, University of Texas Health Science Center, Houston, Texas, USA
|
20
|
A multicenter random forest model for effective prognosis prediction in collaborative clinical research network. Artif Intell Med 2020; 103:101814. [PMID: 32143809 DOI: 10.1016/j.artmed.2020.101814] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 09/16/2019] [Revised: 02/04/2020] [Accepted: 02/04/2020] [Indexed: 12/17/2022]
Abstract
BACKGROUND The accuracy of a prognostic prediction model has become an essential aspect of the quality and reliability of the health-related decisions made by clinicians in modern medicine. Unfortunately, individual institutions often lack sufficient samples, which might not provide sufficient statistical power for models. One mitigation is to expand data collection from a single institution to multiple centers to collectively increase the sample size. However, sharing sensitive biomedical data for research involves complicated issues. Machine learning models such as random forests (RF), though they are commonly used and achieve good performances for prognostic prediction, usually suffer worse performance under multicenter privacy-preserving data mining scenarios compared to a centrally trained version. METHODS AND MATERIALS In this study, a multicenter random forest prognosis prediction model is proposed that enables federated clinical data mining from horizontally partitioned datasets. By using a novel data enhancement approach based on a differentially private generative adversarial network customized to clinical prognosis data, the proposed model is able to provide a multicenter RF model with performances on par with, or even better than, centrally trained RF but without the need to aggregate the raw data. Moreover, our model also incorporates an importance ranking step designed for feature selection without sharing patient-level information. RESULT The proposed model was evaluated on colorectal cancer datasets from the US and China. Two groups of datasets with different levels of heterogeneity within the collaborative research network were selected. First, we compare the performance of the distributed random forest model under different privacy parameters with different percentages of enhancement datasets and validate the effectiveness and plausibility of our approach.
Then, we compare the discrimination and calibration ability of the proposed multicenter random forest with a centrally trained random forest model and other tree-based classifiers as well as some commonly used machine learning methods. The results show that the proposed model can provide better prediction performance in terms of discrimination and calibration ability than the centrally trained RF model or the other candidate models while following the privacy-preserving rules in both groups. Additionally, good discrimination and calibration ability are shown on the simplified model based on the feature importance ranking in the proposed approach. CONCLUSION The proposed random forest model exhibits ideal prediction capability using multicenter clinical data and overcomes the performance limitation arising from privacy guarantees. It can also provide feature importance ranking across institutions without pooling the data at a central site. This study offers a practical solution for building a prognosis prediction model in the collaborative clinical research network and solves practical issues in real-world applications of medical artificial intelligence.
|
21
|
Gonzalez-Dias P, Lee EK, Sorgi S, de Lima DS, Urbanski AH, Silveira EL, Nakaya HI. Methods for predicting vaccine immunogenicity and reactogenicity. Hum Vaccin Immunother 2019; 16:269-276. [PMID: 31869262 PMCID: PMC7062420 DOI: 10.1080/21645515.2019.1697110] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Received: 09/16/2019] [Revised: 11/13/2019] [Accepted: 11/18/2019] [Indexed: 12/28/2022]
Abstract
Subjects receiving the same vaccine often show different levels of immune response, and some may even experience adverse side effects. Systems vaccinology can combine omics data and machine learning techniques to obtain highly predictive signatures of vaccine immunogenicity and reactogenicity. Currently, several machine learning methods are already available to researchers with no background in bioinformatics. Here we describe the four main steps to discover markers of vaccine immunogenicity and reactogenicity: (1) Preparing the data; (2) Selecting the vaccinees and relevant genes; (3) Choosing the algorithm; (4) Blind testing your model. With the increasing number of Systems Vaccinology datasets being generated, we expect that the accuracy and robustness of signatures of vaccine reactogenicity and immunogenicity will significantly improve.
Affiliation(s)
- Patrícia Gonzalez-Dias
  - Department of Clinical and Toxicological Analyses, School of Pharmaceutical Sciences, University of São Paulo, São Paulo, Brazil
- Eva K. Lee
  - The Center for Operations Research in Medicine and HealthCare, Georgia Institute of Technology, Atlanta, GA, USA
- Sara Sorgi
  - Department of Medical Biotechnologies, University of Siena, Siena, Italy
- Diógenes S. de Lima
  - Department of Clinical and Toxicological Analyses, School of Pharmaceutical Sciences, University of São Paulo, São Paulo, Brazil
- Alysson H. Urbanski
  - Department of Clinical and Toxicological Analyses, School of Pharmaceutical Sciences, University of São Paulo, São Paulo, Brazil
- Eduardo Lv Silveira
  - Department of Clinical and Toxicological Analyses, School of Pharmaceutical Sciences, University of São Paulo, São Paulo, Brazil
- Helder I. Nakaya
  - Department of Clinical and Toxicological Analyses, School of Pharmaceutical Sciences, University of São Paulo, São Paulo, Brazil
  - Scientific Platform Pasteur, University of São Paulo, São Paulo, Brazil
|
22
|
Haveman ME, Van Putten MJAM, Hom HW, Eertman-Meyer CJ, Beishuizen A, Tjepkema-Cloostermans MC. Predicting outcome in patients with moderate to severe traumatic brain injury using electroencephalography. Crit Care 2019; 23:401. [PMID: 31829226 PMCID: PMC6907281 DOI: 10.1186/s13054-019-2656-6] [Citation(s) in RCA: 34] [Impact Index Per Article: 5.7] [Received: 07/12/2019] [Accepted: 10/21/2019] [Indexed: 12/23/2022]
Abstract
BACKGROUND Better outcome prediction could assist in reliable quantification and classification of traumatic brain injury (TBI) severity to support clinical decision-making. We developed a multifactorial model combining quantitative electroencephalography (qEEG) measurements and clinically relevant parameters as proof of concept for outcome prediction of patients with moderate to severe TBI. METHODS Continuous EEG measurements were performed during the first 7 days of ICU admission. Patient outcome at 12 months was dichotomized based on the Extended Glasgow Outcome Score (GOSE) as poor (GOSE 1-2) or good (GOSE 3-8). Twenty-three qEEG features were extracted. Prediction models were created using a Random Forest classifier based on qEEG features, age, and mean arterial blood pressure (MAP) at 24, 48, 72, and 96 h after TBI and combinations of two time intervals. After optimization of the models, we added parameters from the International Mission for Prognosis And Clinical Trial Design (IMPACT) predictor, consisting of clinical, CT, and laboratory parameters at admission. Furthermore, we compared our best models to the online IMPACT predictor. RESULTS Fifty-seven patients with moderate to severe TBI were included and divided into a training set (n = 38) and a validation set (n = 19). Our best model included eight qEEG parameters and MAP at 72 and 96 h after TBI, age, and nine other IMPACT parameters. This model had high predictive ability for poor outcome on both the training set using leave-one-out (area under the receiver operating characteristic curve (AUC) = 0.94, specificity 100%, sensitivity 75%) and validation set (AUC = 0.81, specificity 75%, sensitivity 100%). The IMPACT predictor independently predicted both groups with an AUC of 0.74 (specificity 81%, sensitivity 65%) and 0.84 (sensitivity 88%, specificity 73%), respectively. CONCLUSIONS Our study shows the potential of multifactorial Random Forest models using qEEG parameters to predict outcome in patients with moderate to severe TBI.
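The abstract above reports AUC, sensitivity, and specificity for a binary outcome. As a rough illustration of how these three figures relate to a model's predicted scores, here is a minimal sketch in plain Python. The labels, scores, and the 0.5 threshold are made-up illustrative values, not data from the study, and the function names are hypothetical.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = poor outcome)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative (invented) labels and predicted probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} AUC={auc(y_true, scores):.4f}")
```

Sensitivity and specificity depend on the chosen threshold, whereas the AUC summarizes ranking quality across all thresholds, which is one reason the two kinds of figures are usually reported together.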
Affiliation(s)
- Marjolein E Haveman
  - Clinical Neurophysiology Group, University of Twente, Drienerlolaan 5, 7522 NB, Enschede, the Netherlands
  - Department of Neurology and Clinical Neurophysiology (C2), Medisch Spectrum Twente, Koningsplein 1, 7512 KZ, Enschede, the Netherlands
- Michel J A M Van Putten
  - Clinical Neurophysiology Group, University of Twente, Drienerlolaan 5, 7522 NB, Enschede, the Netherlands
  - Department of Neurology and Clinical Neurophysiology (C2), Medisch Spectrum Twente, Koningsplein 1, 7512 KZ, Enschede, the Netherlands
- Harold W Hom
  - Intensive Care Center, Medisch Spectrum Twente, Koningsplein 1, 7512 KZ, Enschede, the Netherlands
- Carin J Eertman-Meyer
  - Department of Neurology and Clinical Neurophysiology (C2), Medisch Spectrum Twente, Koningsplein 1, 7512 KZ, Enschede, the Netherlands
- Albertus Beishuizen
  - Intensive Care Center, Medisch Spectrum Twente, Koningsplein 1, 7512 KZ, Enschede, the Netherlands
- Marleen C Tjepkema-Cloostermans
  - Clinical Neurophysiology Group, University of Twente, Drienerlolaan 5, 7522 NB, Enschede, the Netherlands
  - Department of Neurology and Clinical Neurophysiology (C2), Medisch Spectrum Twente, Koningsplein 1, 7512 KZ, Enschede, the Netherlands
|
23
|
Tian Q, Zou J, Fang Y, Yu Z, Tang J, Song Y, Fan S. A Hybrid Ensemble Approach for Identifying Robust Differentially Methylated Loci in Pan-Cancers. Front Genet 2019; 10:774. [PMID: 31543899 PMCID: PMC6739624 DOI: 10.3389/fgene.2019.00774] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 05/10/2019] [Accepted: 07/23/2019] [Indexed: 12/14/2022]
Abstract
DNA methylation is a widely investigated epigenetic mark that plays a vital role in tumorigenesis. Advancements in high-throughput assays, such as the Infinium 450K platform, provide genome-scale DNA methylation landscapes at single-CpG locus resolution, and the identification of differentially methylated loci has become an insightful approach to deepen our understanding of cancers. However, the extreme imbalance between the numbers of samples and loci (approximately 1:1,000) makes it difficult to explore differential methylation between diseased and normal samples. In this article, a hybrid approach based on ensemble feature selection for identifying differentially methylated loci (HyDML) was proposed by incorporating instance perturbation and multiple function models. Experiments on data from The Cancer Genome Atlas showed that HyDML not only achieved effective DML identification, but also outperformed the single-feature selection approach in terms of classification performance and the robustness of feature selection. The intensive analysis of the DML indicated that different types of cancers share common patterns, and that the stable DML shared across cancers have great potential as biomarkers, which may strengthen the confidence of domain experts in undertaking biological validation.
Affiliation(s)
- Qi Tian
  - School of Automation Engineering, University of Electronic Science and Technology of China
- Jianxiao Zou
  - School of Automation Engineering, University of Electronic Science and Technology of China
- Yuan Fang
  - School of Automation Engineering, University of Electronic Science and Technology of China
- Zhongli Yu
  - School of Automation Engineering, University of Electronic Science and Technology of China
- Jianxiong Tang
  - School of Automation Engineering, University of Electronic Science and Technology of China
- Ying Song
  - School of Automation Engineering, University of Electronic Science and Technology of China
- Shicai Fan
  - School of Automation Engineering, University of Electronic Science and Technology of China
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
|
24
|
Zullig LL, Jazowski SA, Wang TY, Hellkamp A, Wojdyla D, Thomas L, Egbuonu-Davis L, Beal A, Bosworth HB. Novel application of approaches to predicting medication adherence using medical claims data. Health Serv Res 2019; 54:1255-1262. [PMID: 31429471 DOI: 10.1111/1475-6773.13200] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Indexed: 02/05/2023]
Abstract
OBJECTIVE To compare predictive analytic approaches to characterize medication nonadherence and determine under which circumstances each method may be best applied. DATA SOURCES/STUDY SETTING Medicare Parts A, B, and D claims from 2007 to 2013. STUDY DESIGN We evaluated three statistical techniques to predict statin adherence (proportion of days covered [PDC ≥ 80 percent]) in the year following discharge: standard logistic regression with backward selection of covariates, least absolute shrinkage and selection operator (LASSO), and random forest. We used the C-index to assess model discrimination and decile plots comparing predicted values to observed event rates to evaluate model performance. DATA EXTRACTION We identified 11 969 beneficiaries with an acute myocardial infarction (MI)-related admission from 2007 to 2012, who filled a statin prescription at, or shortly after, discharge. PRINCIPAL FINDINGS In all models, prior statin use was the most important predictor of future adherence (OR = 3.65, 95% CI: 3.34-3.98; OR = 3.55). Although the LASSO regression model selected nearly 90 percent of all candidate predictors, all three analytic approaches had moderate discrimination (C-index ranging from 0.664 to 0.673). CONCLUSIONS Although none of the models emerged as clearly superior, predictive analytics could proactively determine which patients are at risk of nonadherence, thus allowing for timely engagement in adherence-improving interventions.
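The adherence outcome in the abstract above is defined as a proportion of days covered (PDC) of at least 80 percent. The following minimal sketch shows one simple way such a measure can be computed from pharmacy fills. It assumes fills are given as (start day, days of supply) pairs measured in days from discharge and that overlapping supply is counted only once; the study's exact PDC rules are not given in the abstract, and the fill data below are invented for illustration.

```python
def proportion_of_days_covered(fills, period_days):
    """Fraction of days in [0, period_days) covered by at least one fill.

    fills: iterable of (start_day, days_supply) tuples, with days counted
    from the start of the observation period. Overlapping supply is counted
    once (an assumed simplification; some definitions shift overlaps forward).
    """
    covered = set()
    for start_day, days_supply in fills:
        for day in range(start_day, start_day + days_supply):
            if 0 <= day < period_days:
                covered.add(day)
    return len(covered) / period_days


def is_adherent(pdc, threshold=0.80):
    """Adherent if PDC meets the 80 percent cutoff named in the abstract."""
    return pdc >= threshold


# Invented example: four 90-day fills in the year after discharge.
fills = [(0, 90), (85, 90), (200, 90), (300, 90)]
pdc = proportion_of_days_covered(fills, period_days=365)
print(f"PDC = {pdc:.3f}, adherent = {is_adherent(pdc)}")
```

In this example the second fill overlaps the first by five days and the last fill runs past the end of the year, so only 330 of 365 days are counted as covered.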
Affiliation(s)
- Leah L Zullig
  - Center of Innovation to Accelerate Discovery and Practice Transformation, Durham Veterans Affairs Health Care System, Durham, North Carolina
  - Department of Population Health Sciences, Duke University, Durham, North Carolina
- Shelley A Jazowski
  - Department of Population Health Sciences, Duke University, Durham, North Carolina
  - Department of Health Policy and Management, University of North Carolina, Chapel Hill, North Carolina
- Tracy Y Wang
  - Duke Clinical Research Institute, Duke University, Durham, North Carolina
- Anne Hellkamp
  - Duke Clinical Research Institute, Duke University, Durham, North Carolina
- Daniel Wojdyla
  - Duke Clinical Research Institute, Duke University, Durham, North Carolina
- Laine Thomas
  - Duke Clinical Research Institute, Duke University, Durham, North Carolina
  - Department of Biostatistics and Bioinformatics, Duke University, Durham, North Carolina
- Lisa Egbuonu-Davis
  - Global Patient Centered Outcomes and Solutions, Sanofi, New York, New York
- Anne Beal
  - Global Patient Centered Outcomes and Solutions, Sanofi, New York, New York
- Hayden B Bosworth
  - Center of Innovation to Accelerate Discovery and Practice Transformation, Durham Veterans Affairs Health Care System, Durham, North Carolina
  - Department of Population Health Sciences, Duke University, Durham, North Carolina
  - School of Nursing, Duke University, Durham, North Carolina
  - Department of Psychiatry and Behavioral Sciences, Duke University, Durham, North Carolina
  - Department of Medicine, Duke University, Durham, North Carolina
|
25
|
Tang J, He D, Yang P, He J, Zhang Y. Genome-wide expression profiling of glioblastoma using a large combined cohort. Sci Rep 2018; 8:15104. [PMID: 30305647 PMCID: PMC6180049 DOI: 10.1038/s41598-018-33323-z] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Received: 05/31/2018] [Accepted: 09/24/2018] [Indexed: 01/12/2023]
Abstract
Glioblastomas (GBMs) are the most common intrinsic brain tumors in adults and are almost universally fatal. Despite the progress made in surgery, chemotherapy, and radiation over the past decades, the prognosis of patients with GBM remains poor and their average survival time is still short. Discovering robust gene signatures toward better understanding of the complex molecular mechanisms leading to GBM is an important prerequisite to the identification of novel and more effective therapeutic strategies. Herein, a comprehensive study of genome-scale mRNA expression data was performed by combining GBM and normal tissue samples from 48 studies. A total of 147 robust gene signatures were identified as significantly differentially expressed between GBM and normal samples, among which 100 (68%) genes were reported to be closely associated with GBM in previous publications. Moreover, function annotation analysis based on these 147 robust DEGs showed that certain deregulated gene expression programs (e.g., cell cycle, immune response and p53 signaling pathway) were associated with GBM development, and PPI network analysis revealed that three novel hub genes (RFC4, ZWINT and TYMS) play an important role in GBM development. Furthermore, survival analysis based on the TCGA GBM data demonstrated that 38 robust DEGs significantly affect the prognosis of GBM in terms of OS (p < 0.05). These findings provide new insights into the molecular mechanisms underlying GBM and suggest that the 38 robust DEGs could be potential targets for diagnosis and treatment.
Affiliation(s)
- Jing Tang
  - Innovative Drug Research and Bioinformatics Group, School of Pharmaceutical Sciences and Innovative Drug Research Centre, Chongqing University, Chongqing, 401331, China
  - Materia Medica Development Group, Institute of Medicinal Chemistry, Lanzhou University School of Pharmacy, Lanzhou, 730000, China
- Dian He
  - Materia Medica Development Group, Institute of Medicinal Chemistry, Lanzhou University School of Pharmacy, Lanzhou, 730000, China
  - Gansu Institute for Drug Control, Lanzhou, 730070, China
- Pingrong Yang
  - Materia Medica Development Group, Institute of Medicinal Chemistry, Lanzhou University School of Pharmacy, Lanzhou, 730000, China
  - Gansu Institute for Drug Control, Lanzhou, 730070, China
- Junquan He
  - Materia Medica Development Group, Institute of Medicinal Chemistry, Lanzhou University School of Pharmacy, Lanzhou, 730000, China
  - Gansu Institute for Drug Control, Lanzhou, 730070, China
- Yang Zhang
  - Innovative Drug Research and Bioinformatics Group, School of Pharmaceutical Sciences and Innovative Drug Research Centre, Chongqing University, Chongqing, 401331, China
  - Materia Medica Development Group, Institute of Medicinal Chemistry, Lanzhou University School of Pharmacy, Lanzhou, 730000, China
|
26
|
Affiliation(s)
- Meng Pan
  - Department of Optoelectronic Engineering, College of Science and Engineering, Jinan University, Guangzhou, Guangdong, PR China
- Jie Zhang
  - Department of Physics, College of Science and Engineering, Jinan University, Guangzhou, Guangdong, PR China
|
27
|
Chen J, Guest PC, Schwarz E. The Utility of Multiplex Assays for Identification of Proteomic Signatures in Psychiatry. Adv Exp Med Biol 2017; 974:131-138. [DOI: 10.1007/978-3-319-52479-5_8] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 12/15/2022]
|
28
|
Chen J, Schwarz E. Opportunities and Challenges of Multiplex Assays: A Machine Learning Perspective. Methods Mol Biol 2017; 1546:115-122. [PMID: 27896760 DOI: 10.1007/978-1-4939-6730-8_7] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Indexed: 12/20/2022]
Abstract
Multiplex assays that allow the simultaneous measurement of multiple analytes in small sample quantities have developed into a widely used technology. Their implementation spans across multiple assay systems and can provide readouts of similar quality as the respective single-plex measures, albeit at far higher throughput. Multiplex assay systems are therefore an important element for biomarker discovery and development strategies but analysis of the derived data can face substantial challenges that may limit the possibility of identifying meaningful biological markers. This chapter gives an overview of opportunities and challenges of multiplexed biomarker analysis, in particular from the perspective of machine learning aimed at identification of predictive biological signatures.
Affiliation(s)
- Junfang Chen
  - Department of Psychiatry and Psychotherapy, Central Institute of Mental Health, Medical Faculty Mannheim, Heidelberg University, J 5, Mannheim, 68159, Germany
- Emanuel Schwarz
  - Department of Psychiatry and Psychotherapy, Central Institute of Mental Health, Medical Faculty Mannheim, Heidelberg University, J 5, Mannheim, 68159, Germany
|
29
|
Abstract
Developing improved approaches for diagnosis, treatment, and prevention of diseases is a major goal of biomedical research. Therefore, the discovery of biomarker signatures from high-throughput "omics" data is an active research topic in the field of bioinformatics and systems medicine. A major issue is the low reproducibility and the limited biological interpretability of candidate biomarker signatures identified from high-throughput data. This impedes the translation of discovered biomarker signatures into clinical applications. Currently, much focus is placed on developing strategies to improve reproducibility and interpretability. Researchers have fruitfully started to incorporate prior knowledge derived from pathways and molecular networks into the process of biomarker identification. In this chapter, after giving a general introduction to the problem of disease classification and biomarker discovery, we will review two types of network-assisted approaches: (1) approaches inferring activity scores for specific pathways which are subsequently used for classification and (2) approaches identifying subnetworks or modules of molecular networks by differential network analysis which can serve as biomarker signatures.
|
30
|
Laas E, Mallon P, Duhoux FP, Hamidouche A, Rouzier R, Reyal F. Low Concordance between Gene Expression Signatures in ER Positive HER2 Negative Breast Carcinoma Could Impair Their Clinical Application. PLoS One 2016; 11:e0148957. [PMID: 26895349 PMCID: PMC4760978 DOI: 10.1371/journal.pone.0148957] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 10/25/2015] [Accepted: 01/25/2016] [Indexed: 12/22/2022]
Abstract
BACKGROUND Numerous prognostic gene expression signatures have been recently described. Among the signatures there is variation in the constituent genes that are utilized. We aim to evaluate prognostic concordance among eight gene expression signatures, on a large dataset of ER positive HER2 negative breast cancers. METHODS We analysed the performance of eight gene expression signatures on six different datasets of ER+ HER2- breast cancers. Survival analyses were performed using the Kaplan-Meier estimate of the survival function. We assessed discrimination and concordance between the 8 signatures on survival and recurrence rates. The Nottingham Prognostic Index (NPI) was used to stratify the risk of recurrence/death. RESULTS The signatures as a whole showed fair discrimination, with AUC ranging from 0.64 (95% CI 0.55-0.73) for the 76-gene signature to 0.72 (95% CI 0.64-0.8) for the Molecular Prognosis Index T17. Low concordance was found in predicting events in the intermediate and high-risk groups, as defined by the NPI. The low-risk group was the only subgroup with good concordance between signatures. CONCLUSION Genomic signatures may be a good option to predict prognosis as most of them perform well at the population level. They exhibit, however, a high degree of discordance in the intermediate and high-risk groups. The major benefit that we could expect from gene expression signatures is the standardization of proliferation assessment.
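The survival analyses mentioned in the abstract above rely on the Kaplan-Meier estimate of the survival function. As a reminder of what that estimator computes, here is a minimal plain-Python sketch of the product-limit formula, S(t) = product over event times t_i <= t of (1 - d_i / n_i), where d_i is the number of events at t_i and n_i the number still at risk. The follow-up times and event flags are invented for illustration and are unrelated to the study data.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times:  follow-up time for each subject
    events: 1 if the event (e.g. death or recurrence) was observed, 0 if censored
    Returns a list of (time, survival_probability) at each distinct event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all subjects sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        # Both events and censored subjects leave the risk set after time t.
        n_at_risk -= leaving
    return curve


# Invented example: seven subjects, two of them censored.
times = [2, 3, 3, 5, 8, 8, 9]
events = [1, 1, 0, 1, 0, 1, 1]
for t, s in kaplan_meier(times, events):
    print(f"t={t}: S(t)={s:.3f}")
```

Censored subjects contribute to the number at risk up to their censoring time but never trigger a drop in the curve, which is what distinguishes this estimate from a naive proportion surviving.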
Affiliation(s)
- Enora Laas
  - Institut Curie, Department of Surgery, Paris, France
  - Hopital Tenon, Department of Gynaecologic Surgery, Paris, France
- Peter Mallon
  - Institut Curie, Department of Surgery, Paris, France
  - Craigavon Area Hospital Breast Unit, Portadown Northern Ireland, BT63 5QQ
- Francois P. Duhoux
  - Institut Curie, Department of Medical Oncology, Paris, France
  - Centre du Cancer, Cliniques universitaires Saint-Luc, Université catholique de Louvain, B-1200 Brussels, Belgium
- Roman Rouzier
  - Institut Curie, Department of Surgery, Paris, France
- Fabien Reyal
  - Institut Curie, Department of Surgery, Paris, France
  - Hopital Tenon, Department of Gynaecologic Surgery, Paris, France
  - Institut Curie, Department of Medical Oncology, Paris, France
  - Centre du Cancer, Cliniques universitaires Saint-Luc, Université catholique de Louvain, B-1200 Brussels, Belgium
  - Institut Curie, Translational Research Department, Residual Tumor and Response to Treatment, RT2Lab, Paris, France
  - Institut Curie, UMR932, Immunity and Cancer, Paris, France
|
31
|
Wang H, Yang F, Luo Z. An experimental study of the intrinsic stability of random forest variable importance measures. BMC Bioinformatics 2016; 17:60. [PMID: 26842629 PMCID: PMC4739337 DOI: 10.1186/s12859-016-0900-5] [Citation(s) in RCA: 90] [Impact Index Per Article: 10.0] [Received: 06/16/2015] [Accepted: 12/15/2015] [Indexed: 12/27/2022]
Abstract
Background The stability of Variable Importance Measures (VIMs) based on random forest has recently received increased attention. Despite the extensive attention on traditional stability of data perturbations or parameter variations, few studies include influences coming from the intrinsic randomness in generating VIMs, i.e. bagging, randomization and permutation. To address these influences, in this paper we introduce a new concept of intrinsic stability of VIMs, which is defined as the self-consistency among feature rankings in repeated runs of VIMs without data perturbations and parameter variations. Two widely used VIMs, i.e., Mean Decrease Accuracy (MDA) and Mean Decrease Gini (MDG) are comprehensively investigated. The motivation of this study is two-fold. First, we empirically verify the prevalence of intrinsic stability of VIMs over many real-world datasets to highlight that the instability of VIMs does not originate exclusively from data perturbations or parameter variations, but also stems from the intrinsic randomness of VIMs. Second, through Spearman and Pearson tests we comprehensively investigate how different factors influence the intrinsic stability. Results The experiments are carried out on 19 benchmark datasets with diverse characteristics, including 10 high-dimensional and small-sample gene expression datasets. Experimental results demonstrate the prevalence of intrinsic stability of VIMs. Spearman and Pearson tests on the correlations between intrinsic stability and different factors show that #feature (number of features) and #sample (size of sample) have a coupling effect on the intrinsic stability. The synthetic indicator, #feature/#sample, shows both negative monotonic correlation and negative linear correlation with the intrinsic stability, while OOB accuracy has monotonic correlations with intrinsic stability. This indicates that high-dimensional, small-sample and high complexity datasets may suffer more from intrinsic instability of VIMs. Furthermore, with respect to parameter settings of random forest, a large number of trees is preferred. No significant correlations can be seen between intrinsic stability and other factors. Finally, the magnitude of intrinsic stability is always smaller than that of traditional stability. Conclusion First, the prevalence of intrinsic stability of VIMs demonstrates that the instability of VIMs not only comes from data perturbations or parameter variations, but also stems from the intrinsic randomness of VIMs. This finding gives a better understanding of VIM stability, and may help reduce the instability of VIMs. Second, by investigating the potential factors of intrinsic stability, users would be more aware of the risks and hence more careful when using VIMs, especially on high-dimensional, small-sample and high complexity datasets.
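The abstract above defines intrinsic stability as the self-consistency of feature rankings across repeated runs on unchanged data and parameters. One plausible way to quantify that idea, sketched below in plain Python, is the mean pairwise Spearman rank correlation between the importance vectors of repeated runs. This is an illustrative assumption, not necessarily the stability index used by the authors, and the importance scores are invented rather than produced by an actual random forest.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values from 1..n in ascending order, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_pairwise_spearman(runs):
    """Average Spearman correlation over all pairs of repeated runs."""
    pairs = list(combinations(runs, 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)


# Invented importance scores for 5 features from 3 repeated runs
# (same data, same parameters, different random seeds).
runs = [
    [0.40, 0.25, 0.20, 0.10, 0.05],
    [0.38, 0.22, 0.26, 0.09, 0.05],
    [0.41, 0.24, 0.21, 0.04, 0.10],
]
print(f"mean pairwise Spearman = {mean_pairwise_spearman(runs):.4f}")
```

A value near 1 means the repeated runs rank the features almost identically. In practice the importance vectors would come from refitting the same forest with different random seeds.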
Affiliation(s)
- Huazhen Wang
  - College of Computer Science and Technology, Huaqiao University, Jimei Avenue, Xiamen, 361021, China
  - Computer Learning Research Centre, Royal Holloway, University of London, Egham, Surrey, TW20 0EX, UK
- Fan Yang
  - Automation Department, Xiamen University, Siming South Road, Xiamen, 361005, China
- Zhiyuan Luo
  - Computer Learning Research Centre, Royal Holloway, University of London, Egham, Surrey, TW20 0EX, UK
|
32
|
Kamkar I, Gupta SK, Phung D, Venkatesh S. Stabilizing l1-norm prediction models by supervised feature grouping. J Biomed Inform 2015; 59:149-68. [PMID: 26689771 DOI: 10.1016/j.jbi.2015.11.012] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 05/14/2015] [Revised: 11/18/2015] [Accepted: 11/23/2015] [Indexed: 01/05/2023]
Abstract
Emerging Electronic Medical Records (EMRs) have transformed modern healthcare. These records have great potential for building clinical prediction models. However, a problem in using them is their high dimensionality. Since a lot of information may not be relevant for prediction, the underlying complexity of the prediction models may not be high. A popular way to deal with this problem is to employ feature selection. Lasso and l1-norm based feature selection methods have shown promising results. However, in the presence of correlated features, these methods select features that change considerably with small changes in the data. This prevents clinicians from obtaining a stable feature set, which is crucial for clinical decision making. Grouping correlated variables together can improve the stability of feature selection; however, such grouping is usually not known and needs to be estimated for optimal performance. Addressing this problem, we propose a new model that can simultaneously learn the grouping of correlated features and perform stable feature selection. We formulate the model as a constrained optimization problem and provide an efficient solution with guaranteed convergence. Our experiments with both synthetic and real-world datasets show that the proposed model is significantly more stable than Lasso and many existing state-of-the-art shrinkage and classification methods. We further show that in terms of prediction performance, the proposed method consistently outperforms Lasso and other baselines. Our model can be used for selecting stable risk factors for a variety of healthcare problems, so it can assist clinicians toward accurate decision making.
Affiliation(s)
- Iman Kamkar
  - Centre for Pattern Recognition and Data Analytics, Deakin University, Australia
- Sunil Kumar Gupta
  - Centre for Pattern Recognition and Data Analytics, Deakin University, Australia
- Dinh Phung
  - Centre for Pattern Recognition and Data Analytics, Deakin University, Australia
- Svetha Venkatesh
  - Centre for Pattern Recognition and Data Analytics, Deakin University, Australia
|
33
|
Lai HM, Albrecht AA, Steinhöfel KK. iRDA: a new filter towards predictive, stable, and enriched candidate genes. BMC Genomics 2015; 16:1041. [PMID: 26647162 PMCID: PMC4673793 DOI: 10.1186/s12864-015-2129-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 04/09/2015] [Accepted: 10/22/2015] [Indexed: 11/28/2022]
Abstract
Background Gene expression profiling using high-throughput screening (HTS) technologies allows clinical researchers to find prognosis gene signatures that could better discriminate between different phenotypes and serve as potential biological markers in disease diagnoses. In recent years, many feature selection methods have been devised for finding such discriminative genes, and more recently information theoretic filters have also been introduced for capturing feature-to-class relevance and feature-to-feature correlations in microarray-based classification. Methods In this paper, we present and fully formulate a new multivariate filter, iRDA, for the discovery of HTS gene-expression candidate genes. The filter constitutes a four-step framework and includes feature relevance, feature redundancy, and feature interdependence in the context of feature-pairs. The method is based upon approximate Markov blankets, information theory, several heuristic search strategies with forward, backward and insertion phases, and the method is aiming at higher order gene interactions. Results To show the strengths of iRDA, three performance measures, two evaluation schemes, two stability index sets, and the gene set enrichment analysis (GSEA) are all employed in our experimental studies. Its effectiveness has been validated by using seven well-known cancer gene-expression benchmarks and four other disease experiments, including a comparison to three popular information theoretic filters. In terms of classification performance, candidate genes selected by iRDA perform better than the sets discovered by the other three filters. Two stability measures indicate that iRDA is the most robust with the least variance. GSEA shows that iRDA produces more statistically enriched gene sets on five out of the six benchmark datasets. 
Conclusions Through the classification performance, the stability performance, and the enrichment analysis, iRDA is a promising filter to find predictive, stable, and enriched gene-expression candidate genes.
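The abstract above refers to information-theoretic measures of feature-to-class relevance. As a minimal sketch of that general idea only (not the iRDA filter itself, whose formulation is not given here), the mutual information between a discretized gene-expression feature and the class labels could be estimated as follows. The function name `mutual_information` and the assumption that expression values are already discretized are illustrative choices, not taken from the paper.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Estimate I(feature; labels) in bits from two equal-length discrete sequences.

    Hypothetical helper for illustration: probabilities are plain
    relative frequencies, with no smoothing or bias correction.
    """
    if len(feature) != len(labels) or not feature:
        raise ValueError("feature and labels must be non-empty and equal in length")
    n = len(feature)
    joint = Counter(zip(feature, labels))  # counts of (x, y) pairs
    count_x = Counter(feature)             # marginal counts of feature values
    count_y = Counter(labels)              # marginal counts of class labels
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi
```

A feature that determines the class exactly carries the full class entropy, whereas a feature independent of the class scores zero; a relevance filter would rank genes by this kind of score before addressing redundancy between them.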
Affiliation(s)
- Hung-Ming Lai
- Algorithms and Bioinformatics Research Group, Department of Informatics, King's College London, Strand, London, WC2R 2LS, UK.
| | - Andreas A Albrecht
- School of Science and Technology, Middlesex University, Burroughs, London, NW4 4BT, UK.
| | - Kathleen K Steinhöfel
- Algorithms and Bioinformatics Research Group, Department of Informatics, King's College London, Strand, London, WC2R 2LS, UK.
|
34
|
Park J, Lee J, Choi C. Evaluation of drug-targetable genes by defining modes of abnormality in gene expression. Sci Rep 2015; 5:13576. [PMID: 26336805 PMCID: PMC4559746 DOI: 10.1038/srep13576] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 05/13/2015] [Accepted: 07/31/2015] [Indexed: 12/25/2022]
Abstract
In the post-genomic era, many researchers have taken a systematic approach to identifying abnormal genes associated with various diseases. However, the gold standard has not been established, and most of these abnormalities are difficult to be rehabilitated in real clinical settings. In addition to identifying abnormal genes, for a practical purpose, it is necessary to investigate abnormality diversity. In this context, this study is aimed to demonstrate simply restorable genes as useful drug targets. We devised the concept of “drug targetability” to evaluate several different modes of abnormal genes by predicting events after drug treatment. As a representative example, we applied our method to breast cancer. Computationally, PTPRF, PRKAR2B, MAP4K3, and RICTOR were calculated as highly drug-targetable genes for breast cancer. After knockdown of these top-ranked genes (i.e., high drug targetability) using siRNA, our predictions were validated by cell death and migration assays. Moreover, inhibition of RICTOR or PTPRF was expected to prolong lifespan of breast cancer patients according to patient information annotated in microarray data. We anticipate that our method can be widely applied to elaborate selection of novel drug targets, and, ultimately, to improve the efficacy of disease treatment.
Affiliation(s)
- Junseong Park
- Department of Bio and Brain Engineering, KAIST, Daejeon, 305-701, Republic of Korea
| | - Jungsul Lee
- Department of Bio and Brain Engineering, KAIST, Daejeon, 305-701, Republic of Korea
| | - Chulhee Choi
- Department of Bio and Brain Engineering, KAIST, Daejeon, 305-701, Republic of Korea.,KAIST Institute for the BioCentury, KAIST, Daejeon, 305-701, Republic of Korea
|
35
|
Lai HM, Özturk C, Albrecht A, Steinhöfel K. A new vision of evaluating gene expression signatures. Comput Biol Chem 2015; 57:54-60. [PMID: 25748535 DOI: 10.1016/j.compbiolchem.2015.02.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/02/2015] [Accepted: 02/03/2015] [Indexed: 10/23/2022]
Abstract
Gene expression profiles based on high-throughput technologies contribute to molecular classifications of different cell lines and consequently to clinical diagnostic tests for cancer types and other diseases. Statistical techniques and dimension reduction methods have been devised for identifying minimal gene subset with maximal discriminative power. For sets of in silico candidate genes, assuming a unique gene signature or performing a parsimonious signature evaluation seems to be too restrictive in the context of in vitro signature validation. This is mainly due to the high complexity of largely correlated expression measurements and the existence of various oncogenic pathways. Consequently, it might be more advantageous to identify and evaluate multiple gene signatures with a similar good predictive power, which are referred to as near-optimal signatures, to be made available for biological validation. For this purpose we propose the bead-chain-plot approach originating from swarm intelligence techniques, and a small scale computational experiment is conducted in order to convey our vision. We simulate the acquisition of candidate genes by using a small pool of differentially expressed genes derived from microarray-based CNS tumour data. The application of the bead-chain-plot provides experimental evidence for improved classifications by using near-optimal signatures in validation procedures.
Affiliation(s)
- Hung-Ming Lai
- Algorithms and Bioinformatics Research Group, Department of Informatics, King's College London, Strand, London WC2R 2LS, UK.
| | - Celal Özturk
- Department of Computer Engineering, Faculty of Engineering, Erciyes University, Kayseri 38039, Turkey.
| | - Andreas Albrecht
- School of Science and Technology, Middlesex University, Burroughs, London NW4 4BT, UK.
| | - Kathleen Steinhöfel
- Algorithms and Bioinformatics Research Group, Department of Informatics, King's College London, Strand, London WC2R 2LS, UK.
|
36
|
Vo NS, Phan V. Exploiting dependencies of pairwise comparison outcomes to predict patterns of gene response. BMC Bioinformatics 2014; 15 Suppl 11:S2. [PMID: 25350806 PMCID: PMC4251046 DOI: 10.1186/1471-2105-15-s11-s2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/10/2022]
Abstract
BACKGROUND The analysis of gene expression has played an important role in medical and bioinformatics research. Although it is known that a large number of samples is needed to determine the patterns of gene expression accurately, practical designs of gene expression studies occasionally have insufficient numbers of samples, making it difficult to ascertain true response patterns of variantly expressed genes. RESULTS We describe an approach to cope with the challenge of predicting true orders of gene response to treatments. We show that true patterns of gene response must be orderable sets. In experiments with few samples, we modify the conventional pairwise comparison tests and increase the significance level α intelligently to deduce orderable patterns, which are most likely true orders of gene response. Additionally, motivated by the fact that a gene can be involved in multiple biological functions, our method further resamples experimental replicates and predicts multiple response patterns for each gene. CONCLUSIONS This method can be useful in designing cost-effective experiments with small sample sizes. Patterns of highly-variantly expressed genes can be predicted by varying α intelligently. Furthermore, clusters are labeled meaningfully with patterns that describe precisely how genes in such clusters respond to treatments.
|
37
|
Feng L, Wang J, Cao B, Zhang Y, Wu B, Di X, Jiang W, An N, Lu D, Gao S, Zhao Y, Chen Z, Mao Y, Gao Y, Zhou D, Jen J, Liu X, Zhang Y, Li X, Zhang K, He J, Cheng S. Gene expression profiling in human lung development: an abundant resource for lung adenocarcinoma prognosis. PLoS One 2014; 9:e105639. [PMID: 25141350 PMCID: PMC4139381 DOI: 10.1371/journal.pone.0105639] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Received: 04/16/2014] [Accepted: 07/22/2014] [Indexed: 02/05/2023]
Abstract
A tumor can be viewed as a special “organ” that undergoes aberrant and poorly regulated organogenesis. Progress in cancer prognosis and therapy might be facilitated by re-examining distinctive processes that operate during normal development, to elucidate the intrinsic features of cancer that are significantly obscured by its heterogeneity. The global gene expression signatures of 44 human lung tissues at four development stages from Asian descent and 69 lung adenocarcinoma (ADC) tissue samples from ethnic Chinese patients were profiled using microarrays. All of the genes were classified into 27 distinct groups based on their expression patterns (named as PTN1 to PTN27) during the developmental process. In lung ADC, genes whose expression levels decreased steadily during lung development (genes in PTN1) generally had their expression reactivated, while those with uniformly increasing expression levels (genes in PTN27) had their expression suppressed. The genes in PTN1 contain many n-gene signatures that are of prognostic value for lung ADC. The prognostic relevance of a 12-gene demonstrator for patient survival was characterized in five cohorts of healthy and ADC patients [ADC_CICAMS (n = 69, p = 0.007), ADC_PNAS (n = 125, p = 0.0063), ADC_GSE13213 (n = 117, p = 0.0027), ADC_GSE8894 (n = 62, p = 0.01), and ADC_NCI (n = 282, p = 0.045)] and in four groups of stage I patients [ADC_CICAMS (n = 22, p = 0.017), ADC_PNAS (n = 76, p = 0.018), ADC_GSE13213 (n = 79, p = 0.02), and ADC_qPCR (n = 62, p = 0.006)]. In conclusion, by comparison of gene expression profiles during human lung developmental process and lung ADC progression, we revealed that the genes with a uniformly decreasing expression pattern during lung development are of enormous prognostic value for lung ADC.
Affiliation(s)
- Lin Feng
- State Key Laboratory of Molecular Oncology, Department of Etiology and Carcinogenesis, Cancer Hospital and Institute, Peking Union Medical College and Chinese Academy of Medical Sciences, Beijing, China
| | - Jiamei Wang
- Department of Gynaecology and Obstetrics, Maternal & Child Health Care hospital of Haidian, Beijing, China
| | - Bangrong Cao
- State Key Laboratory of Molecular Oncology, Department of Etiology and Carcinogenesis, Cancer Hospital and Institute, Peking Union Medical College and Chinese Academy of Medical Sciences, Beijing, China
| | - Yi Zhang
- Departments of Thoracic Surgery, Xuanwu Hospital, Capital Medical University, Beijing, China
| | - Bo Wu
- Department of Histology and Embryology, School of Basic Medical Sciences, Capital Medical University, Beijing, China
| | - Xuebing Di
- State Key Laboratory of Molecular Oncology, Department of Etiology and Carcinogenesis, Cancer Hospital and Institute, Peking Union Medical College and Chinese Academy of Medical Sciences, Beijing, China
| | - Wei Jiang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
| | - Ning An
- State Key Laboratory of Molecular Oncology, Department of Etiology and Carcinogenesis, Cancer Hospital and Institute, Peking Union Medical College and Chinese Academy of Medical Sciences, Beijing, China
| | - Dan Lu
- State Key Laboratory of Molecular Oncology, Department of Etiology and Carcinogenesis, Cancer Hospital and Institute, Peking Union Medical College and Chinese Academy of Medical Sciences, Beijing, China
| | - Suhong Gao
- Department of Gynaecology and Obstetrics, Maternal & Child Health Care hospital of Haidian, Beijing, China
| | - Yuda Zhao
- Departments of Thoracic Surgery, Cancer Hospital and Institute, Peking Union Medical College and Chinese Academy of Medical Sciences, Beijing, China
| | - Zhaoli Chen
- Departments of Thoracic Surgery, Cancer Hospital and Institute, Peking Union Medical College and Chinese Academy of Medical Sciences, Beijing, China
| | - Yousheng Mao
- Departments of Thoracic Surgery, Cancer Hospital and Institute, Peking Union Medical College and Chinese Academy of Medical Sciences, Beijing, China
| | - Yanning Gao
- State Key Laboratory of Molecular Oncology, Department of Etiology and Carcinogenesis, Cancer Hospital and Institute, Peking Union Medical College and Chinese Academy of Medical Sciences, Beijing, China
| | - Deshan Zhou
- Department of Histology and Embryology, School of Basic Medical Sciences, Capital Medical University, Beijing, China
| | - Jin Jen
- Medical Genome Facility, and the Department of Laboratory Medicine and Pathology, Mayo Clinic. Rochester, Minnesota, United States of America
| | - Xiaohong Liu
- Department of Gynaecology and Obstetrics, Maternal & Child Health Care hospital of Haidian, Beijing, China
| | - Yunping Zhang
- Department of Gynaecology and Obstetrics, Maternal & Child Health Care hospital of Haidian, Beijing, China
| | - Xia Li
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
| | - Kaitai Zhang
- State Key Laboratory of Molecular Oncology, Department of Etiology and Carcinogenesis, Cancer Hospital and Institute, Peking Union Medical College and Chinese Academy of Medical Sciences, Beijing, China
| | - Jie He
- Departments of Thoracic Surgery, Cancer Hospital and Institute, Peking Union Medical College and Chinese Academy of Medical Sciences, Beijing, China
| | - Shujun Cheng
- State Key Laboratory of Molecular Oncology, Department of Etiology and Carcinogenesis, Cancer Hospital and Institute, Peking Union Medical College and Chinese Academy of Medical Sciences, Beijing, China
|
38
|
Sharma P, Stecklein SR, Kimler BF, Sethi G, Petroff BK, Phillips TA, Tawfik OW, Godwin AK, Jensen RA. The prognostic value of BRCA1 promoter methylation in early stage triple negative breast cancer. ACTA ACUST UNITED AC 2014; 3:1-11. [PMID: 25177489 PMCID: PMC4147783 DOI: 10.7243/2049-7962-3-2] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.8] [Indexed: 12/31/2022]
Abstract
Introduction Methylation of the BRCA1 promoter is frequent in triple negative breast cancers (TNBC) and results in a tumor phenotype similar to BRCA1-mutated tumors. BRCA1 mutation-associated cancers are more sensitive to DNA damaging agents as compared to conventional chemotherapy agents. It is not known if there is an interaction between the presence of BRCA1 promoter methylation (PM) and response to chemotherapy agents in sporadic TNBC. We sought to investigate the prognostic significance of BRCA1 PM in TNBC patients receiving standard chemotherapy. Methods Subjects with stage I-III TNBC treated with chemotherapy were identified and their formalin-fixed paraffin-embedded (FFPE) tumor specimens retrieved. Genomic DNA was isolated and subjected to methylation-specific PCR (MSPCR). Results DNA was isolated from primary tumor of 39 subjects. BRCA1 PM was detected in 30% of patients. Presence of BRCA1 PM was associated with lower BRCA1 transcript levels, suggesting epigenetic BRCA1 silencing. All patients received chemotherapy (anthracycline:90%, taxane:69%). At a median follow-up of 64 months, 46% of patients have recurred and 36% have died. On univariate analysis, African-American race, node positivity, stage, and BRCA1 PM were associated with worse RFS and OS. Five year OS was 36% for patients with BRCA1 PM vs. 77% for patients without BRCA1 PM (p=0.004). On multivariable analysis, BRCA1 PM was associated with significantly worse RFS and OS. Conclusions We show that BRCA1 PM is common in TNBC and has the potential to identify a significant fraction of TNBC patients who have suboptimal outcomes with standard chemotherapy.
Affiliation(s)
- Priyanka Sharma
- Division of Hematology/Oncology, Department of Internal Medicine, University of Kansas Medical Center, Kansas City, Kansas, USA
| | - Shane R Stecklein
- Department of Pathology and Laboratory Medicine, University of Kansas Medical Center, Kansas City, Kansas, USA.,The University of Kansas Cancer Center, University of Kansas Medical Center, Kansas City, Kansas, USA
| | - Bruce F Kimler
- Department of Radiation Oncology, University of Kansas Medical Center, Kansas City, Kansas, USA.,Breast Cancer Prevention Center, University of Kansas Medical Center, Kansas City, Kansas, USA
| | - Geetika Sethi
- Department of Pathology and Laboratory Medicine, University of Kansas Medical Center, Kansas City, Kansas, USA.,Department of Biochemistry and Molecular Biology, Drexel University College of Medicine, Philadelphia, Pennsylvania, USA
| | - Brian K Petroff
- Division of Hematology/Oncology, Department of Internal Medicine, University of Kansas Medical Center, Kansas City, Kansas, USA.,Breast Cancer Prevention Center, University of Kansas Medical Center, Kansas City, Kansas, USA
| | - Teresa A Phillips
- Division of Hematology/Oncology, Department of Internal Medicine, University of Kansas Medical Center, Kansas City, Kansas, USA.,Breast Cancer Prevention Center, University of Kansas Medical Center, Kansas City, Kansas, USA
| | - Ossama W Tawfik
- Department of Pathology and Laboratory Medicine, University of Kansas Medical Center, Kansas City, Kansas, USA.,The University of Kansas Cancer Center, University of Kansas Medical Center, Kansas City, Kansas, USA
| | - Andrew K Godwin
- Department of Pathology and Laboratory Medicine, University of Kansas Medical Center, Kansas City, Kansas, USA.,The University of Kansas Cancer Center, University of Kansas Medical Center, Kansas City, Kansas, USA
| | - Roy A Jensen
- Department of Pathology and Laboratory Medicine, University of Kansas Medical Center, Kansas City, Kansas, USA.,The University of Kansas Cancer Center, University of Kansas Medical Center, Kansas City, Kansas, USA
|
39
|
Abstract
PURPOSE OF REVIEW As the induction and maintenance of donor-specific tolerance is a central aim in solid organ transplantation, it is essential that clinicians are able to identify and monitor tolerance accurately and reliably. This review highlights recent advances in defining sets of biomarkers in noninvasive samples that may guide minimization and withdrawal of immunosuppression in tolerant recipients. RECENT FINDINGS Recent studies in liver and kidney transplant recipients have identified distinct biomarker profiles that are associated with operational tolerance. Although there is some heterogeneity in the findings of these studies, these have suggested novel cellular mechanisms for the development of tolerance. SUMMARY Multiple platforms such as microarray gene expression analysis, flow cytometry, and immune cell functional assays have been used to discover and validate composite sets of biomarkers, which identify recipients with operational tolerance both in liver and kidney transplantation. These studies suggest that distinct cellular and molecular mechanisms lead to the development of tolerance in different transplanted organs. These putative biomarker profiles now need to be validated prospectively in trials of immunosuppression withdrawal and in novel approaches to induce transplant tolerance.
|
40
|
Stretch C, Khan S, Asgarian N, Eisner R, Vaisipour S, Damaraju S, Graham K, Bathe OF, Steed H, Greiner R, Baracos VE. Effects of sample size on differential gene expression, rank order and prediction accuracy of a gene signature. PLoS One 2013; 8:e65380. [PMID: 23755224 PMCID: PMC3670871 DOI: 10.1371/journal.pone.0065380] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.2] [Received: 11/02/2012] [Accepted: 04/24/2013] [Indexed: 12/26/2022]
Abstract
Top differentially expressed gene lists are often inconsistent between studies and it has been suggested that small sample sizes contribute to lack of reproducibility and poor prediction accuracy in discriminative models. We considered sex differences (69♂, 65♀) in 134 human skeletal muscle biopsies using DNA microarray. The full dataset and subsamples (n = 10 (5♂, 5♀) to n = 120 (60♂, 60♀)) thereof were used to assess the effect of sample size on the differential expression of single genes, gene rank order and prediction accuracy. Using our full dataset (n = 134), we identified 717 differentially expressed transcripts (p<0.0001) and we were able to predict sex with ∼90% accuracy, both within our dataset and on external datasets. Both p-values and rank order of top differentially expressed genes became more variable using smaller subsamples. For example, at n = 10 (5♂, 5♀), no gene was considered differentially expressed at p<0.0001 and prediction accuracy was ∼50% (no better than chance). We found that sample size clearly affects microarray analysis results; small sample sizes result in unstable gene lists and poor prediction accuracy. We anticipate this will apply to other phenotypes, in addition to sex.
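The balanced subsamples described in this abstract (for example 5♂ and 5♀ drawn from the full cohort) could be generated along the following lines. This is a sketch under stated assumptions: the function name `balanced_subsample`, the use of sampling without replacement, and the fixed seed are illustrative and not taken from the paper.

```python
import random


def balanced_subsample(group_a, group_b, n_per_group, seed=0):
    """Draw n_per_group items from each group, without replacement.

    Hypothetical helper: the seed makes repeated draws reproducible,
    which is convenient when comparing gene lists across subsample sizes.
    """
    if n_per_group > min(len(group_a), len(group_b)):
        raise ValueError("n_per_group exceeds the size of the smaller group")
    rng = random.Random(seed)
    return rng.sample(group_a, n_per_group), rng.sample(group_b, n_per_group)


# Illustrative sample identifiers only (69 male, 65 female, as in the abstract).
males = [f"M{i}" for i in range(69)]
females = [f"F{i}" for i in range(65)]
sub_m, sub_f = balanced_subsample(males, females, 5)
```

Repeating the draw with different seeds at each subsample size, and re-running the differential expression test on every draw, is one way to observe how variable the resulting gene ranks become as n shrinks.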
Affiliation(s)
- Cynthia Stretch
- Department of Oncology, University of Alberta, Cross Cancer Institute, Edmonton, Alberta, Canada
| | - Sheehan Khan
- Department of Computing Science, University of Alberta, Edmonton, AB, Canada
| | - Nasimeh Asgarian
- Department of Computing Science, University of Alberta, Edmonton, AB, Canada
- Alberta Innovates Centre for Machine Learning, Edmonton, AB, Canada
| | - Roman Eisner
- Department of Computing Science, University of Alberta, Edmonton, AB, Canada
- Alberta Innovates Centre for Machine Learning, Edmonton, AB, Canada
| | - Saman Vaisipour
- Department of Computing Science, University of Alberta, Edmonton, AB, Canada
- Alberta Innovates Centre for Machine Learning, Edmonton, AB, Canada
| | - Sambasivarao Damaraju
- Department of Oncology, University of Alberta, Cross Cancer Institute, Edmonton, Alberta, Canada
- Department of Laboratory Medicine and Pathology, University of Alberta, Edmonton, AB, Canada
| | - Kathryn Graham
- Department of Oncology, University of Alberta, Cross Cancer Institute, Edmonton, Alberta, Canada
| | - Oliver F. Bathe
- Department of Oncology, University of Calgary, Calgary, Alberta, Canada
- Department of Surgery, University of Calgary, Calgary, Alberta, Canada
| | - Helen Steed
- Department of Oncology, University of Alberta, Cross Cancer Institute, Edmonton, Alberta, Canada
| | - Russell Greiner
- Department of Computing Science, University of Alberta, Edmonton, AB, Canada
- Alberta Innovates Centre for Machine Learning, Edmonton, AB, Canada
| | - Vickie E. Baracos
- Department of Oncology, University of Alberta, Cross Cancer Institute, Edmonton, Alberta, Canada
|
41
|
Burton M, Thomassen M, Tan Q, Kruse TA. Gene expression profiles for predicting metastasis in breast cancer: a cross-study comparison of classification methods. ScientificWorldJournal 2012; 2012:380495. [PMID: 23251101 PMCID: PMC3515909 DOI: 10.1100/2012/380495] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Received: 08/25/2012] [Accepted: 10/02/2012] [Indexed: 12/20/2022]
Abstract
Machine learning has increasingly been used with microarray gene expression data and for the development of classifiers using a variety of methods. However, method comparisons in cross-study datasets are very scarce. This study compares the performance of seven classification methods and the effect of voting for predicting metastasis outcome in breast cancer patients, in three situations: within the same dataset or across datasets on similar or dissimilar microarray platforms. Combining classification results from seven classifiers into one voting decision performed significantly better during internal validation as well as external validation in similar microarray platforms than the underlying classification methods. When validating between different microarray platforms, random forest, another voting-based method, proved to be the best performing method. We conclude that voting based classifiers provided an advantage with respect to classifying metastasis outcome in breast cancer patients.
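Combining the outputs of several classifiers into one voting decision, as this abstract describes, can be sketched with a plain majority vote. The abstract does not state the exact voting rule used, so the unweighted majority and the tie-break below are assumptions, and `majority_vote` is a hypothetical name.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample.

    predictions[k][i] is the label classifier k assigns to sample i.
    Ties are broken by taking the smallest label (an assumption); with an
    odd number of classifiers and two classes, ties cannot occur.
    """
    if not predictions:
        raise ValueError("need predictions from at least one classifier")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all classifiers must predict the same number of samples")
    combined = []
    for votes in zip(*predictions):  # votes for one sample across classifiers
        counts = Counter(votes)
        top = max(counts.values())
        combined.append(min(label for label, c in counts.items() if c == top))
    return combined


# Three illustrative classifiers, three samples (1 = metastasis, 0 = none).
votes = majority_vote([[1, 0, 1], [1, 1, 0], [0, 1, 1]])
```

Each individual classifier here is wrong on a different sample, yet the combined decision agrees on all three, which is the intuition behind the advantage reported for voting-based classifiers.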
Affiliation(s)
- Mark Burton
- Research Unit of Human Genetics, Institute of Clinical Research, University of Southern Denmark, Sdr. Boulevard 29, 5000 Odense C, Denmark.
|
42
|
Chen YK, Li KB. Predicting membrane protein types by incorporating protein topology, domains, signal peptides, and physicochemical properties into the general form of Chou's pseudo amino acid composition. J Theor Biol 2012; 318:1-12. [PMID: 23137835 DOI: 10.1016/j.jtbi.2012.10.033] [Citation(s) in RCA: 98] [Impact Index Per Article: 7.5] [Received: 09/21/2012] [Revised: 10/25/2012] [Accepted: 10/26/2012] [Indexed: 01/04/2023]
Abstract
The type information of un-annotated membrane proteins provides an important hint for their biological functions. The experimental determination of membrane protein types, despite being more accurate and reliable, is not always feasible due to the costly laboratory procedures, thereby creating a need for the development of bioinformatics methods. This article describes a novel computational classifier for the prediction of membrane protein types using proteins' sequences. The classifier, comprising a collection of one-versus-one support vector machines, makes use of the following sequence attributes: (1) the cationic patch sizes, the orientation, and the topology of transmembrane segments; (2) the amino acid physicochemical properties; (3) the presence of signal peptides or anchors; and (4) the specific protein motifs. A new voting scheme was implemented to cope with the multi-class prediction. Both the training and the testing sequences were collected from SwissProt. Homologous proteins were removed such that there is no pair of sequences left in the datasets with a sequence identity higher than 40%. The performance of the classifier was evaluated by a jackknife cross-validation and an independent testing experiment. Results show that the proposed classifier outperforms earlier predictors in prediction accuracy in seven of the eight membrane protein types. The overall accuracy was increased from 78.3% to 88.2%. Unlike earlier approaches, which largely depend on position-specific substitution matrices and amino acid compositions, most of the sequence attributes implemented in the proposed classifier are supported by evidence in the literature. The classifier has been deployed as a web server and can be accessed at http://bsaltools.ym.edu.tw/predmpt.
Affiliation(s)
- Yen-Kuang Chen
- Institute of Biomedical Informatics, National Yang-Ming University, No.155, Sec 2, Lih-Nong Street, Taipei, 112, Taiwan, ROC
|
43
|
Wang D, Zhang Y, Huang Y, Li P, Wang M, Wu R, Cheng L, Zhang W, Zhang Y, Li B, Wang C, Guo Z. Comparison of different normalization assumptions for analyses of DNA methylation data from the cancer genome. Gene 2012; 506:36-42. [PMID: 22771920 DOI: 10.1016/j.gene.2012.06.075] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 10/21/2011] [Revised: 06/21/2012] [Accepted: 06/22/2012] [Indexed: 01/02/2023]
Abstract
Some researchers normalize DNA methylation array data to remove the technical artifacts introduced by experimental differences in sample preparation, array processing and other factors. Others analyze DNA methylation arrays without normalization, on the grounds that current normalizations for methylation data may distort real differences between normal and cancer samples, because cancer genomes may be extensively subject to hypomethylation and the total amount of CpG methylation might differ substantially among samples. In this study, using eight datasets from the Infinium HumanMethylation27 assay, we systematically analyzed the global distribution of DNA methylation changes in cancer compared to normal controls and its effect on data normalization for selecting differentially methylated (DM) genes. We showed that more DM genes could be found in the Quantile/Lowess-normalized data than in the non-normalized data. We found that the DM genes additionally selected in the Quantile/Lowess-normalized data showed significantly consistent methylation states in another independent dataset for the same cancer, indicating that these extra DM genes were effective biological signals related to the disease. These results suggested that normalization can increase the power of detecting DM genes in the context of diagnostic markers, which are usually characterized by relatively large effect sizes. In addition, we evaluated the reproducibility of DM discoveries for a particular cancer type and found that most of the DM genes additionally detected in one dataset showed the same methylation directions in the other dataset for the same cancer type, indicating that these DM genes were effective biological signals in the other dataset. Furthermore, we showed that some DM genes detected from different studies for a particular cancer type were significantly reproducible at the functional level.
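Quantile normalization, one of the two normalizations named in this abstract, forces every sample to share the same empirical distribution. The sketch below shows the standard textbook variant in plain Python; the abstract does not specify the authors' exact procedure, so the function name `quantile_normalize`, the column-per-sample layout, and the positional tie handling are assumptions.

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of samples, each a list of probe values.

    Every sample must have the same number of probes. The r-th smallest
    value in each sample is replaced by the mean of the r-th smallest
    values across all samples. Tied values within a sample are ranked by
    position (an assumption; other variants average tied ranks).
    """
    if not samples or any(len(s) != len(samples[0]) for s in samples):
        raise ValueError("samples must be non-empty and equal in length")
    n_probes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Reference distribution: mean across samples at each rank.
    rank_means = [
        sum(s[r] for s in sorted_samples) / len(samples) for r in range(n_probes)
    ]
    normalized = []
    for s in samples:
        order = sorted(range(n_probes), key=lambda i: s[i])  # probe indices by rank
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = rank_means[rank]
        normalized.append(out)
    return normalized


# Two illustrative samples with three probes each.
result = quantile_normalize([[5, 2, 3], [4, 1, 6]])
```

Because every normalized sample ends up with identical sorted values, a genuine global shift such as widespread hypomethylation in tumours would be removed along with technical artifacts, which is the concern the abstract raises about normalizing cancer methylation data.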
Affiliation(s)
- Dong Wang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China.
|
44
|
Londoño MC, Danger R, Giral M, Soulillou JP, Sánchez-Fueyo A, Brouard S. A need for biomarkers of operational tolerance in liver and kidney transplantation. Am J Transplant 2012; 12:1370-7. [PMID: 22486792 DOI: 10.1111/j.1600-6143.2012.04035.x] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.4] [Indexed: 01/25/2023]
Abstract
Both kidney and particularly liver recipients can occasionally discontinue all immunosuppressive drugs without undergoing rejection. These patients, who maintain stable graft function off immunosuppressive drugs without clinically significant detrimental immune responses and/or immune deficits, are conventionally termed operationally tolerant and offer a unique paradigm of tolerance in humans. The immune characterization of operationally tolerant transplant recipients has recently received substantial attention. Operationally tolerant patients might exhibit a signature of tolerance that could potentially be useful to select recipients amenable to drug minimization or withdrawal. Furthermore, elucidation of the molecular pathways associated with the operational tolerance phenotype could provide novel targets for therapy. Particular emphasis has been placed on the use of blood samples and high-throughput transcriptional profiling techniques. In liver transplantation, natural killer related transcripts seem to be the most robust markers of operational tolerance, whereas enrichment in B cell-related gene expression markers has been consistently found in blood samples from operationally tolerant kidney recipients, suggesting that different mechanisms operate in the two situations. In this minireview, we summarize the main achievements of recently published reports focused on the identification of transcriptional markers of operational tolerance, we highlight their mechanistic and clinical implications and describe their methodological limitations.
Affiliation(s)
- M-C Londoño
- Liver Transplant Unit, Hospital Clinic, IDIBAPS, CIBEREHD, Barcelona, Spain
|
45
|
Busser BW, Taher L, Kim Y, Tansey T, Bloom MJ, Ovcharenko I, Michelson AM. A machine learning approach for identifying novel cell type-specific transcriptional regulators of myogenesis. PLoS Genet 2012; 8:e1002531. [PMID: 22412381 PMCID: PMC3297574 DOI: 10.1371/journal.pgen.1002531] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.6] [Received: 03/30/2011] [Accepted: 12/23/2011] [Indexed: 12/22/2022]
Abstract
Transcriptional enhancers integrate the contributions of multiple classes of transcription factors (TFs) to orchestrate the myriad spatio-temporal gene expression programs that occur during development. A molecular understanding of enhancers with similar activities requires the identification of both their unique and their shared sequence features. To address this problem, we combined phylogenetic profiling with a DNA-based enhancer sequence classifier that analyzes the TF binding sites (TFBSs) governing the transcription of a co-expressed gene set. We first assembled a small number of enhancers that are active in Drosophila melanogaster muscle founder cells (FCs) and other mesodermal cell types. Using phylogenetic profiling, we increased the number of enhancers by incorporating orthologous but divergent sequences from other Drosophila species. Functional assays revealed that the diverged enhancer orthologs were active in largely similar patterns as their D. melanogaster counterparts, although there was extensive evolutionary shuffling of known TFBSs. We then built and trained a classifier using this enhancer set and identified additional related enhancers based on the presence or absence of known and putative TFBSs. Predicted FC enhancers were over-represented in proximity to known FC genes; and many of the TFBSs learned by the classifier were found to be critical for enhancer activity, including POU homeodomain, Myb, Ets, Forkhead, and T-box motifs. Empirical testing also revealed that the T-box TF encoded by org-1 is a previously uncharacterized regulator of muscle cell identity. Finally, we found extensive diversity in the composition of TFBSs within known FC enhancers, suggesting that motif combinatorics plays an essential role in the cellular specificity exhibited by such enhancers. 
In summary, machine learning combined with evolutionary sequence analysis is useful for recognizing novel TFBSs and for facilitating the identification of cognate TFs that coordinate cell type-specific developmental gene expression patterns.
Affiliation(s)
- Brian W. Busser
- Laboratory of Developmental Systems Biology, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland, United States of America
- Leila Taher
- Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
- Yongsok Kim
- Laboratory of Developmental Systems Biology, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland, United States of America
- Terese Tansey
- Laboratory of Developmental Systems Biology, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland, United States of America
- Molly J. Bloom
- Laboratory of Developmental Systems Biology, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland, United States of America
- Ivan Ovcharenko
- Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
- * E-mail: (IO); (AMM)
- Alan M. Michelson
- Laboratory of Developmental Systems Biology, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland, United States of America
- * E-mail: (IO); (AMM)
|
46
|
Di Camillo B, Sanavia T, Martini M, Jurman G, Sambo F, Barla A, Squillario M, Furlanello C, Toffolo G, Cobelli C. Effect of size and heterogeneity of samples on biomarker discovery: synthetic and real data assessment. PLoS One 2012; 7:e32200. [PMID: 22403633 PMCID: PMC3293892 DOI: 10.1371/journal.pone.0032200] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Received: 06/30/2011] [Accepted: 01/24/2012] [Indexed: 01/04/2023] Open
Abstract
MOTIVATION The identification of robust lists of molecular biomarkers related to a disease is a fundamental step for early diagnosis and treatment. However, methodologies for the discovery of biomarkers using microarray data often provide results with limited overlap. These differences are imputable to 1) dataset size (few subjects with respect to the number of features); 2) heterogeneity of the disease; 3) heterogeneity of experimental protocols and computational pipelines employed in the analysis. In this paper, we focus on the first two issues and assess, both on simulated (through an in silico regulation network model) and real clinical datasets, the consistency of candidate biomarkers provided by a number of different methods. METHODS We extensively simulated the effect of heterogeneity characteristic of complex diseases on different sets of microarray data. Heterogeneity was reproduced by simulating both intrinsic variability of the population and the alteration of regulatory mechanisms. Population variability was simulated by modeling evolution of a pool of subjects; then, a subset of them underwent alterations in regulatory mechanisms so as to mimic the disease state. RESULTS The simulated data allowed us to outline advantages and drawbacks of different methods across multiple studies and varying number of samples and to evaluate precision of feature selection on a benchmark with known biomarkers. Although comparable classification accuracy was reached by different methods, the use of external cross-validation loops is helpful in finding features with a higher degree of precision and stability. Application to real data confirmed these results.
Affiliation(s)
- Tiziana Sanavia
- Information Engineering Department, University of Padova, Padova, Italy
- Matteo Martini
- Information Engineering Department, University of Padova, Padova, Italy
- Francesco Sambo
- Information Engineering Department, University of Padova, Padova, Italy
- Annalisa Barla
- Department of Computer and Information Science, University of Genova, Genova, Italy
- Gianna Toffolo
- Information Engineering Department, University of Padova, Padova, Italy
- Claudio Cobelli
- Information Engineering Department, University of Padova, Padova, Italy
|
47
|
Figueroa RL, Zeng-Treitler Q, Kandula S, Ngo LH. Predicting sample size required for classification performance. BMC Med Inform Decis Mak 2012; 12:8. [PMID: 22336388 PMCID: PMC3307431 DOI: 10.1186/1472-6947-12-8] [Citation(s) in RCA: 203] [Impact Index Per Article: 15.6] [Received: 06/30/2011] [Accepted: 02/15/2012] [Indexed: 01/13/2023] Open
Abstract
Background Supervised learning methods need annotated data in order to generate efficient models. Annotated data, however, is a relatively scarce resource and can be expensive to obtain. For both passive and active learning methods, there is a need to estimate the size of the annotated sample required to reach a performance target. Methods We designed and implemented a method that fits an inverse power law model to points of a given learning curve created using a small annotated training set. Fitting is carried out using nonlinear weighted least squares optimization. The fitted model is then used to predict the classifier's performance and confidence interval for larger sample sizes. For evaluation, the nonlinear weighted curve fitting method was applied to a set of learning curves generated using clinical text and waveform classification tasks with active and passive sampling methods, and predictions were validated using standard goodness of fit measures. As control we used an un-weighted fitting method. Results A total of 568 models were fitted and the model predictions were compared with the observed performances. Depending on the data set and sampling method, it took between 80 and 560 annotated samples to achieve mean average and root mean squared error below 0.01. Results also show that our weighted fitting method outperformed the baseline un-weighted method (p < 0.05). Conclusions This paper describes a simple and effective sample size prediction algorithm that conducts weighted fitting of learning curves. The algorithm outperformed an un-weighted algorithm described in previous literature. It can help researchers determine annotation sample size for supervised machine learning.
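The learning-curve fit described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the model form `score = a - b * size**(-c)`, the grid search over the exponent `c` (used in place of a general nonlinear optimizer), the function names, and the size-proportional weights are all assumptions made for illustration.

```python
def fit_inverse_power_law(sizes, scores, weights=None, c_grid=None):
    """Weighted least-squares fit of score ~ a - b * size**(-c).

    For a fixed exponent c the model is linear in (a, b), so each candidate c
    is solved in closed form and the c with the smallest weighted squared
    error is kept. The grid search is an illustrative stand-in for a
    nonlinear optimizer, not the method used in the paper.
    """
    if weights is None:
        weights = [1.0] * len(sizes)
    if c_grid is None:
        c_grid = [k / 100 for k in range(1, 201)]  # hypothetical grid: 0.01 .. 2.00

    best = None
    for c in c_grid:
        z = [x ** (-c) for x in sizes]
        # Weighted normal equations for y = a + d * z, where d = -b.
        sw = sum(weights)
        swz = sum(w * zi for w, zi in zip(weights, z))
        swzz = sum(w * zi * zi for w, zi in zip(weights, z))
        swy = sum(w * y for w, y in zip(weights, scores))
        swzy = sum(w * zi * y for w, zi, y in zip(weights, z, scores))
        det = sw * swzz - swz * swz
        if abs(det) < 1e-15:
            continue  # degenerate design matrix for this exponent
        a = (swy * swzz - swz * swzy) / det
        d = (sw * swzy - swz * swy) / det
        sse = sum(w * (y - (a + d * zi)) ** 2
                  for w, zi, y in zip(weights, z, scores))
        if best is None or sse < best[0]:
            best = (sse, a, -d, c)
    if best is None:
        raise ValueError("no exponent in c_grid gave a solvable fit")
    _, a, b, c = best
    return a, b, c


def predict_score(params, size):
    """Extrapolate the fitted curve to a larger annotated sample size."""
    a, b, c = params
    return a - b * size ** (-c)


# Synthetic learning curve (made-up numbers, not data from the paper).
sizes = [20, 40, 80, 160, 320]
scores = [0.9 - 0.8 * n ** -0.5 for n in sizes]
weights = [float(n) for n in sizes]  # assumption: trust larger samples more
params = fit_inverse_power_law(sizes, scores, weights)
print(params, predict_score(params, 1000))
```

Weighting later points more heavily reflects the idea that learning-curve points computed on larger training sets are less noisy; the exact weighting scheme here is a placeholder.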
Affiliation(s)
- Rosa L Figueroa
- Dep. Ing. Eléctrica, Facultad de Ingeniería, Universidad de Concepción, Concepción, Chile
|
48
|
Ben-Hamo R, Efroni S. Biomarker robustness reveals the PDGF network as driving disease outcome in ovarian cancer patients in multiple studies. BMC Syst Biol 2012; 6:3. [PMID: 22236809 PMCID: PMC3298526 DOI: 10.1186/1752-0509-6-3] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Received: 08/13/2011] [Accepted: 01/11/2012] [Indexed: 12/27/2022]
Abstract
Background Ovarian cancer causes more deaths than any other gynecological cancer. Identifying the molecular mechanisms that drive disease progression in ovarian cancer is a critical step in providing therapeutics, improving diagnostics, and affiliating clinical behavior with disease etiology. Identification of molecular interactions that stratify prognosis is key in facilitating a clinical-molecular perspective. Results The Cancer Genome Atlas has recently made available the molecular characteristics of more than 500 patients. We used the TCGA multi-analysis study, two additional datasets, and a set of computational algorithms that we developed. The computational algorithms are based on methods that identify network alterations and quantify network behavior through gene expression. We identify a network biomarker that significantly stratifies survival rates in ovarian cancer patients. Interestingly, expression levels of single genes or sets of genes do not explain the prognostic stratification. The discovered biomarker is composed of the network around the PDGF pathway. The biomarker enables prognosis stratification. Conclusion The work presented here demonstrates, through the power of gene-expression networks, the criticality of the PDGF network in driving disease course. In uncovering the specific interactions within the network that drive the phenotype, we catalyze targeted treatment, facilitate prognosis and offer a novel perspective into hidden disease heterogeneity.
|
49
|
Yao C, Li H, Shen X, He Z, He L, Guo Z. Reproducibility and concordance of differential DNA methylation and gene expression in cancer. PLoS One 2012; 7:e29686. [PMID: 22235325 PMCID: PMC3250460 DOI: 10.1371/journal.pone.0029686] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Received: 08/22/2011] [Accepted: 12/01/2011] [Indexed: 12/11/2022] Open
Abstract
Background Hundreds of genes with differential DNA methylation of promoters have been identified for various cancers. However, the reproducibility of differential DNA methylation discoveries for cancer and the relationship between DNA methylation and aberrant gene expression have not been systematically analysed. Methodology/Principal Findings Using array data for seven types of cancers, we first evaluated the effects of experimental batches on differential DNA methylation detection. Second, we compared the directions of DNA methylation changes detected from different datasets for the same cancer. Third, we evaluated the concordance between methylation and gene expression changes. Finally, we compared DNA methylation changes in different cancers. For a given cancer, the directions of methylation and expression changes detected from different datasets, excluding potential batch effects, were highly consistent. In different cancers, DNA hypermethylation was highly inversely correlated with the down-regulation of gene expression, whereas hypomethylation was only weakly correlated with the up-regulation of genes. Finally, we found that genes commonly hypomethylated in different cancers primarily performed functions associated with chronic inflammation, such as ‘keratinization’, ‘chemotaxis’ and ‘immune response’. Conclusions Batch effects could greatly affect the discovery of DNA methylation biomarkers. For a particular cancer, both differential DNA methylation and gene expression can be reproducibly detected from different studies with no batch effects. While DNA hypermethylation is significantly linked to gene down-regulation, hypomethylation is only weakly correlated with gene up-regulation and is likely to be linked to chronic inflammation.
Affiliation(s)
- Chen Yao
- Bioinformatics Centre and Key Laboratory for NeuroInformation of the Education Ministry of China, School of Life Science, University of Electronic Science and Technology of China, Chengdu, China
- Hongdong Li
- Bioinformatics Centre and Key Laboratory for NeuroInformation of the Education Ministry of China, School of Life Science, University of Electronic Science and Technology of China, Chengdu, China
- Xiaopei Shen
- Bioinformatics Centre and Key Laboratory for NeuroInformation of the Education Ministry of China, School of Life Science, University of Electronic Science and Technology of China, Chengdu, China
- Zheng He
- Bioinformatics Centre and Key Laboratory for NeuroInformation of the Education Ministry of China, School of Life Science, University of Electronic Science and Technology of China, Chengdu, China
- Lang He
- Bioinformatics Centre and Key Laboratory for NeuroInformation of the Education Ministry of China, School of Life Science, University of Electronic Science and Technology of China, Chengdu, China
- Zheng Guo
- Bioinformatics Centre and Key Laboratory for NeuroInformation of the Education Ministry of China, School of Life Science, University of Electronic Science and Technology of China, Chengdu, China
- Colleges of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
|
50
|
Hess KR, Wei C, Qi Y, Iwamoto T, Symmans WF, Pusztai L. Lack of sufficiently strong informative features limits the potential of gene expression analysis as predictive tool for many clinical classification problems. BMC Bioinformatics 2011; 12:463. [PMID: 22132775 PMCID: PMC3245512 DOI: 10.1186/1471-2105-12-463] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 07/21/2011] [Accepted: 12/01/2011] [Indexed: 02/07/2023] Open
Abstract
Background Our goal was to examine how various aspects of a gene signature influence the success of developing multi-gene prediction models. We inserted gene signatures into three real data sets by altering the expression level of existing probe sets. We varied the number of probe sets perturbed (signature size), the fold increase of mean probe set expression in perturbed compared to unperturbed data (signature strength) and the number of samples perturbed. Prediction models were trained to identify which cases had been perturbed. Performance was estimated using Monte-Carlo cross validation. Results Signature strength had the greatest influence on predictor performance. It was possible to develop almost perfect predictors with as few as 10 features if the fold difference in mean expression values was > 2, even when the spiked samples represented 10% of all samples. We also assessed the gene signature set size and strength for 9 real clinical prediction problems in six different breast cancer data sets. Conclusions We found sufficiently large and strong predictive signatures only for distinguishing ER-positive from ER-negative cancers; there were no strong signatures for more subtle prediction problems. Current statistical methods efficiently identify highly informative features in gene expression data if such features exist, and accurate models can be built with as few as 10 highly informative features. Features can be considered highly informative if at least a 2-fold expression difference exists between comparison groups, but such features do not appear to be common for many clinically relevant prediction problems in human data sets.
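The spike-in design and Monte-Carlo cross validation mentioned in this abstract can be sketched roughly as follows. This is a toy illustration rather than the authors' pipeline: the nearest-centroid classifier, the multiplication of linear-scale values by the fold change, the synthetic Gaussian "expression" matrix, and every function name and parameter value are assumptions.

```python
import random


def spike_signature(data, n_features, fold, fraction, rng):
    """Multiply the first n_features of a random subset of samples by `fold`.

    Returns the perturbed copy and 0/1 labels marking the spiked samples.
    """
    n = len(data)
    k = max(1, round(fraction * n))
    chosen = set(rng.sample(range(n), k))
    labels = [1 if i in chosen else 0 for i in range(n)]
    spiked = [
        [v * fold if (i in chosen and j < n_features) else v
         for j, v in enumerate(row)]
        for i, row in enumerate(data)
    ]
    return spiked, labels


def centroid(rows):
    """Per-feature mean of a list of samples."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(u, v):
    """Squared Euclidean distance between two samples."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def monte_carlo_cv(x, y, n_splits, test_fraction, rng):
    """Mean accuracy of a nearest-centroid classifier over random splits."""
    idx = list(range(len(x)))
    n_test = max(1, int(test_fraction * len(x)))
    accuracies = []
    for _ in range(n_splits):
        rng.shuffle(idx)  # a fresh random train/test partition each round
        test, train = idx[:n_test], idx[n_test:]
        groups = {lab: [x[i] for i in train if y[i] == lab] for lab in (0, 1)}
        if not groups[0] or not groups[1]:
            continue  # skip splits whose training part lacks a class
        cents = {lab: centroid(rows) for lab, rows in groups.items()}
        hits = sum(
            1 for i in test
            if min(cents, key=lambda lab: sq_dist(x[i], cents[lab])) == y[i]
        )
        accuracies.append(hits / len(test))
    if not accuracies:
        raise ValueError("no split contained both classes in training")
    return sum(accuracies) / len(accuracies)


# Synthetic stand-in for an expression matrix: 40 samples x 20 probe sets.
rng = random.Random(0)
baseline = [[rng.gauss(10.0, 0.1) for _ in range(20)] for _ in range(40)]
spiked, labels = spike_signature(baseline, n_features=10, fold=2.0,
                                 fraction=0.5, rng=rng)
print(monte_carlo_cv(spiked, labels, n_splits=50, test_fraction=0.25, rng=rng))
```

Unlike k-fold cross validation, each Monte-Carlo round draws an independent random split, so the number of rounds can be chosen independently of the test-set size.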
Affiliation(s)
- Kenneth R Hess
- Department of Biostatistics, University of Texas MD Anderson Cancer Center, Houston, USA
|