1
Javed MF, Fawad M, Lodhi R, Najeh T, Gamil Y. Forecasting the strength of preplaced aggregate concrete using interpretable machine learning approaches. Sci Rep 2024; 14:8381. [PMID: 38600161 PMCID: PMC11006863 DOI: 10.1038/s41598-024-57896-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2023] [Accepted: 03/22/2024] [Indexed: 04/12/2024] Open
Abstract
Preplaced aggregate concrete (PAC), also known as two-stage concrete (TSC), is widely used in construction engineering for various applications. To produce PAC, a mixture of Portland cement, sand, and admixtures is injected into a mold after the coarse aggregate has been deposited. This process complicates the prediction of compressive strength (CS) and demands thorough investigation. Consequently, this study focuses on improving the understanding of PAC compressive strength using machine learning models. Thirteen models are evaluated with 261 data points and eleven input variables. The results show that XGBoost achieves the highest accuracy, with a correlation coefficient of 0.9791 and a normalized coefficient of determination (R²) of 0.9583. Gradient boosting (GB) and CatBoost (CB) also perform well. In addition, AdaBoost, the voting regressor, and random forest yield precise predictions with low mean absolute error (MAE) and root mean square error (RMSE) values. The sensitivity analysis (SA) reveals the significant impact of key input parameters on overall model sensitivity. Notably, gravel takes the lead with a substantial 44.7% contribution, followed by sand at 19.5%, cement at 15.6%, and fly ash and GGBS at 5.9% and 5.1%, respectively. The best-fit model, XGBoost, was used for SHAP analysis to assess the relative importance of contributing attributes and optimize input variables. The SHAP analysis identified the water-to-binder (W/B) ratio, superplasticizer, and gravel as the most significant factors influencing the CS of PAC. Furthermore, a graphical user interface (GUI) has been developed for practical applications in predicting concrete strength. This simplifies the process and offers a valuable tool for leveraging the model's potential in the field of civil engineering.
This comprehensive evaluation provides valuable insights to researchers and practitioners, empowering them to make informed choices in predicting PAC compressive strength in construction projects. By enhancing the reliability and applicability of predictive models, this study contributes to the field of preplaced aggregate concrete strength prediction.
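The four scores quoted in this abstract (correlation coefficient, coefficient of determination, MAE, and RMSE) are all computed from paired observed and predicted strengths. The sketch below is illustrative only: the function name `regression_metrics` and the four sample values are assumptions, not code or data from the paper, and the ordinary (not "normalized") R² is computed.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return Pearson R, R^2, MAE and RMSE for paired observations."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    errors = [p - t for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    # R^2 compares the residual error with the spread of the observations.
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    # Pearson correlation between observed and predicted values.
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    r = cov / math.sqrt(ss_tot * var_p)

    return {"R": r, "R2": r2, "MAE": mae, "RMSE": rmse}


# Hypothetical compressive strengths in MPa (made-up values for illustration).
y_true = [20.0, 30.0, 40.0, 50.0]
y_pred = [22.0, 28.0, 41.0, 49.0]
metrics = regression_metrics(y_true, y_pred)
```

R measures only how linearly the predictions track the observations, while R² also penalizes systematic offsets, which is why both are often reported side by side.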
Affiliation(s)
- Muhammad Faisal Javed
  - Department of Civil Engineering, Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi, Swabi, 23640, Pakistan
- Muhammad Fawad
  - Silesian University of Technology Poland, Gliwice, Poland
  - Budapest University of Technology and Economics Hungary, Budapest, Hungary
- Rida Lodhi
  - Department of Urban and Regional Planning, National University of Sciences and Technology (NUST), Islamabad, Pakistan
- Taoufik Najeh
  - Operation and Maintenance, Operation, Maintenance and Acoustics, Department of Civil, Environmental and Natural Resources Engineering, Lulea University of Technology, Lulea, Sweden
- Yaser Gamil
  - Department of Civil Engineering, School of Engineering, Monash University Malaysia, Jalan Lagoon Selatan, 47500, Bandar Sunway, Selangor, Malaysia
2
Borisov N, Tkachev V, Simonov A, Sorokin M, Kim E, Kuzmin D, Karademir-Yilmaz B, Buzdin A. Uniformly shaped harmonization combines human transcriptomic data from different platforms while retaining their biological properties and differential gene expression patterns. Front Mol Biosci 2023; 10:1237129. [PMID: 37745690 PMCID: PMC10511763 DOI: 10.3389/fmolb.2023.1237129] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/08/2023] [Accepted: 08/28/2023] [Indexed: 09/26/2023] Open
Abstract
Introduction: Co-normalization of RNA profiles obtained using different experimental platforms and protocols opens an avenue for comprehensive comparison of relevant features such as differentially expressed genes associated with disease. Currently, most bioinformatic tools enable normalization in a flexible format that depends on the individual datasets under analysis. Thus, the output data of such normalizations are poorly compatible with each other. Recently, we proposed a new approach to gene expression data normalization termed Shambhala, which returns harmonized data in a uniform shape, where every expression profile is transformed into a pre-defined universal format. We previously showed that following shambhalization of human RNA profiles, overall tissue-specific clustering features are strongly retained while platform-specific clustering is dramatically reduced. Methods: Here, we tested Shambhala performance in retention of fold-change gene expression features and other functional characteristics of gene clusters such as pathway activation levels and predicted cancer drug activity scores. Results: Using 6,793 cancer and 11,135 normal tissue gene expression profiles from the literature and experimental datasets, we applied twelve performance criteria for different versions of Shambhala and other methods of transcriptomic harmonization with flexible output data format. Such criteria dealt with the biological type classifiers, hierarchical clustering, correlation/regression properties, stability of drug efficiency scores, and data quality for using machine learning classifiers. Discussion: The Shambhala-2 harmonizer demonstrated the best results, with correlation and linear regression coefficients close to 1 for the comparison of training vs validation datasets and less than half the instability in calculated drug efficiency scores compared to other methods.
Affiliation(s)
- Nicolas Borisov
  - Omicsway Corp, Walnut, CA, United States
  - Moscow Institute of Physics and Technology, Dolgoprudny, Russia
- Alexander Simonov
  - Moscow Institute of Physics and Technology, Dolgoprudny, Russia
  - Oncobox Ltd., Moscow, Russia
- Maxim Sorokin
  - Moscow Institute of Physics and Technology, Dolgoprudny, Russia
  - Oncobox Ltd., Moscow, Russia
  - World-Class Research Center “Digital Biodesign and Personalized Healthcare”, Sechenov First Moscow State Medical University, Moscow, Russia
- Ella Kim
  - Clinic for Neurosurgery, Laboratory of Experimental Neurooncology, Johannes Gutenberg University Medical Centre, Mainz, Germany
- Denis Kuzmin
  - Moscow Institute of Physics and Technology, Dolgoprudny, Russia
- Betul Karademir-Yilmaz
  - Department of Biochemistry, School of Medicine/Genetic and Metabolic Diseases Research and Investigation Center (GEMHAM) Marmara University, Istanbul, Türkiye
- Anton Buzdin
  - Moscow Institute of Physics and Technology, Dolgoprudny, Russia
  - World-Class Research Center “Digital Biodesign and Personalized Healthcare”, Sechenov First Moscow State Medical University, Moscow, Russia
  - Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Moscow, Russia
  - PathoBiology Group, European Organization for Research and Treatment of Cancer (EORTC), Brussels, Belgium
3
Bailey R, Sarkar A, Singh A, Dobra A, Kahveci T. Optimal Supervised Reduction of High Dimensional Transcription Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:3093-3105. [PMID: 37276117 DOI: 10.1109/tcbb.2023.3280557] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2023]
Abstract
The plight of navigating high-dimensional transcription datasets remains a persistent problem. This problem is further amplified for complex disorders such as cancer, as these disorders are often multigenic traits with multiple subsets of genes collectively affecting the type, stage, and severity of the trait. We are often faced with a trade-off between reducing the dimensionality of our datasets and maintaining the integrity of our data. To accomplish both tasks simultaneously for very high-dimensional transcriptome data for complex multigenic traits, we propose a new supervised technique, Class Separation Transformation (CST). CST accomplishes both tasks by reducing the dimensionality of the input space to a one-dimensional transformed space that provides optimal separation between the differing classes. Furthermore, CST offers a means of explainable ML, as it computes the relative importance of each feature for its contribution to class distinction, which can lead to deeper insights and discovery. We compare our method with existing state-of-the-art methods using both real and synthetic datasets, demonstrating that CST is more accurate, robust, scalable, and computationally advantageous than existing methods. Code used in this paper is available at https://github.com/richiebailey74/CST.
4
Wang Q, Runhaar J, Kloppenburg M, Boers M, Bijlsma JWJ, Bacardit J, Bierma-Zeinstra SMA. A machine learning approach reveals features related to clinicians' diagnosis of clinically relevant knee osteoarthritis. Rheumatology (Oxford) 2023; 62:2732-2739. [PMID: 36534939 DOI: 10.1093/rheumatology/keac707] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2022] [Accepted: 12/09/2022] [Indexed: 08/03/2023] Open
Abstract
OBJECTIVES To identify highly ranked features related to clinicians' diagnosis of clinically relevant knee OA. METHODS General practitioners (GPs) and secondary care physicians (SPs) were recruited to evaluate 5-10 years follow-up clinical and radiographic data of knees from the CHECK cohort for the presence of clinically relevant OA. GPs and SPs were gathered in pairs; each pair consisted of one GP and one SP, and the paired clinicians independently evaluated the same subset of knees. A diagnosis was made for each knee by the GP and SP before and after viewing radiographic data. Nested 5-fold cross-validation enhanced random forest models were built to identify the top 10 features related to the diagnosis. RESULTS Seventeen clinician pairs evaluated 1106 knees with 139 clinical and 36 radiographic features. GPs diagnosed clinically relevant OA in 42% and 43% knees, before and after viewing radiographic data, respectively. SPs diagnosed in 43% and 51% knees, respectively. Models containing top 10 features had good performance for explaining clinicians' diagnosis with area under the curve ranging from 0.76-0.83. Before viewing radiographic data, quantitative symptomatic features (i.e. WOMAC scores) were the most important ones related to the diagnosis of both GPs and SPs; after viewing radiographic data, radiographic features appeared in the top lists for both, but seemed to be more important for SPs than GPs. CONCLUSIONS Random forest models presented good performance in explaining clinicians' diagnosis, which helped to reveal typical features of patients recognized as clinically relevant knee OA by clinicians from two different care settings.
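The "nested 5-fold cross-validation" mentioned in the methods can be pictured as two levels of splitting: an outer loop that holds out a test fold for the final score, and an inner loop that re-splits only the remaining data for tuning. The following is a minimal sketch under stated assumptions: the helper names `k_fold_indices` and `nested_cv_splits` are hypothetical, the folds are deterministic and unshuffled, and no model is trained. It is not the authors' implementation.

```python
def k_fold_indices(indices, k):
    """Yield (train, test) index lists for k folds of near-equal size."""
    folds = [indices[i::k] for i in range(k)]  # interleaved, deterministic folds
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def nested_cv_splits(n_samples, outer_k=5, inner_k=5):
    """Yield (outer_train, outer_test, inner_splits) for nested cross-validation."""
    all_indices = list(range(n_samples))
    for outer_train, outer_test in k_fold_indices(all_indices, outer_k):
        # The inner folds are built from outer_train only, so the outer test
        # fold never influences hyperparameter or feature selection.
        inner_splits = list(k_fold_indices(outer_train, inner_k))
        yield outer_train, outer_test, inner_splits


# Hypothetical sample count chosen for illustration.
splits = list(nested_cv_splits(50))
```

In practice the indices would usually be shuffled first, and a library routine would normally replace these helpers; the point here is only the separation between the two loops.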
Affiliation(s)
- Qiuke Wang
  - Department of General Practice, Erasmus MC University Center Rotterdam, Rotterdam, The Netherlands
- Jos Runhaar
  - Department of General Practice, Erasmus MC University Center Rotterdam, Rotterdam, The Netherlands
- Margreet Kloppenburg
  - Department of Rheumatology, Leiden University Medical Center, Leiden, The Netherlands
- Maarten Boers
  - Department of Epidemiology and Biostatistics, Amsterdam UMC, Amsterdam, The Netherlands
- Johannes W J Bijlsma
  - Department of Rheumatology and Clinical Immunology, University Medical Centre Utrecht, Utrecht, The Netherlands
- Jaume Bacardit
  - School of Computing, Newcastle University, Newcastle, UK
- Sita M A Bierma-Zeinstra
  - Department of General Practice, Erasmus MC University Center Rotterdam, Rotterdam, The Netherlands
  - Department of Orthopaedics and Sport Medicine, Erasmus MC University Center Rotterdam, Rotterdam, The Netherlands
5
Chiu Y, Ni C, Huang Y. Deconvolution of bulk gene expression profiles reveals the association between immune cell polarization and the prognosis of hepatocellular carcinoma patients. Cancer Med 2023; 12:15736-15760. [PMID: 37366298 PMCID: PMC10417088 DOI: 10.1002/cam4.6197] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/13/2022] [Revised: 05/02/2023] [Accepted: 05/23/2023] [Indexed: 06/28/2023] Open
Abstract
BACKGROUND Many studies have utilized computational methods, including cell composition deconvolution (CCD), to correlate immune cell polarizations with the survival of cancer patients, including those with hepatocellular carcinoma (HCC). However, currently available cell deconvolution estimated (CDE) tools do not cover the wide range of immune cell changes that are known to influence tumor progression. RESULTS A new CCD tool, HCCImm, was designed to estimate the abundance of tumor cells and 16 immune cell types in the bulk gene expression profiles of HCC samples. HCCImm was validated using real datasets derived from human peripheral blood mononuclear cells (PBMCs) and HCC tissue samples, demonstrating that HCCImm outperforms other CCD tools. We used HCCImm to analyze the bulk RNA-seq datasets of The Cancer Genome Atlas (TCGA)-liver hepatocellular carcinoma (LIHC) samples. We found that the proportions of memory CD8+ T cells and Tregs were negatively associated with patient overall survival (OS). Furthermore, the proportion of naïve CD8+ T cells was positively associated with patient OS. In addition, the TCGA-LIHC samples with a high tumor mutational burden had a significantly high abundance of nonmacrophage leukocytes. CONCLUSIONS HCCImm was equipped with a new set of reference gene expression profiles that allowed for a more robust analysis of HCC patient expression data. The source code is provided at https://github.com/holiday01/HCCImm.
Affiliation(s)
- Yen‐Jung Chiu
  - Institute of Biomedical Informatics, National Yang Ming Chiao Tung University, Taipei, Taiwan
  - Department of Biomedical Engineering, Ming Chuan University, Taoyuan, Taiwan
- Chung‐En Ni
  - Institute of Biomedical Informatics, National Yang Ming Chiao Tung University, Taipei, Taiwan
- Yen‐Hua Huang
  - Institute of Biomedical Informatics, National Yang Ming Chiao Tung University, Taipei, Taiwan
  - Center for Systems and Synthetic Biology, National Yang Ming Chiao Tung University, Taipei, Taiwan
6
Shams B, Reisch K, Vajkoczy P, Lippert C, Picht T, Fekonja LS. Improved prediction of glioma-related aphasia by diffusion MRI metrics, machine learning, and automated fiber bundle segmentation. Hum Brain Mapp 2023. [PMID: 37318944 PMCID: PMC10365236 DOI: 10.1002/hbm.26393] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/20/2023] [Revised: 05/07/2023] [Accepted: 05/26/2023] [Indexed: 06/17/2023] Open
Abstract
White matter impairments caused by gliomas can lead to functional disorders. In this study, we predicted aphasia in patients with gliomas infiltrating the language network using machine learning methods. We included 78 patients with left-hemispheric perisylvian gliomas. Aphasia was graded preoperatively using the Aachen aphasia test (AAT). Subsequently, we created bundle segmentations based on automatically generated tract orientation mappings using TractSeg. To prepare the input for the support vector machine (SVM), we first preselected aphasia-related fiber bundles based on the associations between relative tract volumes and AAT subtests. In addition, diffusion magnetic resonance imaging (dMRI)-based metrics [axial diffusivity (AD), apparent diffusion coefficient (ADC), fractional anisotropy (FA), and radial diffusivity (RD)] were extracted within the fiber bundles' masks with their mean, standard deviation, kurtosis, and skewness values. Our model consisted of random forest-based feature selection followed by an SVM. The best model performance achieved 81% accuracy (specificity = 85%, sensitivity = 73%, and AUC = 85%) using dMRI-based features, demographics, tumor WHO grade, tumor location, and relative tract volumes. The most effective features resulted from the arcuate fasciculus (AF), middle longitudinal fasciculus (MLF), and inferior fronto-occipital fasciculus (IFOF). The most effective dMRI-based metrics were FA, ADC, and AD. We achieved a prediction of aphasia using dMRI-based features and demonstrated that AF, IFOF, and MLF were the most important fiber bundles for predicting aphasia in this cohort.
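The accuracy, sensitivity, and specificity reported above are all derived from the same four confusion-matrix counts. The sketch below shows that relationship; the function name `binary_metrics` and the ten labels are hypothetical and are not taken from the study's 78 patients.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity and specificity for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # aphasia correctly flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # no aphasia, correctly cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of true positives that were found
        "specificity": tn / (tn + fp),  # share of true negatives that were found
    }


# Made-up labels: 1 = aphasia present, 0 = absent.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
scores = binary_metrics(y_true, y_pred)
```

Because accuracy is a prevalence-weighted mix of the other two, a model can score well on accuracy while missing many positive cases, which is why all three are usually reported together.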
Affiliation(s)
- Boshra Shams
  - Department of Neurosurgery, Charité - Universitätsmedizin Berlin, Berlin, Germany
  - Cluster of Excellence: "Matters of Activity. Image Space Material", Humboldt University, Berlin, Germany
- Klara Reisch
  - Department of Neurosurgery, Charité - Universitätsmedizin Berlin, Berlin, Germany
- Peter Vajkoczy
  - Department of Neurosurgery, Charité - Universitätsmedizin Berlin, Berlin, Germany
- Christoph Lippert
  - Digital Health - Machine Learning, Hasso Plattner Institute, University of Potsdam, Digital Engineering Faculty, Potsdam, Germany
  - Hasso Plattner Institute for Digital Health, Icahn School of Medicine at Mount Sinai, New York, USA
- Thomas Picht
  - Department of Neurosurgery, Charité - Universitätsmedizin Berlin, Berlin, Germany
  - Cluster of Excellence: "Matters of Activity. Image Space Material", Humboldt University, Berlin, Germany
- Lucius S Fekonja
  - Department of Neurosurgery, Charité - Universitätsmedizin Berlin, Berlin, Germany
  - Cluster of Excellence: "Matters of Activity. Image Space Material", Humboldt University, Berlin, Germany
7
Liu Z, Zhang T, Lin L, Long F, Guo H, Han L. Applications of radiomics-based analysis pipeline for predicting epidermal growth factor receptor mutation status. Biomed Eng Online 2023; 22:17. [PMID: 36810090 PMCID: PMC9945395 DOI: 10.1186/s12938-022-01049-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/02/2022] [Accepted: 11/04/2022] [Indexed: 02/24/2023] Open
Abstract
BACKGROUND This study aimed to develop a pipeline for selecting the best feature engineering-based radiomic path to predict epidermal growth factor receptor (EGFR) mutant lung adenocarcinoma in 18F-fluorodeoxyglucose (FDG) positron emission tomography/computed tomography (PET/CT). METHODS The study enrolled 115 lung adenocarcinoma patients with EGFR mutation status between June 2016 and September 2017. We extracted radiomics features by delineating regions-of-interest around the entire tumor in 18F-FDG PET/CT images. The feature engineering-based radiomic paths were built by combining various methods of data scaling, feature selection, and predictive model-building. Next, a pipeline was developed to select the best path. RESULTS In the paths from CT images, the highest accuracy was 0.907 (95% confidence interval [CI]: 0.849, 0.966), the highest area under curve (AUC) was 0.917 (95% CI: 0.853, 0.981), and the highest F1 score was 0.908 (95% CI: 0.842, 0.974). In the paths based on PET images, the highest accuracy was 0.913 (95% CI: 0.863, 0.963), the highest AUC was 0.960 (95% CI: 0.926, 0.995), and the highest F1 score was 0.878 (95% CI: 0.815, 0.941). Additionally, a novel evaluation metric was developed to evaluate the comprehensive level of the models. Some feature engineering-based radiomic paths obtained promising results. CONCLUSIONS The pipeline is capable of selecting the best feature engineering-based radiomic path. Combining various feature engineering-based radiomic paths makes it possible to compare their performance and identify the paths built with the most appropriate methods to predict EGFR-mutant lung adenocarcinoma in 18F-FDG PET/CT.
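Building "paths" by combining scaling, feature selection, and model choices amounts to enumerating a Cartesian product and then ranking the combinations. The sketch below illustrates that idea only. The method names in the three lists are placeholders (the abstract does not name the methods used), `select_best` is a hypothetical helper, and the toy scores stand in for a real cross-validated metric.

```python
from itertools import product

# Placeholder method names; the abstract does not list the actual choices.
scalers = ["min-max", "z-score"]
selectors = ["variance-threshold", "lasso", "mutual-information"]
models = ["logistic-regression", "svm", "random-forest"]

# Every path is one (scaler, selector, model) combination.
paths = list(product(scalers, selectors, models))


def select_best(candidate_paths, score_fn):
    """Return the path with the highest score under score_fn."""
    return max(candidate_paths, key=score_fn)


# Toy scores keyed by path, used only to make the example runnable.
# A real pipeline would fit each path and return, for example, its AUC.
toy_scores = {path: rank for rank, path in enumerate(paths)}
best_path = select_best(paths, lambda path: toy_scores[path])
```

The number of paths grows multiplicatively with each added option, so with a cohort of this size the selection step itself needs held-out data to avoid picking a path that merely fits noise.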
Affiliation(s)
- Zefeng Liu
  - Department of Radiology, Tianjin Medical University General Hospital, Tianjin, 300052 People’s Republic of China
- Tianyou Zhang
  - Department of Radiology, Tianjin Medical University General Hospital, Tianjin, 300052 People’s Republic of China
- Liying Lin
  - First Central Clinical College, Tianjin Medical University, 22 Qixiangtai Road, Heping District, Tianjin, 300070 People’s Republic of China
- Fenghua Long
  - Department of Radiology, Institute of Hematology and Blood Diseases Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Tianjin, 300041 People’s Republic of China
- Hongyu Guo
  - Department of Radiology, Tianjin Medical University General Hospital, Tianjin, 300052, People's Republic of China
- Li Han
  - School of Medical Imaging, Tianjin Medical University, 9-307, Guangdong Rd. #1, Hexi, Tianjin, 300203, People's Republic of China
  - Department of Radiology, University of Michigan, Ann Arbor, Michigan, 48109, USA
8
Kheyfets VO, Sweatt AJ, Gomberg-Maitland M, Ivy DD, Condliffe R, Kiely DG, Lawrie A, Maron BA, Zamanian RT, Stenmark KR. Computational platform for doctor-artificial intelligence cooperation in pulmonary arterial hypertension prognostication: a pilot study. ERJ Open Res 2023; 9:00484-2022. [PMID: 36776484 PMCID: PMC9907150 DOI: 10.1183/23120541.00484-2022] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 09/24/2022] [Accepted: 10/20/2022] [Indexed: 11/25/2022] Open
Abstract
Background Pulmonary arterial hypertension (PAH) is a heterogeneous and complex pulmonary vascular disease associated with substantial morbidity. Machine-learning algorithms (used in many PAH risk calculators) can combine established parameters with thousands of circulating biomarkers to optimise PAH prognostication, but these approaches do not offer the clinician insight into what parameters drove the prognosis. The approach proposed in this study diverges from other contemporary phenotyping methods by identifying patient-specific parameters driving clinical risk. Methods We trained a random forest algorithm to predict 4-year survival risk in a cohort of 167 adult PAH patients evaluated at Stanford University, with 20% withheld for (internal) validation. Another cohort of 38 patients from Sheffield University were used as a secondary (external) validation. Shapley values, borrowed from game theory, were computed to rank the input parameters based on their importance to the predicted risk score for the entire trained random forest model (global importance) and for an individual patient (local importance). Results Between the internal and external validation cohorts, the random forest model predicted 4-year risk of death/transplant with sensitivity and specificity of 71.0-100% and 81.0-89.0%, respectively. The model reinforced the importance of established prognostic markers, but also identified novel inflammatory biomarkers that predict risk in some PAH patients. Conclusion These results stress the need for advancing individualised phenotyping strategies that integrate clinical and biochemical data with outcome. The computational platform presented in this study offers a critical step towards personalised medicine in which a clinician can interpret an algorithm's assessment of an individual patient.
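To make the "local importance" idea concrete, the sketch below computes exact Shapley values for one input of a small toy function by averaging each feature's marginal contribution over all subsets of the other features. It rests on several assumptions: `risk_model` is a made-up three-feature function rather than the study's random forest, "absent" features are replaced by a baseline value (one common convention among several), and exact enumeration is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a baseline input."""
    n = len(x)

    def value(subset):
        # Features in the subset take the patient's values; the rest use the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Standard Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def risk_model(z):
    """Toy risk score with one interaction term (illustrative only)."""
    return 0.5 * z[0] + 2.0 * z[1] - 1.0 * z[2] + 0.3 * z[0] * z[1]


x = [2.0, 1.0, 3.0]          # hypothetical patient
baseline = [0.0, 0.0, 0.0]   # hypothetical reference input
phi = shapley_values(risk_model, x, baseline)
```

The values sum to the difference between the prediction for `x` and the prediction for the baseline, and the interaction term is split equally between the two features involved. A global ranking of the kind described in the abstract can then be formed by averaging the absolute local values over many patients.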
Affiliation(s)
- Vitaly O. Kheyfets
  - Paediatric Critical Care Medicine, Developmental Lung Biology and CVP Research Laboratories, School of Medicine, University of Colorado, Aurora, CO, USA
- Andrew J. Sweatt
  - Division of Pulmonary and Critical Care Medicine, Stanford University, Stanford, CA, USA
  - Vera Moulton Wall Center for Pulmonary Vascular Disease, Stanford University, Stanford, CA, USA
- Dunbar D. Ivy
  - Department of Paediatric Cardiology, Children's Hospital Colorado, Aurora, CO, USA
- Robin Condliffe
  - Sheffield Pulmonary Vascular Disease Unit, Sheffield Teaching Hospitals NHS Foundation Trust, Royal Hallamshire Hospital, Sheffield, UK
- David G. Kiely
  - Sheffield Pulmonary Vascular Disease Unit, Sheffield Teaching Hospitals NHS Foundation Trust, Royal Hallamshire Hospital, Sheffield, UK
  - Department of Infection, Immunity and Cardiovascular Disease, University of Sheffield, Sheffield, UK
  - Insigneo Institute for in-silico Medicine, University of Sheffield, Sheffield, UK
- Allan Lawrie
  - Sheffield Pulmonary Vascular Disease Unit, Sheffield Teaching Hospitals NHS Foundation Trust, Royal Hallamshire Hospital, Sheffield, UK
  - Department of Infection, Immunity and Cardiovascular Disease, University of Sheffield, Sheffield, UK
  - Insigneo Institute for in-silico Medicine, University of Sheffield, Sheffield, UK
- Bradley A. Maron
  - Division of Cardiovascular Medicine, Brigham and Women's Hospital and Harvard Medical School, Harvard University, Boston, MA, USA
- Roham T. Zamanian
  - Division of Pulmonary and Critical Care Medicine, Stanford University, Stanford, CA, USA
  - Vera Moulton Wall Center for Pulmonary Vascular Disease, Stanford University, Stanford, CA, USA
- Kurt R. Stenmark
  - Paediatric Critical Care Medicine, Developmental Lung Biology and CVP Research Laboratories, School of Medicine, University of Colorado, Aurora, CO, USA
9
Mapping Mediterranean maquis formations using Sentinel-2 time-series. Ecol Inform 2022. [DOI: 10.1016/j.ecoinf.2022.101814] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
10
Borisov N, Buzdin A. Transcriptomic Harmonization as the Way for Suppressing Cross-Platform Bias and Batch Effect. Biomedicines 2022; 10:2318. [PMID: 36140419 PMCID: PMC9496268 DOI: 10.3390/biomedicines10092318] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 08/22/2022] [Revised: 09/14/2022] [Accepted: 09/16/2022] [Indexed: 11/16/2022] Open
Abstract
(1) Background: The emergence of methods interrogating gene expression at high throughput gave birth to quantitative transcriptomics, but also raised the problem of inter-comparing expression profiles obtained using different equipment and protocols and/or in different series of experiments. Addressing this issue is challenging, because all of the above variables can dramatically influence gene expression signals and, therefore, cause a plethora of peculiar features in the transcriptomic profiles. Millions of transcriptomic profiles have been obtained and deposited in public databases, but their usefulness is strongly limited by the inter-comparison issues; (2) Methods: Dozens of methods and software packages that can be generally classified as either flexible or predefined format harmonizers have been proposed, but none has yet become the gold standard for unification of this type of Big Data; (3) Results: However, recent developments show that platform/protocol/batch bias can be efficiently reduced, and not only for comparisons of limited transcriptomic datasets: instruments have been proposed for transforming gene expression profiles into a universal, uniformly shaped format that can support multiple inter-comparisons at reasonable computational cost. This forms a basis for universal indexing of all or most types of RNA sequencing and microarray hybridization profiles; (4) Conclusions: In this paper, we review the landscape of modern approaches and methods in transcriptomic harmonization and focus on the practical aspects of their application.
Affiliation(s)
- Nicolas Borisov
  - World-Class Research Center “Digital Biodesign and Personalized Healthcare”, Sechenov First Moscow State Medical University, 119435 Moscow, Russia
  - Moscow Institute of Physics and Technology, 141701 Dolgoprudny, Russia
- Anton Buzdin
  - World-Class Research Center “Digital Biodesign and Personalized Healthcare”, Sechenov First Moscow State Medical University, 119435 Moscow, Russia
  - Moscow Institute of Physics and Technology, 141701 Dolgoprudny, Russia
  - Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, 117997 Moscow, Russia
  - PathoBiology Group, European Organization for Research and Treatment of Cancer (EORTC), 1200 Brussels, Belgium
11
Soriano MA, Deziel NC, Saiers JE. Regional Scale Assessment of Shallow Groundwater Vulnerability to Contamination from Unconventional Hydrocarbon Extraction. Environmental Science & Technology 2022; 56:12126-12136. [PMID: 35960643 PMCID: PMC9454823 DOI: 10.1021/acs.est.2c00470] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 01/20/2022] [Revised: 07/29/2022] [Accepted: 08/01/2022] [Indexed: 05/19/2023]
Abstract
Concerns over unconventional oil and gas (UOG) development persist, especially in rural communities that rely on shallow groundwater for drinking and other domestic purposes. Given the continued expansion of the industry, regional (vs local scale) models are needed to characterize groundwater contamination risks faced by the increasing proportion of the population residing in areas that accommodate UOG extraction. In this paper, we evaluate groundwater vulnerability to contamination from surface spills and shallow subsurface leakage of UOG wells within a 104,000 km² region in the Appalachian Basin, northeastern USA. We test a computationally efficient ensemble approach for simulating groundwater flow and contaminant transport processes to quantify vulnerability with high resolution. We also examine metamodels, or machine learning models trained to emulate physically based models, and investigate their spatial transferability. We identify predictors describing proximity to UOG, hydrology, and topography that are important for metamodels to make accurate vulnerability predictions outside their training regions. Using our approach, we estimate that 21,000-30,000 individuals in our study area are dependent on domestic water wells that are vulnerable to contamination from UOG activities. Our novel modeling framework could be used to guide groundwater monitoring, provide information for public health studies, and assess environmental justice issues.
Affiliation(s)
- Mario A. Soriano
  - School of the Environment, Yale University, New Haven, Connecticut 06511, United States
- Nicole C. Deziel
  - School of Public Health, Yale University, New Haven, Connecticut 06510, United States
- James E. Saiers
  - School of the Environment, Yale University, New Haven, Connecticut 06511, United States
12
Huang HH, Rao H, Miao R, Liang Y. A novel meta-analysis based on data augmentation and elastic data shared lasso regularization for gene expression. BMC Bioinformatics 2022; 23:353. [PMID: 35999505 PMCID: PMC9396780 DOI: 10.1186/s12859-022-04887-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/08/2022] [Accepted: 08/10/2022] [Indexed: 12/22/2022] Open
Abstract
Background Gene expression analysis can provide useful information for analyzing complex biological mechanisms. However, many reported findings are unrepeatable due to small sample sizes relative to a large number of genes and the low signal-to-noise ratios of most gene expression datasets. Results Meta-analysis of multi-data sets is an efficient method for tackling the above problem. To improve the performance of meta-analysis, we propose a novel meta-analysis framework. It consists of two parts: (1) a novel data augmentation strategy. Various cross-platform normalization methods exist, which can preserve original biological information of gene expression datasets from different angles and add different “perturbations” to the dataset. Using such perturbation, we provide a feasible means for gene expression data augmentation; (2) elastic data shared lasso (DSL-L2). The DSL-L2 method spans the continuum between individual models for each dataset and one model for all datasets. It also overcomes the shortcomings of the data shared lasso method when dealing with highly correlated features. Comprehensive simulation experiment results show that the proposed method has high prediction and gene selection performance. We then apply the proposed method to non-small cell lung cancer (NSCLC) blood gene expression data in order to identify key tumor-related genes. The outcomes of our experiment indicate that the method could be used for identifying a set of robust disease-related gene signatures that may be used for NSCLC early diagnosis or prognosis or even targeting. Conclusion We propose a novel and effective meta-analysis method for biological research, extrapolating and integrating information from multiple gene expression datasets.
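For orientation, the "continuum" described in this abstract can be sketched with the data shared lasso objective plus a ridge term. This is an illustrative form only: the symbols (shared coefficients \beta, per-dataset offsets \Delta_g, weights r_g, tuning parameters \lambda_1 and \lambda_2) are our notation, and the way the L2 term enters is an assumption, not the exact penalty defined in the cited paper.

```latex
% Illustrative sketch, not the paper's exact formulation.
% g_i is the dataset that sample i belongs to; G is the number of datasets.
\min_{\beta,\,\Delta_1,\dots,\Delta_G}\;
  \frac{1}{2}\sum_{i=1}^{n}\bigl(y_i - x_i^{\top}(\beta + \Delta_{g_i})\bigr)^2
  + \lambda_1\Bigl(\lVert\beta\rVert_1 + \sum_{g=1}^{G} r_g \lVert\Delta_g\rVert_1\Bigr)
  + \lambda_2\Bigl(\lVert\beta\rVert_2^2 + \sum_{g=1}^{G} \lVert\Delta_g\rVert_2^2\Bigr)
% The \lambda_2 term is the assumed "elastic" addition.
```

Under this reading, large weights r_g push every \Delta_g to zero and leave one model for all datasets, while small r_g let each dataset carry its own coefficients, which approaches fitting individual models.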
Affiliation(s)
- Hai-Hui Huang
  - Provincial Demonstration Software Institute, Shaoguan University, Shaoguan, China
- Hao Rao
  - Provincial Demonstration Software Institute, Shaoguan University, Shaoguan, China
- Rui Miao
  - Faculty of Information Technology, Macau University of Science and Technology, Macau, China
- Yong Liang
  - The Peng Cheng Laboratory, Shenzhen, China
|
13
|
Invasion success of a freshwater fish corresponds to low dissolved oxygen and diminished riparian integrity. Biol Invasions 2022. [DOI: 10.1007/s10530-022-02827-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/02/2022]
|
14
|
Ganaie M, Tanveer M, Suganthan P, Snasel V. Oblique and rotation double random forest. Neural Netw 2022; 153:496-517. [DOI: 10.1016/j.neunet.2022.06.012] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/05/2021] [Revised: 05/25/2022] [Accepted: 06/09/2022] [Indexed: 10/18/2022]
|
15
|
|
16
|
Borisov N, Sorokin M, Zolotovskaya M, Borisov C, Buzdin A. Shambhala-2: A Protocol for Uniformly Shaped Harmonization of Gene Expression Profiles of Various Formats. Curr Protoc 2022; 2:e444. [PMID: 35617464 DOI: 10.1002/cpz1.444] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 12/11/2022]
Abstract
Uniformly shaped harmonization of gene expression profiles is central for the simultaneous comparison of multiple gene expression datasets. It is expected to operate with the gene expression data obtained using various experimental methods and equipment, and to return harmonized profiles in a uniform shape. Such uniformly shaped expression profiles from different initial datasets can be further compared directly. However, current harmonization techniques have strong limitations that prevent their broad use for bioinformatic applications. They can either operate with only up to two datasets/platforms or return data in a dynamic format that will be different for every comparison under analysis. This also does not allow for adding new data to the previously harmonized dataset(s), which complicates the analysis and increases calculation costs. We propose here a new method termed Shambhala-2 that can transform multi-platform expression data into a universal format that is identical for all harmonizations made using this technique. Shambhala-2 is based on sample-by-sample cubic conversion of the initial expression dataset into a preselected shape of the reference definitive dataset. Using 8390 samples of 12 healthy human tissue types and 4086 samples of colorectal, kidney, and lung cancer tissues, we verified Shambhala-2's capacity in restoring tissue-specific expression patterns for seven microarray and three RNA sequencing platforms. Shambhala-2 performed well for all tested combinations of RNAseq and microarray profiles, and retained gene-expression ranks, as evidenced by high correlations between different single- or aggregated gene expression metrics in pre- and post-Shambhalized samples, including preserving cancer-specific gene expression and pathway activation features. © 2022 Wiley Periodicals LLC. 
Basic Protocol: Shambhala-2 harmonizer
Alternate Protocol 1: Linear Shambhala/Shambhala-1
Alternate Protocol 2: Alternative (flexible-format and uniformly shaped) normalization methods
Support Protocol 1: Watermelon multisection (WM)
Support Protocol 2: Calculation of cancer-to-normal log-fold-change (LFC) and pathway activation level (PAL)
Affiliation(s)
- Nicolas Borisov
  - Omicsway Corp., Walnut, California
  - Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region, Russia
- Maksim Sorokin
  - Omicsway Corp., Walnut, California
  - Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region, Russia
  - I.M. Sechenov First Moscow State Medical University, Moscow, Russia
- Marianna Zolotovskaya
  - Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region, Russia
  - Oncobox Ltd., Moscow, Russia
- Anton Buzdin
  - Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region, Russia
  - Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Moscow, Russia
  - World-Class Research Center "Digital biodesign and personalized healthcare", Sechenov First Moscow State Medical University, Moscow, Russia
  - PathoBiology Group, European Organization for Research and Treatment of Cancer (EORTC), Brussels, Belgium
|
17
|
Guragain P, Båtnes AS, Zobolas J, Olsen Y, Bones AM, Winge P. IIb-RAD-sequencing coupled with random forest classification indicates regional population structuring and sex-specific differentiation in salmon lice (Lepeophtheirus salmonis). Ecol Evol 2022; 12:e8809. [PMID: 35414904 PMCID: PMC8986551 DOI: 10.1002/ece3.8809] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/04/2022] [Revised: 03/18/2022] [Accepted: 03/22/2022] [Indexed: 11/29/2022]
Abstract
The aquaculture industry has been dealing with salmon lice problems forming serious threats to salmonid farming. Several treatment approaches have been used to control the parasite. Treatment effectiveness must be optimized, and the systematic genetic differences between subpopulations must be studied to monitor louse species and enhance targeted control measures. We have used IIb-RAD sequencing in tandem with a random forest classification algorithm to detect the regional genetic structure of the Norwegian salmon lice and identify important markers for sex differentiation of this species. We identified 19,428 single nucleotide polymorphisms (SNPs) from 95 individuals of salmon lice. These SNPs, however, were not able to distinguish the differential structure of lice populations. Using the random forest algorithm, we selected 91 SNPs important for geographical classification and 14 SNPs important for sex classification. The geographically important SNP data substantially improved the genetic understanding of the population structure and classified regional demographic clusters along the Norwegian coast. We also uncovered SNP markers that could help determine the sex of the salmon louse. A large portion of the SNPs identified to be under directional selection was also ranked highly important by random forest. According to our findings, there is a regional population structure of salmon lice associated with the geographical location along the Norwegian coastline.
Affiliation(s)
- Prashanna Guragain
  - Cell, Molecular Biology and Genomics Group, Department of Biology, Norwegian University of Science and Technology, Trondheim, Norway
  - Taskforce Salmon Lice, Department of Biology, Norwegian University of Science and Technology, Trondheim, Norway
- Anna Solvang Båtnes
  - Taskforce Salmon Lice, Department of Biology, Norwegian University of Science and Technology, Trondheim, Norway
- John Zobolas
  - Cell, Molecular Biology and Genomics Group, Department of Biology, Norwegian University of Science and Technology, Trondheim, Norway
- Yngvar Olsen
  - Taskforce Salmon Lice, Department of Biology, Norwegian University of Science and Technology, Trondheim, Norway
- Atle M. Bones
  - Cell, Molecular Biology and Genomics Group, Department of Biology, Norwegian University of Science and Technology, Trondheim, Norway
  - Taskforce Salmon Lice, Department of Biology, Norwegian University of Science and Technology, Trondheim, Norway
- Per Winge
  - Cell, Molecular Biology and Genomics Group, Department of Biology, Norwegian University of Science and Technology, Trondheim, Norway
  - Taskforce Salmon Lice, Department of Biology, Norwegian University of Science and Technology, Trondheim, Norway
|
18
|
Eggers B, Schork K, Turewicz M, Barkovits K, Eisenacher M, Schröder R, Clemen CS, Marcus K. Advanced Fiber Type-Specific Protein Profiles Derived from Adult Murine Skeletal Muscle. Proteomes 2021; 9:proteomes9020028. [PMID: 34201234 PMCID: PMC8293376 DOI: 10.3390/proteomes9020028] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 04/14/2021] [Revised: 06/01/2021] [Accepted: 06/02/2021] [Indexed: 02/07/2023]
Abstract
Skeletal muscle is a heterogeneous tissue consisting of blood vessels, connective tissue, and muscle fibers. The last are highly adaptive and can change their molecular composition depending on external and internal factors, such as exercise, age, and disease. Thus, examination of the skeletal muscles at the fiber type level is essential to detect potential alterations. Therefore, we established a protocol in which myosin heavy chain isoform immunolabeled muscle fibers were laser microdissected and separately investigated by mass spectrometry to develop advanced proteomic profiles of all murine skeletal muscle fiber types. All data are available via ProteomeXchange with the identifier PXD025359. Our in-depth mass spectrometric analysis revealed unique fiber type protein profiles, confirming fiber type-specific metabolic properties and revealing a more versatile function of type IIx fibers. Furthermore, we found that multiple myopathy-associated proteins were enriched in type I and IIa fibers. To further optimize the assignment of fiber types based on the protein profile, we developed a hypothesis-free machine-learning approach, identified a discriminative peptide panel, and confirmed our panel using a public data set.
Affiliation(s)
- Britta Eggers (corresponding author)
  - Medizinisches Proteom-Center, Medical Faculty, Ruhr-University Bochum, 44801 Bochum, Germany
  - Medical Proteome Analysis, Center for Protein Diagnostics (PRODI), Ruhr-University Bochum, 44801 Bochum, Germany
- Karin Schork
  - Medizinisches Proteom-Center, Medical Faculty, Ruhr-University Bochum, 44801 Bochum, Germany
  - Medical Proteome Analysis, Center for Protein Diagnostics (PRODI), Ruhr-University Bochum, 44801 Bochum, Germany
- Michael Turewicz
  - Medizinisches Proteom-Center, Medical Faculty, Ruhr-University Bochum, 44801 Bochum, Germany
  - Medical Proteome Analysis, Center for Protein Diagnostics (PRODI), Ruhr-University Bochum, 44801 Bochum, Germany
- Katalin Barkovits
  - Medizinisches Proteom-Center, Medical Faculty, Ruhr-University Bochum, 44801 Bochum, Germany
  - Medical Proteome Analysis, Center for Protein Diagnostics (PRODI), Ruhr-University Bochum, 44801 Bochum, Germany
- Martin Eisenacher
  - Medizinisches Proteom-Center, Medical Faculty, Ruhr-University Bochum, 44801 Bochum, Germany
  - Medical Proteome Analysis, Center for Protein Diagnostics (PRODI), Ruhr-University Bochum, 44801 Bochum, Germany
- Rolf Schröder
  - Institute of Neuropathology, University Hospital Erlangen, Friedrich-Alexander University Erlangen-Nürnberg, 91054 Erlangen, Germany
- Christoph S. Clemen
  - German Aerospace Center, Institute of Aerospace Medicine, 51147 Cologne, Germany
  - Center for Physiology and Pathophysiology, Institute of Vegetative Physiology, Medical Faculty, University of Cologne, 50931 Cologne, Germany
- Katrin Marcus (corresponding author)
  - Medizinisches Proteom-Center, Medical Faculty, Ruhr-University Bochum, 44801 Bochum, Germany
  - Medical Proteome Analysis, Center for Protein Diagnostics (PRODI), Ruhr-University Bochum, 44801 Bochum, Germany
|
19
|
Speiser JL. A random forest method with feature selection for developing medical prediction models with clustered and longitudinal data. J Biomed Inform 2021; 117:103763. [PMID: 33781921 PMCID: PMC8131242 DOI: 10.1016/j.jbi.2021.103763] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 10/15/2020] [Revised: 03/03/2021] [Accepted: 03/23/2021] [Indexed: 12/22/2022]
Abstract
BACKGROUND Machine learning methodologies are gaining popularity for developing medical prediction models for datasets with a large number of predictors, particularly in the setting of clustered and longitudinal data. Binary Mixed Model (BiMM) forest is a promising machine learning algorithm which may be applied to develop prediction models for clustered and longitudinal binary outcomes. Although machine learning methods for clustered and longitudinal data such as BiMM forest exist, feature selection has not been analyzed via data simulations. Feature selection improves the practicality and ease of use of prediction models for clinicians by reducing the burden of data collection. Thus, feature selection procedures are not only beneficial, but are often necessary for development of medical prediction models. In this study, we aim to assess feature selection within the BiMM forest setting for modeling clustered and longitudinal binary outcomes. METHODS We conducted a simulation study to compare BiMM forest with feature selection (backward elimination or stepwise selection) to standard generalized linear mixed model feature selection methods (shrinkage and backward elimination). We also evaluated feature selection methods to develop models predicting mobility disability in older adults using the Health, Aging and Body Composition Study dataset as an example utilization of the proposed methodology. RESULTS BiMM forest with backward elimination generally offered higher computational efficiency, similar or higher predictive performance (accuracy and area under the receiver operating characteristic curve), and similar or higher ability to identify correct features compared to linear methods for the different simulated scenarios. For predicting mobility disability in older adults, methods generally performed similarly in terms of accuracy, area under the receiver operating characteristic curve, and specificity; however, BiMM forest with backward elimination had the highest sensitivity.
CONCLUSIONS This study is novel because it is the first investigation of feature selection for developing random forest prediction models for clustered and longitudinal binary outcomes. Results from the simulation study reveal that BiMM forest with backward elimination has the highest accuracy (performance and identification of correct features) and lowest computation time compared to other feature selection methods in some scenarios and similar performance in other scenarios. Many informatics datasets have clustered and longitudinal outcomes and results from this study suggest that BiMM forest with backward elimination may be beneficial for developing medical prediction models.
Affiliation(s)
- Jaime Lynn Speiser
  - Department of Biostatistics and Data Science, Wake Forest School of Medicine, Winston-Salem, NC 27157, USA
|
20
|
Yaşar Ş, Çolak C, Yoloğlu S. Artificial Intelligence-Based Prediction of Covid-19 Severity on the Results of Protein Profiling. Computer Methods and Programs in Biomedicine 2021; 202:105996. [PMID: 33631640 PMCID: PMC7882428 DOI: 10.1016/j.cmpb.2021.105996] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 11/18/2020] [Accepted: 02/06/2021] [Indexed: 05/21/2023]
Abstract
BACKGROUND COVID-19 progresses slowly and negatively affects many people. However, mild to moderate symptoms develop in most infected people, who recover without hospitalization. Therefore, the development of early diagnosis and treatment strategies is essential. One of these methods is proteomic technology based on the blood protein profiling technique. This study aims to classify three COVID-19 positive patient groups (mild, severe, and critical) and a control group based on blood protein profiling using deep learning (DL), random forest (RF), and gradient boosted trees (GBTs). METHODS The dataset consists of 93 samples (60 COVID-19 patients, 33 controls) and 370 variables obtained from an open-source website. The dataset contains age, gender, and 368 proteins, which are used to predict the relationship between disease severity and proteins using DL and machine learning approaches (RF, GBTs). An evolutionary algorithm tunes the hyperparameters of the models, and the predictions are assessed through accuracy, sensitivity, specificity, precision, F1 score, classification error, and kappa performance metrics. RESULTS The accuracy of RF (96.21%) was higher than that of DL (94.73%). However, the ensemble classifier GBTs produced the highest accuracy (96.98%). TGB1BP2 in the cardiovascular II panel and MILR1 in the inflammation panel were the two most important proteins associated with disease severity. CONCLUSIONS The proposed model (GBTs) achieved the best prediction of disease severity based on the proteins compared to the other algorithms. The results point out that changes in blood proteins associated with the severity of COVID-19 may be used in monitoring and early diagnosis/treatment of the disease.
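The performance metrics listed in this abstract can all be derived from a confusion matrix. The following is a minimal sketch for the binary (one-vs-rest) case, assuming nonzero denominators. The function name and the counts in the usage line are hypothetical and are not taken from the cited study, which reports a four-class problem.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute common classification metrics from binary confusion counts.

    Illustrative helper (hypothetical name); assumes every denominator
    below is nonzero.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # recall, true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value
    f1 = 2 * precision * sensitivity / (precision + sensitivity)

    # Cohen's kappa: agreement beyond what chance alone would produce.
    p_pos = ((tp + fp) / total) * ((tp + fn) / total)
    p_neg = ((fn + tn) / total) * ((fp + tn) / total)
    p_chance = p_pos + p_neg
    kappa = (accuracy - p_chance) / (1 - p_chance)

    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "classification_error": 1 - accuracy,
        "kappa": kappa,
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=40, fp=5, fn=10, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For a multi-class task such as the one described, these quantities would typically be computed per class and then averaged; which averaging the authors used is not stated in the abstract.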
Affiliation(s)
- Şeyma Yaşar
  - Inonu University, Faculty of Medicine, Department of Biostatistics and Medical Informatics, Malatya, Turkey
- Cemil Çolak
  - Inonu University, Faculty of Medicine, Department of Biostatistics and Medical Informatics, Malatya, Turkey
- Saim Yoloğlu
  - Inonu University, Faculty of Medicine, Department of Biostatistics and Medical Informatics, Malatya, Turkey
|
21
|
Use of Machine Learning to Determine the Information Value of a BMI Screening Program. Am J Prev Med 2021; 60:425-433. [PMID: 33483154 PMCID: PMC8610445 DOI: 10.1016/j.amepre.2020.10.016] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 01/28/2020] [Revised: 10/13/2020] [Accepted: 10/14/2020] [Indexed: 12/12/2022]
Abstract
INTRODUCTION Childhood obesity continues to be a significant public health issue in the U.S. and is associated with short- and long-term adverse health outcomes. A number of states have implemented school-based BMI screening programs. However, these programs have been criticized for not being effective in improving students' BMI or reducing childhood obesity. One potential benefit, however, of screening programs is the identification of younger children at risk of obesity as they age. METHODS This study used a unique panel data set from the BMI screening program for public school children in the state of Arkansas collected from the 2003-2004 through the 2018-2019 academic years and analyzed in 2020. Machine learning algorithms were applied to understand the informational value of BMI screening. Specifically, this study evaluated the importance of BMI information during kindergarten to the accurate prediction of childhood obesity by the 4th grade. RESULTS Kindergarten BMI z-score is the most important predictor of obesity by the 4th grade and is much more important to prediction than sociodemographic and socioeconomic variables that would otherwise be available to policymakers in the absence of the screening program. Including the kindergarten BMI z-score of students in the model meaningfully increases the accuracy of the prediction. CONCLUSIONS Data from the Arkansas BMI screening program greatly improve the ability to identify children at greatest risk of future obesity to the extent that better prediction can be translated into more effective policy and better health outcomes. This is a heretofore unexamined benefit of school-based BMI screening.
|
22
|
Suzuki T, Kano S, Suzuki M, Yasukawa S, Mizumachi T, Tsushima N, Hatanaka KC, Hatanaka Y, Matsuno Y, Homma A. Enhanced Angiogenesis in Salivary Duct Carcinoma Ex-Pleomorphic Adenoma. Front Oncol 2021; 10:603717. [PMID: 33692941 PMCID: PMC7937931 DOI: 10.3389/fonc.2020.603717] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 09/07/2020] [Accepted: 12/30/2020] [Indexed: 11/23/2022]
Abstract
Salivary duct carcinoma (SDC) is morphologically similar to breast cancer, with HER2-overexpression reported. With regard to the pattern of disease onset, SDC can arise de novo or from carcinoma ex-pleomorphic adenoma (Ca-ex-PA). Recently, multiple molecular profiles of SDC as well as breast cancer have been reported, with significant differences in HER2 expression between Ca-ex-PA and de novo. We assessed the differences in gene expression between onset classifications. We conducted immunohistochemical analysis and HER2-DISH for 23 patients and classified SDCs into three subtypes as follows: “HER2-positive” (HER2+/any AR), “Luminal-AR” (HER2-/AR+), and “Basal-like” (HER2-/AR-). We assessed the expression levels of 84 functional genes for 19 patients by using a qRT-PCR array. Ten cases were classified as HER2-positive, seven cases as Luminal-AR, and six cases as Basal-like. The gene expression pattern was generally consistent with the corresponding immunostaining classification. The expression levels of VEGFA, ERBB2(HER2), IGF1R, RB1, and XBP1 were higher, while those of SLIT2 and PTEN were lower in Ca-ex-PA than in de novo. The functions of those genes were concentrated in angiogenesis and AKT/PI3K signaling pathway (Fisher’s test: p-value = 0.025 and 0.004, respectively). Multiple machine learning methods, OPLS-DA, LASSO, and RandomForest, also show that VEGFA can be a candidate for the characteristic differences between Ca-ex-PA and de novo. In conclusion, the AKT/PI3K signaling pathway leading to angiogenesis was hyper-activated in all SDCs, particularly in those classified into the Ca-ex-PAs. VEGFA was over-expressed significantly in the Ca-ex-PA, which can be a crucial factor in the malignant conversion to SDC.
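The pathway enrichment p-values quoted in this abstract come from Fisher's test. As a rough illustration of how such a test works on a 2x2 table (for example, differentially expressed vs. other genes, inside vs. outside a pathway), here is a minimal pure-Python sketch. The function names and the counts in the usage line are hypothetical, and the abstract does not say whether the authors used a one-sided or two-sided test, so both are returned.

```python
from math import comb


def hypergeom_pmf(k, row1, row2, col1):
    """Probability of k in the top-left cell when all margins are fixed."""
    return comb(row1, k) * comb(row2, col1 - k) / comb(row1 + row2, col1)


def fisher_exact(a, b, c, d):
    """Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns (one_sided_greater, two_sided) p-values.
    Illustrative helper, not the cited authors' code.
    """
    row1, row2, col1 = a + b, c + d, a + c
    lowest = max(0, col1 - row2)   # smallest feasible top-left count
    highest = min(row1, col1)      # largest feasible top-left count

    p_observed = hypergeom_pmf(a, row1, row2, col1)
    probs = {k: hypergeom_pmf(k, row1, row2, col1)
             for k in range(lowest, highest + 1)}

    # One-sided: tables with at least as many top-left counts as observed.
    greater = sum(p for k, p in probs.items() if k >= a)
    # Two-sided: tables no more probable than the observed one
    # (a small relative tolerance guards against rounding ties).
    two_sided = sum(p for p in probs.values()
                    if p <= p_observed * (1 + 1e-9))
    return greater, two_sided


# Hypothetical 2x2 counts, for illustration only.
p_greater, p_two_sided = fisher_exact(8, 2, 1, 5)
print(f"one-sided p = {p_greater:.5f}, two-sided p = {p_two_sided:.5f}")
```

For enrichment questions the one-sided value is the usual choice, since only over-representation in the pathway is of interest.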
Affiliation(s)
- Takayoshi Suzuki
  - Department of Otolaryngology-Head and Neck Surgery, Graduate School of Medicine, Hokkaido University, Sapporo, Japan
- Satoshi Kano
  - Department of Otolaryngology-Head and Neck Surgery, Graduate School of Medicine, Hokkaido University, Sapporo, Japan
- Masanobu Suzuki
  - Department of Otolaryngology-Head and Neck Surgery, Graduate School of Medicine, Hokkaido University, Sapporo, Japan
- Shinichiro Yasukawa
  - Department of Otolaryngology-Head and Neck Surgery, Graduate School of Medicine, Hokkaido University, Sapporo, Japan
- Takatsugu Mizumachi
  - Department of Otolaryngology-Head and Neck Surgery, Graduate School of Medicine, Hokkaido University, Sapporo, Japan
- Nayuta Tsushima
  - Department of Otolaryngology-Head and Neck Surgery, Graduate School of Medicine, Hokkaido University, Sapporo, Japan
- Kanako C Hatanaka
  - Clinical Research & Medical Innovation Center, Hokkaido University Hospital, Sapporo, Japan
- Yutaka Hatanaka
  - Department of Surgical Pathology, Hokkaido University Hospital, Sapporo, Japan
- Yoshihiro Matsuno
  - Department of Surgical Pathology, Hokkaido University Hospital, Sapporo, Japan
- Akihiro Homma
  - Department of Otolaryngology-Head and Neck Surgery, Graduate School of Medicine, Hokkaido University, Sapporo, Japan
|
23
|
Myall AC, Perkins S, Rushton D, David J, Spencer P, Jones AR, Antczak P. An OMICs based meta-analysis to support infection state stratification. Bioinformatics 2021; 37:2347-2355. [PMID: 33560295 PMCID: PMC8388022 DOI: 10.1093/bioinformatics/btab089] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/02/2020] [Revised: 01/06/2021] [Accepted: 01/24/2021] [Indexed: 11/13/2022]
Abstract
MOTIVATION A fundamental problem for disease treatment is that while antibiotics are a powerful counter to bacteria, they are ineffective against viruses. Often, bacterial and viral infections are confused due to their similar symptoms and lack of rapid diagnostics. With many clinicians relying primarily on symptoms for diagnosis, overuse and misuse of modern antibiotics are rife, contributing to the growing pool of antibiotic resistance. To ensure an individual receives optimal treatment given their disease state and to reduce over-prescription of antibiotics, the host response can in theory be measured quickly to distinguish between the two states. To establish a predictive biomarker panel of disease state (viral/bacterial/no-infection) we conducted a meta-analysis of human blood infection studies using Machine Learning (ML). RESULTS We focused on publicly available gene expression data from two widely used platforms, Affymetrix and Illumina microarrays as they represented a significant proportion of the available data. We were able to develop multi-class models with high accuracies with our best model predicting 93% of bacterial and 89% viral samples correctly. To compare the selected features in each of the different technologies, we reverse engineered the underlying molecular regulatory network and explored the neighbourhood of the selected features. The networks highlighted that although on the gene-level the models differed, they contained genes from the same areas of the network. Specifically, this convergence was to pathways including the Type I interferon Signalling Pathway, Chemotaxis, Apoptotic Processes, and Inflammatory/Innate Response. AVAILABILITY Data and code are available on the Gene Expression Omnibus and github. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ashleigh C Myall
  - Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool, United Kingdom
  - Department of Mathematics, Imperial College London, London, United Kingdom
- Simon Perkins
  - Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool, United Kingdom
- David Rushton
  - Defence and Security Analysis Division, Defence Science and Technology Laboratory (DSTL), Porton Down, Salisbury, United Kingdom
- Jonathan David
  - Chemical, Biological and Radiological Division, Defence Science and Technology Laboratory (DSTL), Porton Down, Salisbury, United Kingdom
- Phillippa Spencer
  - Cyber and Information Systems Division, Defence Science and Technology Laboratory (DSTL), Porton Down, Salisbury, United Kingdom
- Andrew R Jones
  - Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool, United Kingdom
- Philipp Antczak
  - Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool, United Kingdom
  - Center for Molecular Medicine, University of Cologne, Cologne, Germany
|
24
|
Yu Z, Fu Y, Ai J, Zhang J, Huang G, Deng Y. Development of predicitve models to distinguish metals from non-metal toxicants, and individual metal from one another. BMC Bioinformatics 2020; 21:239. [PMID: 33272211 PMCID: PMC7712572 DOI: 10.1186/s12859-020-3525-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/19/2020] [Accepted: 04/29/2020] [Indexed: 11/29/2022]
Abstract
BACKGROUND Evaluating the toxicity of chemical mixtures and their possible mechanisms of action is still a challenge for humans and other organisms. Microarray classifier analysis has shown promise in the toxicogenomic area by identifying biomarkers to predict unknown samples. Our study focuses on identifying gene markers with better sensitivity and specificity, building predictive models to distinguish metals from non-metal toxicants and individual metals from one another, and furthermore helping to understand underlying toxic mechanisms. RESULTS Based on an independent dataset test, using only 15 gene markers, we were able to distinguish metals from non-metal toxicants with 100% accuracy. Of these, 6 and 9 genes were commonly down- and up-regulated, respectively, by most of the metals. Eight of the 15 genes are membrane protein coding genes. Functionally well-annotated genes in the list include ADORA2B, ARNT, S100G, and DIO3. Also, a 10-gene marker list was identified that can discriminate individual metals from one another with 100% accuracy. We could find a specific gene marker for each metal in the 10-gene marker list. Functionally well-annotated genes in this list include GSTM2, HSD11B, AREG, and C8B. CONCLUSIONS Our findings suggest that using a microarray classifier analysis, not only can we create diagnostic classifiers for predicting an exact metal contaminant from a large contaminant pool with high prediction accuracy, but we can also identify valuable biomarkers to help understand the common and underlying toxic mechanisms induced by metals.
Collapse
Affiliation(s)
- Zongtao Yu
- Department of Laboratory Medicine, Affiliated Taihe Hospital of Xi’an Jiaotong University Health Science Center, Shiyan, 442000 Hubei China
- Department of Laboratory Medicine, Shiyan Taihe Hospital, College of Biomedical Engineering, Hubei University of Medicine, Shiyan, 442000 Hubei China
- Department of Internal Medicine, Rush University Medical Center, Chicago, IL 60612 USA
| | - Yuanyuan Fu
- Bioinformatics Core, Department of Quantitative Health Sciences, University of Hawaii John A. Burns School of Medicine, Honolulu, HI 96813 USA
| | - Junmei Ai
- Department of Internal Medicine, Rush University Medical Center, Chicago, IL 60612 USA
| | - Jicai Zhang
- Department of Laboratory Medicine, Shiyan Taihe Hospital, College of Biomedical Engineering, Hubei University of Medicine, Shiyan, 442000 Hubei China
| | - Gang Huang
- Shanghai Key Laboratory for Molecular Imaging, Shanghai University of Medicine and Health Sciences, Shanghai, 201318 China
| | - Youping Deng
- Bioinformatics Core, Department of Quantitative Health Sciences, University of Hawaii John A. Burns School of Medicine, Honolulu, HI 96813 USA
| |
|
25
|
Hu L, Liu B, Ji J, Li Y. Tree-Based Machine Learning to Identify and Understand Major Determinants for Stroke at the Neighborhood Level. J Am Heart Assoc 2020; 9:e016745. [PMID: 33140687 PMCID: PMC7763737 DOI: 10.1161/jaha.120.016745] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Indexed: 01/05/2023]
Abstract
Background Stroke is a major cardiovascular disease that causes significant health and economic burden in the United States. Neighborhood community‐based interventions have been shown to be both effective and cost‐effective in preventing cardiovascular disease. There is a dearth of robust studies identifying the key determinants of cardiovascular disease and the underlying effect mechanisms at the neighborhood level. We aim to contribute to the evidence base for neighborhood cardiovascular health research. Methods and Results We created a new neighborhood health data set at the census tract level by integrating 4 types of potential predictors, including unhealthy behaviors, prevention measures, sociodemographic factors, and environmental measures from multiple data sources. We used 4 tree‐based machine learning techniques to identify the most critical neighborhood‐level factors in predicting the neighborhood‐level prevalence of stroke, and compared their predictive performance for variable selection. We further quantified the effects of the identified determinants on stroke prevalence using a Bayesian linear regression model. Of the 5 most important predictors identified by our method, higher prevalence of low physical activity, larger share of older adults, higher percentage of non‐Hispanic Black people, and higher ozone levels were associated with higher prevalence of stroke at the neighborhood level. Higher median household income was linked to lower prevalence. The most important interaction term showed an exacerbated adverse effect of aging and low physical activity on the neighborhood‐level prevalence of stroke. Conclusions Tree‐based machine learning provides insights into underlying drivers of neighborhood cardiovascular health by discovering the most important determinants from a wide range of factors in an agnostic, data‐driven, and reproducible way. 
The identified major determinants and the interactive mechanism can be used to prioritize and allocate resources to optimize community‐level interventions for stroke prevention.
Affiliation(s)
- Liangyuan Hu
- Department of Population Health Science and Policy Icahn School of Medicine at Mount Sinai New York NY.,Institute for Health Care Delivery Science Icahn School of Medicine at Mount Sinai New York NY
| | - Bian Liu
- Department of Population Health Science and Policy Icahn School of Medicine at Mount Sinai New York NY
| | - Jiayi Ji
- Department of Population Health Science and Policy Icahn School of Medicine at Mount Sinai New York NY.,Institute for Health Care Delivery Science Icahn School of Medicine at Mount Sinai New York NY
| | - Yan Li
- Department of Population Health Science and Policy Icahn School of Medicine at Mount Sinai New York NY.,Department of Obstetrics, Gynecology, and Reproductive Science Icahn School of Medicine at Mount Sinai New York NY
| |
|
26
|
S V, A J, R S, Mohan S, Bhattacharya S, Kaluri R, Feng G, Tariq U. Multi-modal prediction of breast cancer using particle swarm optimization with non-dominating sorting. INTERNATIONAL JOURNAL OF DISTRIBUTED SENSOR NETWORKS 2020; 16:155014772097150. [DOI: 10.1177/1550147720971505] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 09/15/2023]
Abstract
Cancer is listed as the second leading cause of death worldwide, with almost one person in six dying of cancer. Breast cancer is one of the most common forms of cancer in women and has the second highest mortality rate in the world. Various scientific studies have been conducted to combat this disease, and machine learning approaches have been an extremely popular choice. Particle swarm optimization has been identified as one of the most powerful and efficient techniques for the diagnosis of breast cancer, guiding physicians towards timely and accurate treatment. Multi-modal prediction methods are used to make decisions depending upon different scenarios and aspects, whereas non-dominating sorting is useful for sorting different objects based on differing requirements. The main novelty of this work is a proposed multi-modal prediction algorithm for breast cancer. The work encompasses the use of particle swarm optimization, non-dominating sorting and multi-classifier techniques, namely, the k-nearest neighbour method, fast decision tree and kernel density estimation. Finally, Bayes’ theorem is applied to revise the results and achieve optimum accuracy in breast cancer prediction. The proposed model, which combines particle swarm optimization and non-dominating sorting with classifiers, helps to select the most significant features relevant to breast cancer prediction. The selected features define the objective of the problem model. The proposed model is implemented on the WBCD and WDBC breast cancer data sets publicly available from the UCI machine learning data repository. The metrics considered are sensitivity, specificity, accuracy and time complexity. The experimental results of the study are evaluated against the state-of-the-art algorithms, namely, genetic algorithm kernel density estimation and particle swarm optimization kernel density estimation, and the results justify the superiority of the proposed model.
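Three of the metrics named in this abstract (sensitivity, specificity and accuracy) follow directly from the confusion-matrix counts of a binary classifier. The sketch below is only an illustration of those standard definitions; the function name `binary_metrics` and the label vectors are made up for the example and are not taken from the paper or its data sets.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return sensitivity, specificity and accuracy for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        # Fraction of true positives that the classifier detects.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Fraction of true negatives that the classifier rejects.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        # Fraction of all samples classified correctly.
        "accuracy": (tp + tn) / len(pairs) if pairs else float("nan"),
    }


# Hypothetical labels: 1 = malignant, 0 = benign (illustrative only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Reporting sensitivity and specificity alongside accuracy matters when the classes are unbalanced, because accuracy alone can look high even if most positive cases are missed.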
Affiliation(s)
- Vijayalakshmi S
- School of Computer Science and Engineering, Rajiv Gandhi College of Engineering & Technology, Puducherry, India
| | - John A
- School of Computer Science and Engineering, Galgotias University, Greater Noida, India
| | - Sunder R
- Department of Computer Science and Engineering, Sahrdaya College of Engineering and Technology, Thrissur, India
| | - Senthilkumar Mohan
- School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
| | - Sweta Bhattacharya
- School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
| | - Rajesh Kaluri
- School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
| | - Guang Feng
- Center of Network Information & Modern Education Technology, Guangdong University of Technology, Guangzhou, China
| | - Usman Tariq
- College of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, Al-Kharj, Saudi Arabia
| |
|
27
|
A Comparative Study of Random Forest and Genetic Engineering Programming for the Prediction of Compressive Strength of High Strength Concrete (HSC). APPLIED SCIENCES-BASEL 2020. [DOI: 10.3390/app10207330] [Citation(s) in RCA: 51] [Impact Index Per Article: 12.8] [Indexed: 12/22/2022]
Abstract
Supervised machine learning is an emerging approach for predicting the mechanical properties of concrete. This study uses an ensemble random forest (RF) and a gene expression programming (GEP) algorithm to predict the compressive strength of high strength concrete. The input parameters include cement content, coarse aggregate to fine aggregate ratio, water, and superplasticizer. Statistical measures such as MAE, RSE, and RRMSE are used to evaluate the performance of the models. The RF ensemble model performs best: it builds on weak base learners (decision trees) and gives a strong coefficient of determination, R2 = 0.96, with fewer errors. The GEP algorithm shows good agreement between actual and predicted values and yields an empirical relation. An external statistical check is also applied to the RF and GEP models to validate the variables against the data points. Artificial neural networks (ANNs) and a decision tree (DT) are also applied to the same data sample and compared with the aforementioned models. Permutation feature importance, computed in Python, is applied to the variables to identify the most influential parameters. The machine learning algorithms show a strong correlation between targets and predictions, with low statistical error measures indicating the accuracy of the models.
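The permutation step mentioned at the end of this abstract can be illustrated with a small standard-library sketch. It assumes the usual definition of permutation importance: shuffle one input column at a time and record how much the prediction error grows. The `predict` function is a hypothetical stand-in for a fitted model (not the authors' random forest), the data are synthetic, and MAE is used as the error measure only because the abstract mentions it.

```python
import random


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Average increase in MAE when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mean_absolute_error(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = mean_absolute_error(y, [predict(row) for row in X_perm])
            increases.append(permuted - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Synthetic data: the target depends strongly on feature 0, weakly on
# feature 1, and not at all on feature 2 (illustrative values only).
data_rng = random.Random(42)
X = [[data_rng.uniform(0, 1) for _ in range(3)] for _ in range(200)]
y = [5.0 * a + 1.0 * b + 0.0 * c for a, b, c in X]


def predict(row):
    # Hypothetical fitted model; any callable mapping a row to a prediction works.
    return 5.0 * row[0] + 1.0 * row[1]


scores = permutation_importance(predict, X, y)
print(scores)
```

A feature the model ignores gets an importance of zero, while the feature that drives the prediction shows the largest error increase; the same loop works unchanged with a real fitted model's prediction function.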
|
28
|
Zhang S, Shao J, Yu D, Qiu X, Zhang J. MatchMixeR: a cross-platform normalization method for gene expression data integration. Bioinformatics 2020; 36:2486-2491. [PMID: 31904810 DOI: 10.1093/bioinformatics/btz974] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 04/30/2019] [Revised: 09/19/2019] [Accepted: 12/31/2019] [Indexed: 01/18/2023] Open
Abstract
MOTIVATION Combining gene expression (GE) profiles generated from different platforms enables studies that were previously infeasible due to sample size limitations. Several cross-platform normalization methods have been developed to remove the systematic differences between platforms, but they may also remove meaningful biological differences among datasets. In this work, we propose a novel approach, dubbed 'MatchMixeR', that removes the platform differences but not the biological ones. We model platform differences by a linear mixed effects regression (LMER) model, and estimate them from matched GE profiles of the same cell line or tissue measured on different platforms. The resulting model can then be used to remove platform differences in other datasets. By using LMER, we achieve better bias-variance trade-off in parameter estimation. We also design a computationally efficient algorithm based on the moment method, which is ideal for ultra-high-dimensional LMER analysis. RESULTS Compared with several prominent competing methods, MatchMixeR achieved the highest after-normalization concordance. Subsequent differential expression analyses based on datasets integrated from different platforms showed that using MatchMixeR achieved the best trade-off between true and false discoveries, and this advantage is more apparent in datasets with limited samples or unbalanced group proportions. AVAILABILITY AND IMPLEMENTATION Our method is implemented in an R package, 'MatchMixeR', freely available at: https://github.com/dy16b/Cross-Platform-Normalization. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Serin Zhang
- Department of Statistics, Florida State University, Tallahassee, FL 32306, USA
| | - Jiang Shao
- Gilead Sciences Inc., Foster City, CA 94404, USA
| | - Disa Yu
- Department of Statistics, Florida State University, Tallahassee, FL 32306, USA
| | - Xing Qiu
- Department of Biostatistics and Computational Biology, University of Rochester, Rochester, NY 14624, USA
| | - Jinfeng Zhang
- Department of Statistics, Florida State University, Tallahassee, FL 32306, USA
| |
|
29
|
Ke H, Wu Y, Wang R, Wu X. Creation of a Prognostic Risk Prediction Model for Lung Adenocarcinoma Based on Gene Expression, Methylation, and Clinical Characteristics. Med Sci Monit 2020; 26:e925833. [PMID: 33021972 PMCID: PMC7549534 DOI: 10.12659/msm.925833] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 01/05/2023] Open
Abstract
Background This study aimed to identify important marker genes in lung adenocarcinoma (LACC) and establish a prognostic risk model to predict the risk of LACC in patients. Material/Methods Gene expression and methylation profiles for LACC and clinical information about cases were downloaded from the Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA) databases, respectively. Differentially expressed genes (DEGs) and differentially methylated genes (DMGs) between cancer and control groups were selected through meta-analysis. Pearson correlation coefficient analysis was performed to identify intersections between DEGs and DMGs, and a functional analysis was performed on the genes that were correlated. Marker genes and clinical factors significantly related to prognosis were identified using univariate and multivariate Cox regression analyses. Risk prediction models were then created based on the marker genes and clinical factors. Results In total, 1975 DEGs and 2095 DMGs were identified. After comparison, 16 prognosis-related genes (EFNB2, TSPAN7, INPP5A, VAMP2, CALML5, SNAI2, RHOBTB1, CKB, ATF7IP2, RIMS2, RCBTB2, YBX1, RAB27B, NFATC1, TCEAL4, and SLC16A3) were selected from 265 overlapping genes. Four clinical factors (pathologic N [node], pathologic T [tumor], pathologic stage, and new tumor) were associated with prognosis. The prognostic risk prediction models were constructed and validated with other independent datasets. Conclusions An integrated model that combines clinical factors and gene markers is useful for predicting risk of LACC in patients. The 16 genes that were identified, including EFNB2, TSPAN7, INPP5A, VAMP2, and CALML5, may serve as novel biomarkers for diagnosis of LACC and prediction of disease prognosis.
Affiliation(s)
- Honggang Ke
- Department of Cardiovascular and Thoracic Surgery, Affiliated Hospital of Nantong University, Nantong, Jiangsu, China (mainland)
| | - Yunyu Wu
- Qixiu Campus, Nantong University, Nantong, Jiangsu, China (mainland)
| | - Runjie Wang
- Department of Oncology, Wuxi People's Hospital, Wuxi, Jiangsu, China (mainland)
| | - Xiaohong Wu
- Department of Medical Oncology, Affiliated Hospital of Jiangnan University and Wuxi 4th People's Hospital, Wuxi, Jiangsu, China (mainland)
| |
|
30
|
Zhang J, Xu D, Hao K, Zhang Y, Chen W, Liu J, Gao R, Wu C, De Marinis Y. FS-GBDT: identification multicancer-risk module via a feature selection algorithm by integrating Fisher score and GBDT. Brief Bioinform 2020; 22:5901960. [PMID: 34020547 DOI: 10.1093/bib/bbaa189] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 05/21/2020] [Revised: 07/03/2020] [Accepted: 07/21/2020] [Indexed: 11/14/2022] Open
Abstract
Cancer is a highly heterogeneous disease caused by dysregulation in different cell types and tissues. However, different cancers may share common mechanisms. It is critical to identify decisive genes involved in the development and progression of cancer, and joint analysis of multiple cancers may help to discover overlapping mechanisms among different cancers. In this study, we propose a fusion feature selection framework based on an ensemble method, named Fisher score and Gradient Boosting Decision Tree (FS-GBDT), to select robust and decisive feature genes in high-dimensional gene expression datasets. Joint analysis of 11 human cancer types was conducted to explore the key feature gene subset of cancer. To verify the efficacy of FS-GBDT, we compared it with four other common feature selection algorithms using a Support Vector Machine (SVM) classifier. FS-GBDT achieved the highest scores on the evaluation indicators, outperforming the other four methods. In addition, we performed gene ontology analysis and literature validation of the key gene subset, and the subset was classified into several functional modules. Functional modules can be used as disease markers in place of single genes, which are difficult to reproduce in gene chip applications, and to study the core mechanisms of cancer.
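The Fisher score half of FS-GBDT can be sketched with one commonly used definition: for each feature, the class-size-weighted squared distance of the class means from the overall mean, divided by the class-size-weighted within-class variance. The code below assumes that definition and uses made-up expression values and class labels; it does not reproduce the paper's combination with GBDT, and the function name `fisher_scores` is hypothetical.

```python
from collections import defaultdict


def fisher_scores(X, labels):
    """Fisher score per feature: between-class scatter over within-class scatter."""
    groups = defaultdict(list)
    for row, label in zip(X, labels):
        groups[label].append(row)

    scores = []
    for j in range(len(X[0])):
        overall_mean = sum(row[j] for row in X) / len(X)
        between = 0.0
        within = 0.0
        for rows in groups.values():
            n_k = len(rows)
            mean_k = sum(r[j] for r in rows) / n_k
            var_k = sum((r[j] - mean_k) ** 2 for r in rows) / n_k
            between += n_k * (mean_k - overall_mean) ** 2
            within += n_k * var_k
        # A feature with zero within-class variance separates the classes perfectly.
        scores.append(between / within if within > 0 else float("inf"))
    return scores


# Made-up expression matrix: rows are samples, columns are two genes.
X = [
    [1.0, 5.0],
    [1.2, 3.0],
    [0.8, 4.0],
    [3.0, 4.5],
    [3.2, 3.5],
    [2.8, 4.0],
]
labels = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]

scores = fisher_scores(X, labels)
# Rank genes from most to least discriminative.
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print(scores, ranking)
```

In this toy matrix the first gene separates the two classes cleanly and ranks first, while the second gene has identical class means and scores zero. A filter of this kind is cheap to compute, which is why it is often used to shortlist genes before a more expensive model-based selection step.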
Affiliation(s)
- Jialin Zhang
- School of Mathematics and Statistics at Shandong University, China
| | - Da Xu
- School of Mathematics and Statistics at Shandong University, China
| | - Kaijing Hao
- School of Mathematics and Statistics at Shandong University, China
| | - Yusen Zhang
- academic leader of Computer Engineering in Shandong University, China
| | - Wei Chen
- School of Mathematics and Statistics at Shandong University, China
| | - Jiaguo Liu
- School of Mathematics and Statistics at Shandong University, China
| | - Rui Gao
- School of Control Science and Engineering, Shandong University
| | - Chuanyan Wu
- School of Intelligent Engineering in Shandong Management University
| | | |
|
31
|
Accurate Nonendoscopic Detection of Barrett's Esophagus by Methylated DNA Markers: A Multisite Case Control Study. Am J Gastroenterol 2020; 115:1201-1209. [PMID: 32558685 PMCID: PMC7415629 DOI: 10.14309/ajg.0000000000000656] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Indexed: 12/11/2022]
Abstract
INTRODUCTION Nonendoscopic Barrett's esophagus (BE) screening may help improve esophageal adenocarcinoma outcomes. We previously demonstrated promising accuracy of methylated DNA markers (MDMs) for the nonendoscopic diagnosis of BE using samples obtained from a capsule sponge-on-string (SOS) device. We aimed to assess the accuracy of these MDMs in an independent cohort using a commercial grade assay. METHODS BE cases had ≥ 1 cm of circumferential BE with intestinal metaplasia; controls had no endoscopic evidence of BE. The SOS device was withdrawn 8 minutes after swallowing, followed by endoscopy (the criterion standard). Highest performing MDMs from a previous study were blindly assessed on extracted bisulfite-converted DNA by target enrichment long-probe quantitative amplified signal (TELQAS) assays. Optimal MDM combinations were selected and analyzed using random forest modeling with in silico cross-validation. RESULTS Of 295 patients consented, 268 (91%) swallowed the SOS device; 112 cases and 89 controls met the pre-established inclusion criteria. The median BE length was 6 cm (interquartile range 4-9), and 50% had no dysplasia. The cross-validated sensitivity and specificity of a 5 MDM random forest model were 92% (95% confidence interval 85%-96%) and 94% (95% confidence interval 87%-98%), respectively. Model performance was not affected by age, gender, or smoking history but was influenced by the BE segment length. SOS administration was well tolerated (median [interquartile range] tolerability 2 [0, 4] on 10 scale grading), and 95% preferred SOS over endoscopy. DISCUSSION Using a minimally invasive molecular approach, MDMs assayed from SOS samples show promise as a safe and accurate nonendoscopic test for BE prediction.
|
32
|
Azodi CB, Tang J, Shiu SH. Opening the Black Box: Interpretable Machine Learning for Geneticists. Trends Genet 2020; 36:442-455. [PMID: 32396837 DOI: 10.1016/j.tig.2020.03.005] [Citation(s) in RCA: 114] [Impact Index Per Article: 28.5] [Received: 01/03/2020] [Revised: 03/12/2020] [Accepted: 03/16/2020] [Indexed: 01/16/2023]
Abstract
Because of its ability to find complex patterns in high dimensional and heterogeneous data, machine learning (ML) has emerged as a critical tool for making sense of the growing amount of genetic and genomic data available. While the complexity of ML models is what makes them powerful, it also makes them difficult to interpret. Fortunately, efforts to develop approaches that make the inner workings of ML models understandable to humans have improved our ability to make novel biological insights. Here, we discuss the importance of interpretable ML, different strategies for interpreting ML models, and examples of how these strategies have been applied. Finally, we identify challenges and promising future directions for interpretable ML in genetics and genomics.
Affiliation(s)
- Christina B Azodi
- Department of Plant Biology, Michigan State University, East Lansing, MI, USA; Bioinformatics and Cellular Genomics, St. Vincent's Institute of Medical Research, Fitzroy, Victoria, Australia.
| | - Jiliang Tang
- Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
| | - Shin-Han Shiu
- Department of Plant Biology, Michigan State University, East Lansing, MI, USA; Department of Computational Mathematics, Science, and Engineering, Michigan State University, East Lansing, MI, USA.
| |
|
33
|
Serra A, Fratello M, Cattelani L, Liampa I, Melagraki G, Kohonen P, Nymark P, Federico A, Kinaret PAS, Jagiello K, Ha MK, Choi JS, Sanabria N, Gulumian M, Puzyn T, Yoon TH, Sarimveis H, Grafström R, Afantitis A, Greco D. Transcriptomics in Toxicogenomics, Part III: Data Modelling for Risk Assessment. NANOMATERIALS (BASEL, SWITZERLAND) 2020; 10:E708. [PMID: 32276469 PMCID: PMC7221955 DOI: 10.3390/nano10040708] [Citation(s) in RCA: 30] [Impact Index Per Article: 7.5] [Received: 03/10/2020] [Revised: 03/25/2020] [Accepted: 03/26/2020] [Indexed: 12/30/2022]
Abstract
Transcriptomics data are relevant to address a number of challenges in Toxicogenomics (TGx). After careful planning of exposure conditions and data preprocessing, the TGx data can be used in predictive toxicology, where more advanced modelling techniques are applied. The large volume of molecular profiles produced by omics-based technologies allows the development and application of artificial intelligence (AI) methods in TGx. Indeed, the publicly available omics datasets are constantly increasing together with a plethora of different methods that are made available to facilitate their analysis, interpretation and the generation of accurate and stable predictive models. In this review, we present the state-of-the-art of data modelling applied to transcriptomics data in TGx. We show how the benchmark dose (BMD) analysis can be applied to TGx data. We review read across and adverse outcome pathways (AOP) modelling methodologies. We discuss how network-based approaches can be successfully employed to clarify the mechanism of action (MOA) or specific biomarkers of exposure. We also describe the main AI methodologies applied to TGx data to create predictive classification and regression models and we address current challenges. Finally, we present a short description of deep learning (DL) and data integration methodologies applied in these contexts. Modelling of TGx data represents a valuable tool for more accurate chemical safety assessment. This review is the third part of a three-article series on Transcriptomics in Toxicogenomics.
Affiliation(s)
- Angela Serra
- Faculty of Medicine and Health Technology, Tampere University, FI-33014 Tampere, Finland; (A.S.); (M.F.); (L.C.); (A.F.); (P.A.S.K.)
- BioMediTech Institute, Tampere University, FI-33014 Tampere, Finland
| | - Michele Fratello
- Faculty of Medicine and Health Technology, Tampere University, FI-33014 Tampere, Finland; (A.S.); (M.F.); (L.C.); (A.F.); (P.A.S.K.)
- BioMediTech Institute, Tampere University, FI-33014 Tampere, Finland
| | - Luca Cattelani
- Faculty of Medicine and Health Technology, Tampere University, FI-33014 Tampere, Finland; (A.S.); (M.F.); (L.C.); (A.F.); (P.A.S.K.)
- BioMediTech Institute, Tampere University, FI-33014 Tampere, Finland
| | - Irene Liampa
- School of Chemical Engineering, National Technical University of Athens, 157 80 Athens, Greece; (I.L.); (H.S.)
| | - Georgia Melagraki
- Nanoinformatics Department, NovaMechanics Ltd., Nicosia 1065, Cyprus; (G.M.); (A.A.)
| | - Pekka Kohonen
- Institute of Environmental Medicine, Karolinska Institutet, 171 77 Stockholm, Sweden; (P.K.); (P.N.); (R.G.)
- Division of Toxicology, Misvik Biology, 20520 Turku, Finland
| | - Penny Nymark
- Institute of Environmental Medicine, Karolinska Institutet, 171 77 Stockholm, Sweden; (P.K.); (P.N.); (R.G.)
- Division of Toxicology, Misvik Biology, 20520 Turku, Finland
| | - Antonio Federico
- Faculty of Medicine and Health Technology, Tampere University, FI-33014 Tampere, Finland; (A.S.); (M.F.); (L.C.); (A.F.); (P.A.S.K.)
- BioMediTech Institute, Tampere University, FI-33014 Tampere, Finland
| | - Pia Anneli Sofia Kinaret
- Faculty of Medicine and Health Technology, Tampere University, FI-33014 Tampere, Finland; (A.S.); (M.F.); (L.C.); (A.F.); (P.A.S.K.)
- BioMediTech Institute, Tampere University, FI-33014 Tampere, Finland
- Institute of Biotechnology, University of Helsinki, 00014 Helsinki, Finland
| | - Karolina Jagiello
- QSAR Lab Ltd., Aleja Grunwaldzka 190/102, 80-266 Gdansk, Poland; (K.J.); (T.P.)
- University of Gdansk, Faculty of Chemistry, Wita Stwosza 63, 80-308 Gdansk, Poland
| | - My Kieu Ha
- Center for Next Generation Cytometry, Hanyang University, Seoul 04763, Korea; (M.K.H.); (J.-S.C.); (T.-H.Y.)
- Department of Chemistry, College of Natural Sciences, Hanyang University, Seoul 04763, Korea
- Institute of Next Generation Material Design, Hanyang University, Seoul 04763, Korea
| | - Jang-Sik Choi
- Center for Next Generation Cytometry, Hanyang University, Seoul 04763, Korea; (M.K.H.); (J.-S.C.); (T.-H.Y.)
- Department of Chemistry, College of Natural Sciences, Hanyang University, Seoul 04763, Korea
- Institute of Next Generation Material Design, Hanyang University, Seoul 04763, Korea
| | - Natasha Sanabria
- National Institute for Occupational Health, Johannesburg 30333, South Africa; (N.S.); (M.G.)
| | - Mary Gulumian
- National Institute for Occupational Health, Johannesburg 30333, South Africa; (N.S.); (M.G.)
- Haematology and Molecular Medicine Department, School of Pathology, University of the Witwatersrand, Johannesburg 2050, South Africa
| | - Tomasz Puzyn
- QSAR Lab Ltd., Aleja Grunwaldzka 190/102, 80-266 Gdansk, Poland; (K.J.); (T.P.)
- University of Gdansk, Faculty of Chemistry, Wita Stwosza 63, 80-308 Gdansk, Poland
| | - Tae-Hyun Yoon
- Center for Next Generation Cytometry, Hanyang University, Seoul 04763, Korea; (M.K.H.); (J.-S.C.); (T.-H.Y.)
- Department of Chemistry, College of Natural Sciences, Hanyang University, Seoul 04763, Korea
- Institute of Next Generation Material Design, Hanyang University, Seoul 04763, Korea
| | - Haralambos Sarimveis
- School of Chemical Engineering, National Technical University of Athens, 157 80 Athens, Greece; (I.L.); (H.S.)
| | - Roland Grafström
- Institute of Environmental Medicine, Karolinska Institutet, 171 77 Stockholm, Sweden; (P.K.); (P.N.); (R.G.)
- Division of Toxicology, Misvik Biology, 20520 Turku, Finland
| | - Antreas Afantitis
- Nanoinformatics Department, NovaMechanics Ltd., Nicosia 1065, Cyprus; (G.M.); (A.A.)
| | - Dario Greco
- Faculty of Medicine and Health Technology, Tampere University, FI-33014 Tampere, Finland; (A.S.); (M.F.); (L.C.); (A.F.); (P.A.S.K.)
- BioMediTech Institute, Tampere University, FI-33014 Tampere, Finland
- Institute of Biotechnology, University of Helsinki, 00014 Helsinki, Finland
| |
|
34
|
Chiu YJ, Hsieh YH, Huang YH. Improved cell composition deconvolution method of bulk gene expression profiles to quantify subsets of immune cells. BMC Med Genomics 2019; 12:169. [PMID: 31856824 PMCID: PMC6923925 DOI: 10.1186/s12920-019-0613-5] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 10/26/2019] [Accepted: 10/31/2019] [Indexed: 01/07/2023] Open
Abstract
Background To facilitate the investigation of the pathogenic roles played by various immune cells in complex tissues such as tumors, a few computational methods for deconvoluting bulk gene expression profiles to predict cell composition have been created. However, available methods were usually developed along with a set of reference gene expression profiles consisting of imbalanced replicates across different cell types. Therefore, the objective of this study was to create a new deconvolution method equipped with a new set of reference gene expression profiles that incorporate more microarray replicates of the immune cells that have been frequently implicated in the poor prognosis of cancers, such as T helper cells, regulatory T cells and macrophage M1/M2 cells. Methods Our deconvolution method was developed by choosing ε-support vector regression (ε-SVR) as the core algorithm assigned with a loss function subject to the L1-norm penalty. To construct the reference gene expression signature matrix for regression, a subset of differentially expressed genes were chosen from 148 microarray-based gene expression profiles for 9 types of immune cells by using ANOVA and minimizing condition number. Agreement analyses including mean absolute percentage errors and Bland-Altman plots were carried out to compare the performances of our method and CIBERSORT. Results In silico cell mixtures, simulated bulk tissues, and real human samples with known immune-cell fractions were used as the test datasets for benchmarking. Our method outperformed CIBERSORT in the benchmarks using in silico breast tissue-immune cell mixtures in the proportions of 30:70 and 50:50, and in the benchmark using 164 human PBMC samples. Our results suggest that the performance of our method was at least comparable to that of a state-of-the-art tool, CIBERSORT. 
Conclusions We developed a new cell composition deconvolution method, implemented entirely with publicly available R and Python packages. In addition, we compiled a new set of reference gene expression profiles, which might allow for a more robust prediction of the immune cell fractions from the expression profiles of cell mixtures. The source code of our method can be downloaded from https://github.com/holiday01/deconvolution-to-estimate-immune-cell-subsets.
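The two agreement analyses named in the Methods, mean absolute percentage error and Bland-Altman analysis, can be sketched with the standard library alone. The example below assumes the usual definitions (MAPE as the mean of |estimate − truth| / |truth|, and Bland-Altman limits of agreement as bias ± 1.96 standard deviations of the differences). The cell fractions are invented for illustration and are not taken from the study's benchmarks.

```python
from statistics import mean, stdev


def mape(true_values, estimates):
    """Mean absolute percentage error in percent; true values must be non-zero."""
    return 100.0 * mean(abs(e - t) / abs(t) for t, e in zip(true_values, estimates))


def bland_altman(true_values, estimates):
    """Return (bias, lower limit, upper limit) of agreement for paired values."""
    diffs = [e - t for t, e in zip(true_values, estimates)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical known and estimated fractions for one immune cell type.
true_fractions = [0.10, 0.20, 0.30, 0.40]
estimated_fractions = [0.12, 0.18, 0.33, 0.37]

error_pct = mape(true_fractions, estimated_fractions)
bias, lower, upper = bland_altman(true_fractions, estimated_fractions)
print(round(error_pct, 3), round(bias, 4), round(lower, 4), round(upper, 4))
```

MAPE summarizes the typical relative error, while the Bland-Altman limits show whether one method systematically over- or under-estimates relative to the reference, so the two are usually reported together.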
Affiliation(s)
- Yen-Jung Chiu
- Institute of Biomedical Informatics, National Yang-Ming University, No.155, Sec. 2, Li-Nong St., Beitou Dist, Taipei, 11221, Taiwan
| | - Yi-Hsuan Hsieh
- Institute of Biomedical Informatics, National Yang-Ming University, No.155, Sec. 2, Li-Nong St., Beitou Dist, Taipei, 11221, Taiwan
| | - Yen-Hua Huang
- Institute of Biomedical Informatics, National Yang-Ming University, No.155, Sec. 2, Li-Nong St., Beitou Dist, Taipei, 11221, Taiwan. .,Centre for Systems and Synthetic Biology, National Yang-Ming University, Taipei, 11221, Taiwan.
| |
|
35
|
Mihaylov I, Kańduła M, Krachunov M, Vassilev D. A novel framework for horizontal and vertical data integration in cancer studies with application to survival time prediction models. Biol Direct 2019; 14:22. [PMID: 31752974 PMCID: PMC6868770 DOI: 10.1186/s13062-019-0249-6] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 10/23/2018] [Accepted: 09/20/2019] [Indexed: 12/17/2022] Open
Abstract
Background Recently high-throughput technologies have been massively used alongside clinical tests to study various types of cancer. Data generated in such large-scale studies are heterogeneous, of different types and formats. With lack of effective integration strategies novel models are necessary for efficient and operative data integration, where both clinical and molecular information can be effectively joined for storage, access and ease of use. Such models, combined with machine learning methods for accurate prediction of survival time in cancer studies, can yield novel insights into disease development and lead to precise personalized therapies. Results We developed an approach for intelligent data integration of two cancer datasets (breast cancer and neuroblastoma) − provided in the CAMDA 2018 ‘Cancer Data Integration Challenge’, and compared models for prediction of survival time. We developed a novel semantic network-based data integration framework that utilizes NoSQL databases, where we combined clinical and expression profile data, using both raw data records and external knowledge sources. Utilizing the integrated data we introduced Tumor Integrated Clinical Feature (TICF) − a new feature for accurate prediction of patient survival time. Finally, we applied and validated several machine learning models for survival time prediction. Conclusion We developed a framework for semantic integration of clinical and omics data that can borrow information across multiple cancer studies. By linking data with external domain knowledge sources our approach facilitates enrichment of the studied data by discovery of internal relations. The proposed and validated machine learning models for survival time prediction yielded accurate results. Reviewers This article was reviewed by Eran Elhaik, Wenzhong Xiao and Carlos Loucera.
Affiliation(s)
- Iliyan Mihaylov
- Faculty of Mathematics and Informatics, Sofia University, "St. Kliment Ohridski", 5 James Bourchier Blvd., Sofia, 1164, Bulgaria
- Maciej Kańduła
- Department of Biotechnology, Boku University, Vienna, 1180, Austria; Institute for Machine Learning, Johannes Kepler University, Linz, 4040, Austria
- Milko Krachunov
- Faculty of Mathematics and Informatics, Sofia University, "St. Kliment Ohridski", 5 James Bourchier Blvd., Sofia, 1164, Bulgaria
- Dimitar Vassilev
- Faculty of Mathematics and Informatics, Sofia University, "St. Kliment Ohridski", 5 James Bourchier Blvd., Sofia, 1164, Bulgaria
36
Speiser JL, Miller ME, Tooze J, Ip E. A Comparison of Random Forest Variable Selection Methods for Classification Prediction Modeling. Expert Systems with Applications 2019; 134:93-101. [PMID: 32968335 PMCID: PMC7508310 DOI: 10.1016/j.eswa.2019.05.028] [Citation(s) in RCA: 228] [Impact Index Per Article: 45.6] [Indexed: 05/17/2023]
Abstract
Random forest classification is a popular machine learning method for developing prediction models in many research settings. Often in prediction modeling, a goal is to reduce the number of variables needed to obtain a prediction in order to reduce the burden of data collection and improve efficiency. Several variable selection methods exist for the setting of random forest classification; however, there is a paucity of literature to guide users as to which method may be preferable for different types of datasets. Using 311 classification datasets freely available online, we evaluate the prediction error rates, number of variables, computation times and area under the receiver operating curve for many random forest variable selection methods. We compare random forest variable selection methods for different types of datasets (datasets with binary outcomes, datasets with many predictors, and datasets with imbalanced outcomes) and for different types of methods (standard random forest versus conditional random forest methods and test based versus performance based methods). Based on our study, the best variable selection methods for most datasets are Jiang's method and the method implemented in the VSURF R package. For datasets with many predictors, the methods implemented in the R packages varSelRF and Boruta are preferable due to computational efficiency. A significant contribution of this study is the ability to assess different variable selection techniques in the setting of random forest classification in order to identify preferable methods based on applications in expert and intelligent systems.
Affiliation(s)
- Jaime Lynn Speiser
- Department of Biostatistical Sciences, Wake Forest School of Medicine, Winston-Salem, NC 27157, USA
- Michael E. Miller
- Department of Biostatistical Sciences, Wake Forest School of Medicine, Winston-Salem, NC 27157, USA
- Janet Tooze
- Department of Biostatistical Sciences, Wake Forest School of Medicine, Winston-Salem, NC 27157, USA
- Edward Ip
- Department of Biostatistical Sciences, Wake Forest School of Medicine, Winston-Salem, NC 27157, USA
37
Buzdin A, Sorokin M, Garazha A, Glusker A, Aleshin A, Poddubskaya E, Sekacheva M, Kim E, Gaifullin N, Giese A, Seryakov A, Rumiantsev P, Moshkovskii S, Moiseev A. RNA sequencing for research and diagnostics in clinical oncology. Semin Cancer Biol 2019; 60:311-323. [PMID: 31412295 DOI: 10.1016/j.semcancer.2019.07.010] [Citation(s) in RCA: 32] [Impact Index Per Article: 6.4] [Received: 06/30/2019] [Accepted: 07/16/2019] [Indexed: 12/26/2022]
Abstract
Molecular diagnostics is becoming one of the major drivers of personalized oncology. With hundreds of different approved anticancer drugs and regimens for their administration, selecting the proper treatment for a patient is at least a nontrivial task. This is especially true for recurrent and metastatic cancers where the standard lines of therapy have failed. Recent trials demonstrated that mutation assays have a strong limitation in personalized selection of therapeutics; consequently, most of the drugs cannot be ranked and only a small percentage of patients can benefit from the screening. Other approaches are, therefore, needed to address the problem of finding proper targeted therapies. The analysis of RNA expression (transcriptomic) profiles presents a reasonable solution because transcriptomics stands a few steps closer to the tumor phenotype than genome analysis. Several recent studies pioneered the use of transcriptomics in practical oncology and showed truly encouraging clinical results. The possibility of directly measuring the expression levels of molecular drug targets and profiling the activation of the relevant molecular pathways enables personalized prioritization for all types of molecular-targeted therapies. RNA sequencing is the most robust tool for high-throughput quantitative transcriptomics. Its use, potential, and limitations in clinical oncology are reviewed here, along with technical aspects such as optimal types of biosamples, RNA sequencing profile normalization, quality controls and several levels of data analysis.
Affiliation(s)
- Anton Buzdin
- I.M. Sechenov First Moscow State Medical University, Moscow, Russia; Omicsway Corp., Walnut, CA, USA; Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Moscow, Russia
- Maxim Sorokin
- I.M. Sechenov First Moscow State Medical University, Moscow, Russia; Omicsway Corp., Walnut, CA, USA; Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Moscow, Russia
- Alex Aleshin
- Stanford University School of Medicine, Stanford, 94305, CA, USA
- Elena Poddubskaya
- I.M. Sechenov First Moscow State Medical University, Moscow, Russia; Vitamed Oncological Clinics, Moscow, Russia
- Marina Sekacheva
- I.M. Sechenov First Moscow State Medical University, Moscow, Russia
- Ella Kim
- Johannes Gutenberg University Mainz, Mainz, Germany
- Nurshat Gaifullin
- Lomonosov Moscow State University, Faculty of Medicine, Moscow, Russia
- Sergey Moshkovskii
- Institute of Biomedical Chemistry, Moscow, 119121, Russia; Pirogov Russian National Research Medical University (RNRMU), Moscow, 117997, Russia
- Alexey Moiseev
- I.M. Sechenov First Moscow State Medical University, Moscow, Russia
38
Acevedo A, Berthel A, DuBois D, Almon RR, Jusko WJ, Androulakis IP. Pathway-Based Analysis of the Liver Response to Intravenous Methylprednisolone Administration in Rats: Acute Versus Chronic Dosing. Gene Regulation and Systems Biology 2019; 13:1177625019840282. [PMID: 31019365 PMCID: PMC6466473 DOI: 10.1177/1177625019840282] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 01/14/2019] [Accepted: 03/05/2019] [Indexed: 12/25/2022]
Abstract
Pharmacological time-series data, from comparative dosing studies, are critical to characterizing drug effects. Reconciling the data from multiple studies is inevitably difficult; multiple in vivo high-throughput -omics studies are necessary to capture the global and temporal effects of the drug, but these experiments, though analogous, differ in (microarray or other) platforms, time-scales, and dosing regimens and thus cannot be directly combined or compared. This investigation addresses this reconciliation issue with a meta-analysis technique aimed at assessing the intrinsic activity at the pathway level. The purpose of this is to characterize the dosing effects of methylprednisolone (MPL), a widely used anti-inflammatory and immunosuppressive corticosteroid (CS), within the liver. A multivariate decomposition approach is applied to analyze acute and chronic MPL dosing in male adrenalectomized rats and characterize the dosing-dependent differences in the dynamic response of MPL-responsive signaling and metabolic pathways. We demonstrate how to deconstruct signaling and metabolic pathways into their constituent pathway activities, activities which are scored for intrinsic pathway activity. Dosing-induced changes in the dynamics of pathway activities are compared using a model-based assessment of pathway dynamics, extending the principles of pharmacokinetics/pharmacodynamics (PKPD) to describe pathway activities. The model-based approach enabled us to hypothesize on the likely emergence (or disappearance) of indirect dosing-dependent regulatory interactions, pointing to likely mechanistic implications of dosing of MPL transcriptional regulation. Both acute and chronic MPL administration induced a strong core of activity within pathway families including the following: lipid metabolism, amino acid metabolism, carbohydrate metabolism, metabolism of cofactors and vitamins, regulation of essential organelles, and xenobiotic metabolism pathway families. 
Pathway activities alter between acute and chronic dosing, indicating that MPL response is dosing dependent. Furthermore, because multiple pathway activities are dominant within a single pathway, we observe that pathways cannot be defined by a single response. Instead, pathways are defined by multiple, complex, and temporally related activities corresponding to different subgroups of genes within each pathway.
Affiliation(s)
- Alison Acevedo
- Department of Biomedical Engineering, Robert Wood Johnson Medical School, Rutgers, The State University of New Jersey, Piscataway, NJ, USA
- Ana Berthel
- Department of Biochemistry, Mount Holyoke College, South Hadley, MA, USA
- Debra DuBois
- Department of Pharmaceutical Sciences, School of Pharmacy and Pharmaceutical Sciences, The State University of New York at Buffalo, Buffalo, NY, USA
- Department of Biological Sciences, The State University of New York at Buffalo, Buffalo, NY, USA
- Richard R Almon
- Department of Pharmaceutical Sciences, School of Pharmacy and Pharmaceutical Sciences, The State University of New York at Buffalo, Buffalo, NY, USA
- Department of Biological Sciences, The State University of New York at Buffalo, Buffalo, NY, USA
- William J Jusko
- Department of Pharmaceutical Sciences, School of Pharmacy and Pharmaceutical Sciences, The State University of New York at Buffalo, Buffalo, NY, USA
- Department of Biological Sciences, The State University of New York at Buffalo, Buffalo, NY, USA
- Ioannis P Androulakis
- Department of Biomedical Engineering, Robert Wood Johnson Medical School, Rutgers, The State University of New Jersey, Piscataway, NJ, USA
- Department of Chemical and Biochemical Engineering, Robert Wood Johnson Medical School, Rutgers, The State University of New Jersey, Piscataway, NJ, USA
- Department of Surgery, Robert Wood Johnson Medical School, Rutgers, The State University of New Jersey, Piscataway, NJ, USA
39
Zhou XH, Chu XY, Xue G, Xiong JH, Zhang HY. Identifying cancer prognostic modules by module network analysis. BMC Bioinformatics 2019; 20:85. [PMID: 30777030 PMCID: PMC6380061 DOI: 10.1186/s12859-019-2674-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 11/09/2017] [Accepted: 02/08/2019] [Indexed: 02/08/2023] Open
Abstract
Background The identification of prognostic genes that can distinguish the prognostic risks of cancer patients remains a significant challenge. Previous works have proven that functional gene sets were more reliable for this task than the gene signature. However, few works have considered the cross-talk among functional gene sets, which may result in neglecting important prognostic gene sets for cancer. Results Here, we proposed a new method that considers both the interactions among modules and the prognostic correlation of the modules to identify prognostic modules in cancers. First, dense sub-networks in the gene co-expression network of cancer patients were detected. Second, cross-talk between every two modules was identified by a permutation test, thus generating the module network. Third, the prognostic correlation of each module was evaluated by the resampling method. Then, the GeneRank algorithm, which takes the module network and the prognostic correlations of all the modules as input, was applied to prioritize the prognostic modules. Finally, the selected modules were validated by survival analysis in various data sets. Our method was applied in three kinds of cancers, and the results show that our method succeeded in identifying prognostic modules in all the three cancers. In addition, our method outperformed state-of-the-art methods. Furthermore, the selected modules were significantly enriched with known cancer-related genes and drug targets of cancer, which may indicate that the genes involved in the modules may be drug targets for therapy. Conclusions We proposed a useful method to identify key modules in cancer prognosis and our prognostic genes may be good candidates for drug targets. Electronic supplementary material The online version of this article (10.1186/s12859-019-2674-z) contains supplementary material, which is available to authorized users.
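The prioritization step described above feeds a module network and per-module prognostic scores into GeneRank. The sketch below shows a GeneRank-style iteration on a toy network. It is not the authors' implementation: the damping factor `d = 0.5`, the normalization of the scores to sum to one, the fixed iteration count, and the toy adjacency matrix are all assumptions chosen for illustration.

```python
def generank(adjacency, scores, d=0.5, iters=100):
    """GeneRank-style ranking on an undirected network.

    adjacency: symmetric 0/1 matrix (list of lists); adjacency[i][j] == 1
               when modules i and j are linked (here: show cross-talk).
    scores:    non-negative prior score per module (here: a prognostic
               correlation), normalized below to sum to 1 (an assumption).
    d:         damping factor in [0, 1); d = 0 returns the normalized prior.
    """
    n = len(scores)
    total = sum(scores)
    prior = [s / total for s in scores]
    degree = [sum(row) for row in adjacency]
    rank = prior[:]
    for _ in range(iters):
        # Each module keeps (1 - d) of its prior and receives a share of
        # each neighbour's current rank, divided by that neighbour's degree.
        rank = [
            (1 - d) * prior[j]
            + d * sum(adjacency[i][j] * rank[i] / degree[i]
                      for i in range(n) if degree[i] > 0)
            for j in range(n)
        ]
    return rank

# Toy network: module 0 is linked to modules 1 and 2; equal prior scores.
adjacency = [[0, 1, 1],
             [1, 0, 0],
             [1, 0, 0]]
ranks = generank(adjacency, [1.0, 1.0, 1.0])
print(ranks)
```

With equal priors, the better-connected module 0 ends up ranked highest, which is the behaviour that lets network context adjust a purely score-based ordering.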
Affiliation(s)
- Xiong-Hui Zhou
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, People's Republic of China
- Xin-Yi Chu
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, People's Republic of China
- Gang Xue
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, People's Republic of China
- Jiang-Hui Xiong
- State Key Laboratory of Space Medicine Fundamentals and Application, China Astronaut Research and Training Center, Beijing, People's Republic of China; Lab of Epigenetics and Health Tracking Technology, Space Institute of Southern China, Shenzhen, People's Republic of China
- Hong-Yu Zhang
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, People's Republic of China
40
Borisov N, Shabalina I, Tkachev V, Sorokin M, Garazha A, Pulin A, Eremin II, Buzdin A. Shambhala: a platform-agnostic data harmonizer for gene expression data. BMC Bioinformatics 2019; 20:66. [PMID: 30727942 PMCID: PMC6366102 DOI: 10.1186/s12859-019-2641-8] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 12/14/2018] [Accepted: 01/18/2019] [Indexed: 11/10/2022] Open
Abstract
Background Harmonization techniques make different gene expression profiles and their sets compatible and ready for comparisons. Here we present a new bioinformatic tool termed Shambhala for harmonization of multiple human gene expression datasets obtained using different experimental methods and platforms of microarray hybridization and RNA sequencing. Results Unlike previously published methods enabling good quality data harmonization for only two datasets, Shambhala allows conversion of multiple datasets into a universal form suitable for further comparisons. Shambhala harmonization is based on the calibration of gene expression profiles using an auxiliary standardization dataset. Each profile is transformed to make it similar to the output of the microarray hybridization platform Affymetrix Human Gene. This platform was chosen because it has the largest number of human gene expression profiles deposited in public databases. We evaluated Shambhala's ability to retain biologically important features after harmonization. The same four biological samples taken in multiple replicates were profiled independently using three and four different experimental platforms, respectively, then Shambhala-harmonized and investigated by hierarchical clustering. Conclusion Our results showed that, unlike other frequently used methods (quantile normalization and DESeq/DESeq2 normalization), Shambhala harmonization was the only method supporting sample-specific and platform-independent biologically meaningful clustering for the data obtained from multiple experimental platforms. Electronic supplementary material The online version of this article (10.1186/s12859-019-2641-8) contains supplementary material, which is available to authorized users.
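For orientation, quantile normalization, one of the comparison methods named in the abstract above, can be sketched in a few lines. This is a generic textbook version and not Shambhala itself or the exact procedure used in the paper: the function name and the toy profiles are hypothetical, and ties are simply broken by gene order rather than averaged.

```python
def quantile_normalize(profiles):
    """Force every sample to share the same value distribution.

    profiles: list of samples; each sample is a list of expression values
              for the same genes in the same order.
    Returns a new list of samples in which the k-th smallest value of each
    sample is replaced by the mean of the k-th smallest values across all
    samples. Ties are broken by gene order (a simplification).
    """
    n_genes = len(profiles[0])
    sorted_samples = [sorted(p) for p in profiles]
    reference = [
        sum(s[k] for s in sorted_samples) / len(profiles)
        for k in range(n_genes)
    ]
    normalized = []
    for p in profiles:
        # Gene indices ordered from lowest to highest expression.
        order = sorted(range(n_genes), key=lambda g: p[g])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = reference[rank]
        normalized.append(out)
    return normalized

# Two hypothetical samples measured on different platforms.
profiles = [[5.0, 2.0, 3.0],
            [4.0, 1.0, 6.0]]
print(quantile_normalize(profiles))
```

Because the reference distribution is computed from the samples being normalized together, the result depends on the batch, which is one reason harmonizing several datasets at once is harder than harmonizing a pair.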
Affiliation(s)
- Nicolas Borisov
- I.M. Sechenov First Moscow State Medical University, Sechenov University, Moscow, 119991, Russia; Department of bioinformatics and molecular networks, OmicsWay Corporation, Walnut, CA, USA
- Irina Shabalina
- Faculty of Mathematics and Information Technologies, Petrozavodsk State University, Anokhina str., 20, Petrozavodsk, 185910, Russia
- Victor Tkachev
- Department of bioinformatics and molecular networks, OmicsWay Corporation, Walnut, CA, USA
- Maxim Sorokin
- I.M. Sechenov First Moscow State Medical University, Sechenov University, Moscow, 119991, Russia; Department of bioinformatics and molecular networks, OmicsWay Corporation, Walnut, CA, USA; Group for Genomic Regulation of Cell Signaling Systems, Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Moscow, 117997, Russia
- Andrew Garazha
- Department of bioinformatics and molecular networks, OmicsWay Corporation, Walnut, CA, USA; Laboratory of Bioinformatics, Oncology and Immunology, D. Rogachyov Federal Research Center of Pediatric Hematology, Moscow, 117198, Russia
- Andrey Pulin
- Laboratory for Cell Biology and Developmental Pathology, Federal State Institution "Institute of General Pathology and Pathophysiology", FSBSI "IGPP", Moscow, Russia
- Ilya I Eremin
- Department for Regenerative Medicine, JSC Generium, Moscow, Russia
- Anton Buzdin
- I.M. Sechenov First Moscow State Medical University, Sechenov University, Moscow, 119991, Russia; Department of bioinformatics and molecular networks, OmicsWay Corporation, Walnut, CA, USA; Group for Genomic Regulation of Cell Signaling Systems, Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Moscow, 117997, Russia
41
Darst BF, Malecki KC, Engelman CD. Using recursive feature elimination in random forest to account for correlated variables in high dimensional data. BMC Genet 2018; 19:65. [PMID: 30255764 PMCID: PMC6157185 DOI: 10.1186/s12863-018-0633-8] [Citation(s) in RCA: 121] [Impact Index Per Article: 20.2] [Indexed: 11/10/2022] Open
Abstract
Background Random forest (RF) is a machine-learning method that generally works well with high-dimensional problems and allows for nonlinear relationships between predictors; however, the presence of correlated predictors has been shown to impact its ability to identify strong predictors. The Random Forest-Recursive Feature Elimination algorithm (RF-RFE) mitigates this problem in smaller data sets, but this approach has not been tested in high-dimensional omics data sets. Results We integrated 202,919 genotypes and 153,422 methylation sites in 680 individuals, and compared the abilities of RF and RF-RFE to detect simulated causal associations, which included simulated genotype–methylation interactions, between these variables and triglyceride levels. Results show that RF was able to identify strong causal variables with a few highly correlated variables, but it did not detect other causal variables. Conclusions Although RF-RFE decreased the importance of correlated variables, in the presence of many correlated variables, it also decreased the importance of causal variables, making both hard to detect. These findings suggest that RF-RFE may not scale to high-dimensional data.
Affiliation(s)
- Burcu F Darst
- Department of Population Health Sciences, School of Medicine and Public Health, University of Wisconsin, 610 Walnut Street, 1007 WARF, Madison, WI, 53726, USA
- Kristen C Malecki
- Department of Population Health Sciences, School of Medicine and Public Health, University of Wisconsin, 610 Walnut Street, 1007 WARF, Madison, WI, 53726, USA
- Corinne D Engelman
- Department of Population Health Sciences, School of Medicine and Public Health, University of Wisconsin, 610 Walnut Street, 1007 WARF, Madison, WI, 53726, USA
42
Yan J, Kaur J. Feature Selection for Website Fingerprinting. Proceedings on Privacy Enhancing Technologies 2018. [DOI: 10.1515/popets-2018-0039] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Indexed: 11/15/2022]
Abstract
Website fingerprinting based on TCP/IP headers is of significant relevance to several Internet entities. Prior work has focused only on a limited set of features, and does not help understand the extents of fingerprint-ability. We address this by conducting an exhaustive feature analysis within eight different communication scenarios. Our analysis helps reveal several previously-unknown features in several scenarios, that can be used to fingerprint websites with much higher accuracy than previously demonstrated. This work helps the community better understand the extents of learnability (and vulnerability) from TCP/IP headers.
43
Brieuc MSO, Waters CD, Drinan DP, Naish KA. A practical introduction to Random Forest for genetic association studies in ecology and evolution. Mol Ecol Resour 2018; 18:755-766. [PMID: 29504715 DOI: 10.1111/1755-0998.12773] [Citation(s) in RCA: 59] [Impact Index Per Article: 9.8] [Received: 10/30/2017] [Revised: 02/08/2018] [Accepted: 02/17/2018] [Indexed: 12/25/2022]
Abstract
Large genomic studies are becoming increasingly common with advances in sequencing technology, and our ability to understand how genomic variation influences phenotypic variation between individuals has never been greater. The exploration of such relationships first requires the identification of associations between molecular markers and phenotypes. Here, we explore the use of Random Forest (RF), a powerful machine-learning algorithm, in genomic studies to discern loci underlying both discrete and quantitative traits, particularly when studying wild or nonmodel organisms. RF is becoming increasingly used in ecological and population genetics because, unlike traditional methods, it can efficiently analyse thousands of loci simultaneously and account for nonadditive interactions. However, understanding both the power and limitations of Random Forest is important for its proper implementation and the interpretation of results. We therefore provide a practical introduction to the algorithm and its use for identifying associations between molecular markers and phenotypes, discussing such topics as data limitations, algorithm initiation and optimization, as well as interpretation. We also provide short R tutorials as examples, with the aim of providing a guide to the implementation of the algorithm. Topics discussed here are intended to serve as an entry point for molecular ecologists interested in employing Random Forest to identify trait associations in genomic data sets.
Affiliation(s)
- Marine S O Brieuc
- School of Aquatic and Fishery Sciences, University of Washington, Seattle, WA, USA; Center for Ecological and Evolutionary Synthesis (CEES), Department of Biosciences, University of Oslo, Oslo, Norway
- Charles D Waters
- School of Aquatic and Fishery Sciences, University of Washington, Seattle, WA, USA
- Daniel P Drinan
- School of Aquatic and Fishery Sciences, University of Washington, Seattle, WA, USA
- Kerry A Naish
- School of Aquatic and Fishery Sciences, University of Washington, Seattle, WA, USA
44
Qiu X, Zhang L, Nagaratnam Suganthan P, Amaratunga GA. Oblique random forest ensemble via Least Square Estimation for time series forecasting. Inf Sci (N Y) 2017. [DOI: 10.1016/j.ins.2017.08.060] [Citation(s) in RCA: 36] [Impact Index Per Article: 5.1] [Indexed: 10/19/2022]
45
Chen B, Gao S, Ji C, Song G. Integrated analysis reveals candidate genes and transcription factors in lung adenocarcinoma. Mol Med Rep 2017; 16:8371-8379. [PMID: 28983631 DOI: 10.3892/mmr.2017.7656] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 02/29/2016] [Accepted: 02/23/2017] [Indexed: 11/06/2022] Open
Abstract
Lung adenocarcinoma is the most common type of non‑small cell lung cancer in Asia. Therefore, it is important to improve understanding of the underlying transcriptional regulatory mechanisms involved. The present study aimed to identify potential candidate genes and transcription factors (TFs) associated with the disease. Four gene expression profiles were downloaded from the Gene Expression Omnibus database, which included 141 lung adenocarcinoma patients and 191 healthy controls. The differentially expressed genes (DEGs) were screened out and functional annotation was performed. In addition, TFs were identified and a global transcriptional regulatory network was constructed. Integrated analysis gave rise to a total of 1,238 DEGs in lung adenocarcinoma when compared with healthy tissues, including 970 upregulated and 268 downregulated DEGs. The six overexpressed outlier genes of ceruloplasmin, heparan sulfate 6‑O‑sulfotransferase 2, transmembrane protease serine 4, anillin actin binding protein, cellular retinoic acid binding protein 2 and cystatin SN may serve important roles in the development of lung adenocarcinoma. In addition, the downregulation of carbonic anhydrase 4 and S100 calcium binding protein A12 may render these effective diagnostic biomarkers. The results of the transcriptional regulatory network demonstrated that the hub nodes were sex determining region Y‑box 10, Spi‑B transcription factor and nuclear receptor subfamily 4 group A member 2. The four TFs, forkhead box D1, E74‑like ETS transcription factor 5, homeobox A5 and kruppel‑like factor 5, may warrant future investigations into their function in disease development. In conclusion, the present study provided for further studies a list of candidate genes and TFs for the detection and treatment of lung adenocarcinoma.
Affiliation(s)
- Baiwang Chen
- Intensive Care Unit, Jining No. 1 People's Hospital, Jining, Shandong 272011, P.R. China
- Shuhong Gao
- Intensive Care Unit, Jining No. 1 People's Hospital, Jining, Shandong 272011, P.R. China
- Changwei Ji
- Intensive Care Unit, Jining No. 1 People's Hospital, Jining, Shandong 272011, P.R. China
- Ge Song
- Intensive Care Unit, Jining No. 1 People's Hospital, Jining, Shandong 272011, P.R. China
46
Classification and Biomarker Genes Selection for Cancer Gene Expression Data Using Random Forest. Iranian Journal of Pathology 2017; 12:339-347. [PMID: 29563929 PMCID: PMC5844678] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2016] [Accepted: 05/13/2017] [Indexed: 12/02/2022]
Abstract
BACKGROUND & OBJECTIVE Microarray and next generation sequencing (NGS) data are important sources for finding helpful molecular patterns. However, the large volume of gene expression data makes it harder to identify the biomarkers associated with cancer. The random forest (RF) method can effectively handle large-p, small-n problems. Therefore, RF can be used to select and rank the genes for the diagnosis and effective treatment of cancer. METHODS The microarray gene expression data of colon, leukemia, and prostate cancers were collected from public databases. Primary preprocessing was done on them using the limma package, and then the RF classification method was applied to each dataset separately in R. Finally, the selected genes in each of the cancers were evaluated and compared with those of previous experimental studies, and their functions in molecular cancer processes were assessed. RESULT The RF method extracted very small sets of genes while retaining its predictive performance. For the colon cancer data set, the key genes DIEXF, GUCA2A, CA7, and IGHA1 were selected, with an accuracy of 87.39 and a precision of 85.45. The SNCA, USP20, and SNRPA1 genes were selected for prostate cancer, with an accuracy of 73.33 and a precision of 66.67. The key genes of the leukemia data set were BAG4, ANKHD1-EIF4EBP3, PLXNC1, and PCDH9, and the accuracy and precision were 100 and 95.24, respectively. CONCLUSION The results showed that most of the selected genes involved in cancer-related processes and pathways had been reported previously and play an important role in the shift from normal to abnormal cells.
47
Zhao J, Bodner G, Rewald B. Phenotyping: Using Machine Learning for Improved Pairwise Genotype Classification Based on Root Traits. Frontiers in Plant Science 2016; 7:1864. [PMID: 27999587 PMCID: PMC5138212 DOI: 10.3389/fpls.2016.01864] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Received: 07/25/2016] [Accepted: 11/25/2016] [Indexed: 05/29/2023]
Abstract
Phenotyping local crop cultivars is becoming increasingly important, as they are an important genetic source for breeding, especially with regard to inherent root system architectures. Machine learning algorithms are promising tools to assist in the analysis of complex data sets; novel approaches are needed to apply them to root phenotyping data of mature plants. A greenhouse experiment was conducted in large, sand-filled columns to differentiate 16 European Pisum sativum cultivars based on 36 manually derived root traits. By combining random forest and support vector machine models, machine learning algorithms were successfully used for unbiased identification of the most distinguishing root traits and subsequent pairwise cultivar differentiation. Up to 86% of pea cultivar pairs could be distinguished based on the top five important root traits (Timp5); Timp5 differed widely between cultivar pairs. Selecting the top important root traits (Timp) provided a significantly improved classification compared to using all available traits or randomly selected trait sets. The most frequent Timp of mature pea cultivars was the total surface area of lateral roots originating from tap root segments at 0-5 cm depth. The high classification rate implies that culturing did not lead to a major loss of variability in root system architecture in the studied pea cultivars. Our results illustrate the potential of machine learning approaches for unbiased (root) trait selection and cultivar classification based on rather small, complex phenotypic data sets derived from pot experiments. Powerful statistical approaches are essential to make use of the increasing amount of (root) phenotyping information, integrating the complex trait sets describing crop cultivars.
Affiliation(s)
- Jiangsan Zhao
  - Department of Forest and Soil Sciences, University of Natural Resources and Life Sciences, Vienna, Austria
- Gernot Bodner
  - Division of Agronomy, Department of Crop Sciences, University of Natural Resources and Life Sciences, Vienna, Austria
- Boris Rewald
  - Department of Forest and Soil Sciences, University of Natural Resources and Life Sciences, Vienna, Austria
48
Dietrich S, Floegel A, Troll M, Kühn T, Rathmann W, Peters A, Sookthai D, von Bergen M, Kaaks R, Adamski J, Prehn C, Boeing H, Schulze MB, Illig T, Pischon T, Knüppel S, Wang-Sattler R, Drogan D. Random Survival Forest in practice: a method for modelling complex metabolomics data in time to event analysis. Int J Epidemiol 2016; 45:1406-1420. [PMID: 27591264 DOI: 10.1093/ije/dyw145] [Citation(s) in RCA: 54] [Impact Index Per Article: 6.8] [Accepted: 05/20/2016] [Indexed: 11/14/2022] Open
Abstract
BACKGROUND The application of metabolomics in prospective cohort studies is statistically challenging. Given the importance of appropriate statistical methods for selecting disease-associated metabolites in highly correlated complex data, we combined random survival forest (RSF) with an automated backward elimination procedure that addresses such issues. METHODS Our RSF approach was illustrated with data from the European Prospective Investigation into Cancer and Nutrition (EPIC)-Potsdam study, with concentrations of 127 serum metabolites as exposure variables and time to development of type 2 diabetes mellitus (T2D) as the outcome variable. A Cox regression analysis of this data set with a stepwise selection method was recently published. The methodological comparison (RSF and Cox regression) was replicated in two independent cohorts. Finally, the R code for implementing the metabolite selection procedure in the RSF syntax is provided. RESULTS Applying the RSF approach in EPIC-Potsdam identified 16 incident T2D-associated metabolites, which slightly improved prediction of T2D when used in addition to traditional T2D risk factors and also when used together with classical biomarkers. The identified metabolites partly agreed with previous findings using Cox regression, though RSF selected a higher number of highly correlated metabolites. CONCLUSIONS The RSF method appeared to be a promising approach for identifying disease-associated variables in complex data with time to event as the outcome. The demonstrated RSF approach provides findings comparable to those of the commonly used Cox regression, but also addresses the problem of multicollinearity and is suitable for high-dimensional data.
Affiliation(s)
- Stefan Dietrich
  - Department of Epidemiology, German Institute of Human Nutrition, Nuthetal, Germany
- Anna Floegel
  - Department of Epidemiology, German Institute of Human Nutrition, Nuthetal, Germany
- Martina Troll
  - Research Unit of Molecular Epidemiology; Institute of Epidemiology II, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Tilman Kühn
  - Division of Cancer Epidemiology, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Wolfgang Rathmann
  - Institute for Biometrics and Epidemiology, Leibniz Center for Diabetes Research at Heinrich Heine University, Germany; German Center for Diabetes Research (DZD), München-Neuherberg, Germany
- Anette Peters
  - Institute of Epidemiology II, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany; German Center for Diabetes Research (DZD), München-Neuherberg, Germany; Department of Environmental Health, Harvard School of Public Health, Boston, MA, USA
- Disorn Sookthai
  - Division of Cancer Epidemiology, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Martin von Bergen
  - Department of Molecular Systems Biology, Helmholtz Centre for Environmental Research (UFZ), Institute of Biochemistry, Faculty of Biosciences, Pharmacy and Psychology, University of Leipzig, Leipzig, Germany; Department of Chemistry and Bioscience, University of Aalborg, Aalborg East, Denmark
- Rudolf Kaaks
  - Division of Cancer Epidemiology, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Jerzy Adamski
  - German Center for Diabetes Research (DZD), München-Neuherberg, Germany; Institute of Experimental Genetics, Genome Analysis Center, Helmholtz Zentrum München, German Research Center for Environmental Health, München-Neuherberg, Germany; Lehrstuhl für Experimentelle Genetik, Technische Universität München, Freising-Weihenstephan, Germany
- Cornelia Prehn
  - Institute of Experimental Genetics, Genome Analysis Center, Helmholtz Zentrum München, German Research Center for Environmental Health, München-Neuherberg, Germany
- Heiner Boeing
  - Department of Epidemiology, German Institute of Human Nutrition, Nuthetal, Germany
- Matthias B Schulze
  - German Center for Diabetes Research (DZD), München-Neuherberg, Germany; Department of Molecular Epidemiology, German Institute of Human Nutrition, Nuthetal, Germany
- Thomas Illig
  - Research Unit of Molecular Epidemiology; Hannover Unified Biobank, and Institute for Human Genetics, Hannover, Germany
- Tobias Pischon
  - Department of Epidemiology, German Institute of Human Nutrition, Nuthetal, Germany; Molecular Epidemiology Group, Max Delbrück Center for Molecular Medicine (MDC) Berlin-Buch, Berlin, Germany
- Sven Knüppel
  - Department of Epidemiology, German Institute of Human Nutrition, Nuthetal, Germany
- Rui Wang-Sattler
  - Research Unit of Molecular Epidemiology; Institute of Epidemiology II, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany; German Center for Diabetes Research (DZD), München-Neuherberg, Germany
- Dagmar Drogan
  - Department of Epidemiology, German Institute of Human Nutrition, Nuthetal, Germany
49
Ma C, Sastry KS, Flore M, Gehani S, Al-Bozom I, Feng Y, Serpedin E, Chouchane L, Chen Y, Huang Y. CrossLink: a novel method for cross-condition classification of cancer subtypes. BMC Genomics 2016; 17 Suppl 7:549. [PMID: 27556419 PMCID: PMC5001207 DOI: 10.1186/s12864-016-2903-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND We considered the prediction of cancer classes (e.g. subtypes) using patient gene expression profiles that contain both systematic and condition-specific biases when compared with the training reference dataset. Conventional normalization-based approaches cannot guarantee that the gene signatures in the reference and prediction datasets always have the same distribution under all conditions, because the class-specific gene signatures change with the condition. A trained classifier may therefore work well under one condition but not under another. METHODS To address this limitation of current normalization approaches, we propose a novel algorithm called CrossLink (CL). CL recognizes that there is no universal, condition-independent normalization mapping of signatures. Instead, it exploits the fact that the signature is unique to its associated class under any condition and thus employs an unsupervised clustering algorithm to discover this unique signature. RESULTS We assessed the performance of CL for cross-condition predictions of PAM50 subtypes of breast cancer by using a simulated dataset modeled after TCGA BRCA tumor samples with a cross-validation scheme, and datasets with known and unknown PAM50 classification. CL achieved a prediction accuracy >73 %, the highest among the methods we evaluated. We also applied the algorithm to a set of breast cancer tumors derived from an Arab population to assign a PAM50 classification to each tumor based on its gene expression profile. CONCLUSIONS A novel algorithm, CrossLink, was proposed for cross-condition prediction of cancer classes. In all test datasets, CL showed robust and consistent improvement in prediction performance over other state-of-the-art normalization and classification algorithms.
Affiliation(s)
- Chifeng Ma
  - Department of Electrical and Computer Engineering, University of Texas at San Antonio, San Antonio, TX, USA
- Konduru S Sastry
  - Weill Cornell Medicine-Qatar, Doha, Qatar; Division of Translational Medicine, Sidra Medical and Research Center, Doha, Qatar
- Mario Flore
  - Department of Electrical and Computer Engineering, University of Texas at San Antonio, San Antonio, TX, USA
- Yusheng Feng
  - Department of Mechanical Engineering, University of Texas at San Antonio, San Antonio, TX, USA
- Erchin Serpedin
  - Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, USA
- Yidong Chen
  - Department of Epidemiology and Biostatistics, University of Texas Health Science Center at San Antonio, San Antonio, TX, USA; Greehey Children's Cancer Research Institute, University of Texas Health Science Center at San Antonio, San Antonio, TX, USA
- Yufei Huang
  - Department of Electrical and Computer Engineering, University of Texas at San Antonio, San Antonio, TX, USA; Greehey Children's Cancer Research Institute, University of Texas Health Science Center at San Antonio, San Antonio, TX, USA
50
Mapiye DS, Christoffels AG, Gamieldien J. Identification of phenotype-relevant differentially expressed genes in breast cancer demonstrates enhanced quantile discretization protocol's utility in multi-platform microarray data integration. J Bioinform Comput Biol 2016; 14:1650022. [PMID: 27411306 DOI: 10.1142/s0219720016500220] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/18/2022]
Abstract
Microarray transcriptomics experiments often suffer from limited statistical power due to small sample size. Quantile discretization (QD) maps expression values for a sample into a series of equivalently sized 'bins' that represent a discrete numerical range, e.g. [Formula: see text]4 to [Formula: see text]4, which enables normalized data from multiple experiments and/or expression platforms to be combined for re-analysis. We found, however, that informal selection of bin numbers often resulted in loss of the underlying correlation structure in the data, because the same numerical value was assigned to genes that are in reality expressed at significantly different levels within a sample. Here we report a procedure for determining an optimal bin number for a dataset. Applying this to integrated public breast cancer datasets enabled statistical identification of several differentially expressed tumorigenesis-related genes that were not found when analyzing the individual datasets, and also several cancer biomarkers not previously indicated as having utility in the disease. Notably, differential modulation of translational control and protein synthesis via multiple pathways was found to potentially have a central role in breast cancer development and progression. These findings suggest that our protocol has significant utility in making meaningful novel biomedical discoveries by leveraging the large public expression data repositories.
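As a rough illustration of the quantile discretization idea described in this abstract, the following Python sketch ranks the expression values within one sample and assigns them to equally sized bins labelled with integers centred on zero. The function name `quantile_discretize`, the default of nine bins (labels -4 to 4), and the tie-breaking rule are assumptions made for illustration only; this is not the authors' protocol, and it does not implement their procedure for choosing an optimal bin number.

```python
def quantile_discretize(values, n_bins=9):
    """Map one sample's expression values to integer bin labels centred on zero.

    Hypothetical helper: the name, default bin count, and label range are
    assumptions for illustration, not taken from the cited paper.
    """
    if n_bins < 1:
        raise ValueError("n_bins must be positive")
    n = len(values)
    # Rank genes by expression within the sample (ties keep input order).
    order = sorted(range(n), key=lambda i: values[i])
    offset = (n_bins - 1) // 2
    labels = [0] * n
    for rank, idx in enumerate(order):
        # Equal-sized bins: rank r out of n falls in bin floor(r * n_bins / n).
        labels[idx] = rank * n_bins // n - offset
    return labels


sample = [5.0, 1.0, 3.0, 9.0, 7.0, 2.0, 8.0, 4.0, 6.0]
fine = quantile_discretize(sample, n_bins=9)
coarse = quantile_discretize(sample, n_bins=3)
print(fine)
print(coarse)
```

With nine values and nine bins, each gene receives its own label, whereas three bins give several genes with different expression levels the same label. This is the kind of information loss the abstract attributes to an informally chosen bin number.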
Affiliation(s)
- Darlington S Mapiye
  - South African National Bioinformatics Institute/MRC, Unit for Bioinformatics Capacity Development, University of the Western Cape, Private Bag X17, Bellville 7535, South Africa
- Alan G Christoffels
  - South African National Bioinformatics Institute/MRC, Unit for Bioinformatics Capacity Development, University of the Western Cape, Private Bag X17, Bellville 7535, South Africa
- Junaid Gamieldien
  - South African National Bioinformatics Institute/MRC, Unit for Bioinformatics Capacity Development, University of the Western Cape, Private Bag X17, Bellville 7535, South Africa