Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Jiang H, Deng Y, Chen HS, Tao L, Sha Q, Chen J, Tsai CJ, Zhang S. Joint analysis of two microarray gene-expression data sets to select lung adenocarcinoma marker genes. BMC Bioinformatics 2004;5:81. [PMID: 15217521 PMCID: PMC476733 DOI: 10.1186/1471-2105-5-81] [Citation(s) in RCA: 194] [Impact Index Per Article: 9.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2004] [Accepted: 06/24/2004] [Indexed: 11/29/2022] Open

Cited by Other Article(s)

Javed MF, Fawad M, Lodhi R, Najeh T, Gamil Y. Forecasting the strength of preplaced aggregate concrete using interpretable machine learning approaches. Sci Rep 2024;14:8381. [PMID: 38600161 PMCID: PMC11006863 DOI: 10.1038/s41598-024-57896-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2023] [Accepted: 03/22/2024] [Indexed: 04/12/2024] Open

Abstract

Preplaced aggregate concrete (PAC), also known as two-stage concrete (TSC), is widely used in construction engineering for various applications. To produce PAC, a mixture of Portland cement, sand, and admixtures is injected into a mold after the coarse aggregate has been placed. This process complicates the prediction of compressive strength (CS), demanding thorough investigation. Consequently, this study focuses on improving the understanding of PAC compressive strength using machine learning models. Thirteen models are evaluated with 261 data points and eleven input variables. The results show that XGBoost is the most accurate, with a correlation coefficient of 0.9791 and a normalized coefficient of determination (R2) of 0.9583. Gradient boosting (GB) and CatBoost (CB) also perform well. In addition, AdaBoost, the voting regressor, and random forest yield precise predictions with low mean absolute error (MAE) and root mean square error (RMSE) values. The sensitivity analysis (SA) reveals the significant impact of key input parameters on overall model sensitivity. Notably, gravel takes the lead with a substantial 44.7% contribution, followed by sand at 19.5%, cement at 15.6%, and fly ash and GGBS at 5.9% and 5.1%, respectively. The best-fit model, XGBoost, was used for SHAP analysis to assess the relative importance of the contributing attributes and optimize the input variables. The SHAP analysis identified the water-to-binder (W/B) ratio, superplasticizer, and gravel as the most significant factors influencing the CS of PAC. Furthermore, a graphical user interface (GUI) has been developed for practical use in predicting concrete strength. This simplifies the process and offers a valuable tool for applying the model in civil engineering. This comprehensive evaluation provides valuable insights to researchers and practitioners, empowering them to make informed choices in predicting PAC compressive strength in construction projects. By enhancing the reliability and applicability of predictive models, this study contributes to the field of preplaced aggregate concrete strength prediction.
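
The abstract above compares models by correlation coefficient, R2, MAE, and RMSE. As a rough illustration of how these four scores are usually computed, here is a minimal Python sketch. The function name `regression_metrics` and the strength values are hypothetical and not taken from the paper, and the plain textbook definition of R2 is assumed (the abstract's "normalized" variant is not specified).

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return Pearson r, R^2, MAE and RMSE for paired observations/predictions."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    errors = [p - t for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n        # mean absolute error
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = sqrt(ss_res / n)                      # root mean square error

    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot                   # coefficient of determination

    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    r = cov / sqrt(ss_tot * var_p)               # Pearson correlation coefficient

    return {"r": r, "r2": r2, "mae": mae, "rmse": rmse}


# Made-up compressive strengths (illustrative only, not data from the study).
observed = [20.0, 30.0, 40.0, 50.0]
predicted = [22.0, 28.0, 41.0, 49.0]
scores = regression_metrics(observed, predicted)
print(scores)
```

MAE and RMSE are in the units of the target, while r and R2 are unitless, which is why the abstract reports both kinds of score.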

Borisov N, Tkachev V, Simonov A, Sorokin M, Kim E, Kuzmin D, Karademir-Yilmaz B, Buzdin A. Uniformly shaped harmonization combines human transcriptomic data from different platforms while retaining their biological properties and differential gene expression patterns. Front Mol Biosci 2023;10:1237129. [PMID: 37745690 PMCID: PMC10511763 DOI: 10.3389/fmolb.2023.1237129] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/08/2023] [Accepted: 08/28/2023] [Indexed: 09/26/2023] Open

Abstract

Introduction: Co-normalization of RNA profiles obtained using different experimental platforms and protocols opens an avenue for comprehensive comparison of relevant features such as differentially expressed genes associated with disease. Currently, most bioinformatic tools perform normalization in a flexible format that depends on the individual datasets under analysis, so the outputs of such normalizations are poorly compatible with each other. Recently we proposed a new approach to gene expression data normalization, termed Shambhala, which returns harmonized data in a uniform shape, where every expression profile is transformed into a pre-defined universal format. We previously showed that following shambhalization of human RNA profiles, overall tissue-specific clustering features are strongly retained while platform-specific clustering is dramatically reduced. Methods: Here, we tested how well Shambhala retains fold-change gene expression features and other functional characteristics of gene clusters, such as pathway activation levels and predicted cancer drug activity scores. Results: Using 6,793 cancer and 11,135 normal tissue gene expression profiles from the literature and experimental datasets, we applied twelve performance criteria to different versions of Shambhala and to other methods of transcriptomic harmonization with flexible output data format. These criteria dealt with biological type classifiers, hierarchical clustering, correlation/regression properties, stability of drug efficiency scores, and data quality for machine learning classifiers. Discussion: The Shambhala-2 harmonizer demonstrated the best results, with correlation and linear regression coefficients close to 1 for the comparison of training vs validation datasets and less than half the instability of other methods in the calculation of drug efficiency scores.

Bailey R, Sarkar A, Singh A, Dobra A, Kahveci T. Optimal Supervised Reduction of High Dimensional Transcription Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023;20:3093-3105. [PMID: 37276117 DOI: 10.1109/tcbb.2023.3280557] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]

Wang Q, Runhaar J, Kloppenburg M, Boers M, Bijlsma JWJ, Bacardit J, Bierma-Zeinstra SMA. A machine learning approach reveals features related to clinicians' diagnosis of clinically relevant knee osteoarthritis. Rheumatology (Oxford) 2023;62:2732-2739. [PMID: 36534939 DOI: 10.1093/rheumatology/keac707] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2022] [Accepted: 12/09/2022] [Indexed: 08/03/2023] Open

Chiu Y, Ni C, Huang Y. Deconvolution of bulk gene expression profiles reveals the association between immune cell polarization and the prognosis of hepatocellular carcinoma patients. Cancer Med 2023;12:15736-15760. [PMID: 37366298 PMCID: PMC10417088 DOI: 10.1002/cam4.6197] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2022] [Revised: 05/02/2023] [Accepted: 05/23/2023] [Indexed: 06/28/2023] Open

Shams B, Reisch K, Vajkoczy P, Lippert C, Picht T, Fekonja LS. Improved prediction of glioma-related aphasia by diffusion MRI metrics, machine learning, and automated fiber bundle segmentation. Hum Brain Mapp 2023. [PMID: 37318944 PMCID: PMC10365236 DOI: 10.1002/hbm.26393] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/20/2023] [Revised: 05/07/2023] [Accepted: 05/26/2023] [Indexed: 06/17/2023] Open

Liu Z, Zhang T, Lin L, Long F, Guo H, Han L. Applications of radiomics-based analysis pipeline for predicting epidermal growth factor receptor mutation status. Biomed Eng Online 2023;22:17. [PMID: 36810090 PMCID: PMC9945395 DOI: 10.1186/s12938-022-01049-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/02/2022] [Accepted: 11/04/2022] [Indexed: 02/24/2023] Open

Kheyfets VO, Sweatt AJ, Gomberg-Maitland M, Ivy DD, Condliffe R, Kiely DG, Lawrie A, Maron BA, Zamanian RT, Stenmark KR. Computational platform for doctor-artificial intelligence cooperation in pulmonary arterial hypertension prognostication: a pilot study. ERJ Open Res 2023;9:00484-2022. [PMID: 36776484 PMCID: PMC9907150 DOI: 10.1183/23120541.00484-2022] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/24/2022] [Accepted: 10/20/2022] [Indexed: 11/25/2022] Open

Abstract

Background

Pulmonary arterial hypertension (PAH) is a heterogeneous and complex pulmonary vascular disease associated with substantial morbidity. Machine-learning algorithms (used in many PAH risk calculators) can combine established parameters with thousands of circulating biomarkers to optimise PAH prognostication, but these approaches do not offer the clinician insight into what parameters drove the prognosis. The approach proposed in this study diverges from other contemporary phenotyping methods by identifying patient-specific parameters driving clinical risk.

Methods

We trained a random forest algorithm to predict 4-year survival risk in a cohort of 167 adult PAH patients evaluated at Stanford University, with 20% withheld for (internal) validation. Another cohort of 38 patients from Sheffield University were used as a secondary (external) validation. Shapley values, borrowed from game theory, were computed to rank the input parameters based on their importance to the predicted risk score for the entire trained random forest model (global importance) and for an individual patient (local importance).

Results

Between the internal and external validation cohorts, the random forest model predicted 4-year risk of death/transplant with sensitivity and specificity of 71.0-100% and 81.0-89.0%, respectively. The model reinforced the importance of established prognostic markers, but also identified novel inflammatory biomarkers that predict risk in some PAH patients.

Conclusion

These results stress the need for advancing individualised phenotyping strategies that integrate clinical and biochemical data with outcome. The computational platform presented in this study offers a critical step towards personalised medicine in which a clinician can interpret an algorithm's assessment of an individual patient.
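
The Methods above rank input parameters by Shapley values, per patient (local importance) and across the model (global importance). The following is a minimal sketch of the exact Shapley computation for a single sample, assuming that "absent" features are replaced by a baseline value. The function `toy_risk` is a hypothetical stand-in for a trained model, not the random forest from the study, and all numbers are invented for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of each feature for one sample x.

    Features outside a coalition are set to their baseline value
    (an assumption; other ways of handling absent features exist).
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(len(others) + 1):
            # Weight of a coalition of size k that does not contain feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                with_i = [x[j] if (j in subset or j == i) else baseline[j]
                          for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j]
                             for j in range(n)]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def toy_risk(features):
    # Hypothetical risk score with an interaction between features a and c.
    a, b, c = features
    return 0.5 * a + 0.2 * b + 0.3 * a * c


patient = [1.0, 2.0, 1.0]    # invented values for one patient
baseline = [0.0, 0.0, 0.0]   # invented reference values
phi = shapley_values(toy_risk, patient, baseline)
print(phi)
```

The values in `phi` are local importances for this one sample, and they sum to the difference between the patient's score and the baseline score. A global ranking is commonly obtained by averaging absolute local values over many samples. Exact enumeration grows exponentially with the number of features, so practical tools approximate it.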

Affiliation(s)

Vitaly O. Kheyfets Paediatric Critical Care Medicine, Developmental Lung Biology and CVP Research Laboratories, School of Medicine, University of Colorado, Aurora, CO, USA
Andrew J. Sweatt Division of Pulmonary and Critical Care Medicine, Stanford University, Stanford, CA, USA; Vera Moulton Wall Center for Pulmonary Vascular Disease, Stanford University, Stanford, CA, USA
Mardi Gomberg-Maitland Division of Cardiology, George Washington University Hospital, Washington, DC, USA
Dunbar D. Ivy Department of Paediatric Cardiology, Children's Hospital Colorado, Aurora, CO, USA
Robin Condliffe Sheffield Pulmonary Vascular Disease Unit, Sheffield Teaching Hospitals NHS Foundation Trust, Royal Hallamshire Hospital, Sheffield, UK
David G. Kiely Sheffield Pulmonary Vascular Disease Unit, Sheffield Teaching Hospitals NHS Foundation Trust, Royal Hallamshire Hospital, Sheffield, UK; Department of Infection, Immunity and Cardiovascular Disease, University of Sheffield, Sheffield, UK; Insigneo Institute for in-silico Medicine, University of Sheffield, Sheffield, UK
Allan Lawrie Sheffield Pulmonary Vascular Disease Unit, Sheffield Teaching Hospitals NHS Foundation Trust, Royal Hallamshire Hospital, Sheffield, UK; Department of Infection, Immunity and Cardiovascular Disease, University of Sheffield, Sheffield, UK; Insigneo Institute for in-silico Medicine, University of Sheffield, Sheffield, UK
Bradley A. Maron Division of Cardiovascular Medicine, Brigham and Women's Hospital and Harvard Medical School, Harvard University, Boston, MA, USA
Roham T. Zamanian Division of Pulmonary and Critical Care Medicine, Stanford University, Stanford, CA, USA; Vera Moulton Wall Center for Pulmonary Vascular Disease, Stanford University, Stanford, CA, USA
Kurt R. Stenmark Paediatric Critical Care Medicine, Developmental Lung Biology and CVP Research Laboratories, School of Medicine, University of Colorado, Aurora, CO, USA

Mapping Mediterranean maquis formations using Sentinel-2 time-series. ECOL INFORM 2022. [DOI: 10.1016/j.ecoinf.2022.101814] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]

Borisov N, Buzdin A. Transcriptomic Harmonization as the Way for Suppressing Cross-Platform Bias and Batch Effect. Biomedicines 2022;10:2318. [PMID: 36140419 PMCID: PMC9496268 DOI: 10.3390/biomedicines10092318] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/22/2022] [Revised: 09/14/2022] [Accepted: 09/16/2022] [Indexed: 11/16/2022] Open

Soriano MA, Deziel NC, Saiers JE. Regional Scale Assessment of Shallow Groundwater Vulnerability to Contamination from Unconventional Hydrocarbon Extraction. ENVIRONMENTAL SCIENCE & TECHNOLOGY 2022;56:12126-12136. [PMID: 35960643 PMCID: PMC9454823 DOI: 10.1021/acs.est.2c00470] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/20/2022] [Revised: 07/29/2022] [Accepted: 08/01/2022] [Indexed: 05/19/2023]

Huang HH, Rao H, Miao R, Liang Y. A novel meta-analysis based on data augmentation and elastic data shared lasso regularization for gene expression. BMC Bioinformatics 2022;23:353. [PMID: 35999505 PMCID: PMC9396780 DOI: 10.1186/s12859-022-04887-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/08/2022] [Accepted: 08/10/2022] [Indexed: 12/22/2022] Open

Abstract

Background

Gene expression analysis can provide useful information for analyzing complex biological mechanisms. However, many reported findings are unrepeatable due to small sample sizes relative to a large number of genes and the low signal-to-noise ratios of most gene expression datasets.

Results

Meta-analysis of multi-data sets is an efficient method for tackling the above problem. To improve the performance of meta-analysis, we propose a novel meta-analysis framework. It consists of two parts: (1) a novel data augmentation strategy. Various cross-platform normalization methods exist, which can preserve original biological information of gene expression datasets from different angles and add different “perturbations” to the dataset. Using such perturbation, we provide a feasible means for gene expression data augmentation; (2) elastic data shared lasso (DSL-$L_2$). The DSL-$L_2$ method spans the continuum between individual models for each dataset and one model for all datasets. It also overcomes the shortcomings of the data shared lasso method when dealing with highly correlated features. Comprehensive simulation experiment results show that the proposed method has high prediction and gene selection performance. We then apply the proposed method to non-small cell lung cancer (NSCLC) blood gene expression data in order to identify key tumor-related genes. The outcomes of our experiment indicate that the method could be used for identifying a set of robust disease-related gene signatures that may be used for NSCLC early diagnosis or prognosis or even targeting.

Conclusion

We propose a novel and effective meta-analysis method for biological research, extrapolating and integrating information from multiple gene expression datasets.

Invasion success of a freshwater fish corresponds to low dissolved oxygen and diminished riparian integrity. Biol Invasions 2022. [DOI: 10.1007/s10530-022-02827-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/02/2022]

Ganaie M, Tanveer M, Suganthan P, Snasel V. Oblique and rotation double random forest. Neural Netw 2022;153:496-517. [DOI: 10.1016/j.neunet.2022.06.012] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/05/2021] [Revised: 05/25/2022] [Accepted: 06/09/2022] [Indexed: 10/18/2022]

An iterative model-free feature screening procedure: Forward recursive selection. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.108745] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/18/2022]

Borisov N, Sorokin M, Zolotovskaya M, Borisov C, Buzdin A. Shambhala-2: A Protocol for Uniformly Shaped Harmonization of Gene Expression Profiles of Various Formats. Curr Protoc 2022;2:e444. [PMID: 35617464 DOI: 10.1002/cpz1.444] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/11/2022]

Abstract

Uniformly shaped harmonization of gene expression profiles is central for the simultaneous comparison of multiple gene expression datasets. It is expected to operate with the gene expression data obtained using various experimental methods and equipment, and to return harmonized profiles in a uniform shape. Such uniformly shaped expression profiles from different initial datasets can be further compared directly. However, current harmonization techniques have strong limitations that prevent their broad use for bioinformatic applications. They can either operate with only up to two datasets/platforms or return data in a dynamic format that will be different for every comparison under analysis. This also does not allow for adding new data to the previously harmonized dataset(s), which complicates the analysis and increases calculation costs. We propose here a new method termed Shambhala-2 that can transform multi-platform expression data into a universal format that is identical for all harmonizations made using this technique. Shambhala-2 is based on sample-by-sample cubic conversion of the initial expression dataset into a preselected shape of the reference definitive dataset. Using 8390 samples of 12 healthy human tissue types and 4086 samples of colorectal, kidney, and lung cancer tissues, we verified Shambhala-2's capacity in restoring tissue-specific expression patterns for seven microarray and three RNA sequencing platforms. Shambhala-2 performed well for all tested combinations of RNAseq and microarray profiles, and retained gene-expression ranks, as evidenced by high correlations between different single- or aggregated gene expression metrics in pre- and post-Shambhalized samples, including preserving cancer-specific gene expression and pathway activation features. © 2022 Wiley Periodicals LLC. 
Basic Protocol: Shambhala-2 harmonizer
Alternate Protocol 1: Linear Shambhala/Shambhala-1
Alternate Protocol 2: Alternative (flexible-format and uniformly shaped) normalization methods
Support Protocol 1: Watermelon multisection (WM)
Support Protocol 2: Calculation of cancer-to-normal log-fold-change (LFC) and pathway activation level (PAL)

Guragain P, Båtnes AS, Zobolas J, Olsen Y, Bones AM, Winge P. IIb-RAD-sequencing coupled with random forest classification indicates regional population structuring and sex-specific differentiation in salmon lice (Lepeophtheirus salmonis). Ecol Evol 2022;12:e8809. [PMID: 35414904 PMCID: PMC8986551 DOI: 10.1002/ece3.8809] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/04/2022] [Revised: 03/18/2022] [Accepted: 03/22/2022] [Indexed: 11/29/2022] Open

Eggers B, Schork K, Turewicz M, Barkovits K, Eisenacher M, Schröder R, Clemen CS, Marcus K. Advanced Fiber Type-Specific Protein Profiles Derived from Adult Murine Skeletal Muscle. Proteomes 2021;9:proteomes9020028. [PMID: 34201234 PMCID: PMC8293376 DOI: 10.3390/proteomes9020028] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/14/2021] [Revised: 06/01/2021] [Accepted: 06/02/2021] [Indexed: 02/07/2023] Open

Affiliation(s)

Britta Eggers Medizinisches Proteom-Center, Medical Faculty, Ruhr-University Bochum, 44801 Bochum, Germany; Medical Proteome Analysis, Center for Protein Diagnostics (PRODI), Ruhr-University Bochum, 44801 Bochum, Germany (corresponding author)
Karin Schork Medizinisches Proteom-Center, Medical Faculty, Ruhr-University Bochum, 44801 Bochum, Germany; Medical Proteome Analysis, Center for Protein Diagnostics (PRODI), Ruhr-University Bochum, 44801 Bochum, Germany
Michael Turewicz Medizinisches Proteom-Center, Medical Faculty, Ruhr-University Bochum, 44801 Bochum, Germany; Medical Proteome Analysis, Center for Protein Diagnostics (PRODI), Ruhr-University Bochum, 44801 Bochum, Germany
Katalin Barkovits Medizinisches Proteom-Center, Medical Faculty, Ruhr-University Bochum, 44801 Bochum, Germany; Medical Proteome Analysis, Center for Protein Diagnostics (PRODI), Ruhr-University Bochum, 44801 Bochum, Germany
Martin Eisenacher Medizinisches Proteom-Center, Medical Faculty, Ruhr-University Bochum, 44801 Bochum, Germany; Medical Proteome Analysis, Center for Protein Diagnostics (PRODI), Ruhr-University Bochum, 44801 Bochum, Germany
Rolf Schröder Institute of Neuropathology, University Hospital Erlangen, Friedrich-Alexander University Erlangen-Nürnberg, 91054 Erlangen, Germany
Christoph S. Clemen German Aerospace Center, Institute of Aerospace Medicine, 51147 Cologne, Germany; Center for Physiology and Pathophysiology, Institute of Vegetative Physiology, Medical Faculty, University of Cologne, 50931 Cologne, Germany
Katrin Marcus Medizinisches Proteom-Center, Medical Faculty, Ruhr-University Bochum, 44801 Bochum, Germany; Medical Proteome Analysis, Center for Protein Diagnostics (PRODI), Ruhr-University Bochum, 44801 Bochum, Germany (corresponding author)

Speiser JL. A random forest method with feature selection for developing medical prediction models with clustered and longitudinal data. J Biomed Inform 2021;117:103763. [PMID: 33781921 PMCID: PMC8131242 DOI: 10.1016/j.jbi.2021.103763] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/15/2020] [Revised: 03/03/2021] [Accepted: 03/23/2021] [Indexed: 12/22/2022]

Abstract

BACKGROUND

Machine learning methodologies are gaining popularity for developing medical prediction models for datasets with a large number of predictors, particularly in the setting of clustered and longitudinal data. Binary Mixed Model (BiMM) forest is a promising machine learning algorithm which may be applied to develop prediction models for clustered and longitudinal binary outcomes. Although machine learning methods for clustered and longitudinal data, such as BiMM forest, exist, feature selection for them has not been analyzed via data simulations. Feature selection improves the practicality and ease of use of prediction models for clinicians by reducing the burden of data collection. Thus, feature selection procedures are not only beneficial, but are often necessary for development of medical prediction models. In this study, we aim to assess feature selection within the BiMM forest setting for modeling clustered and longitudinal binary outcomes.

METHODS

We conducted a simulation study to compare BiMM forest with feature selection (backward elimination or stepwise selection) to standard generalized linear mixed model feature selection methods (shrinkage and backward elimination). We also evaluated feature selection methods to develop models predicting mobility disability in older adults using the Health, Aging and Body Composition Study dataset as an example utilization of the proposed methodology.

RESULTS

BiMM forest with backward elimination generally offered higher computational efficiency, similar or higher predictive performance (accuracy and area under the receiver operating curve), and similar or higher ability to identify correct features compared to linear methods for the different simulated scenarios. For predicting mobility disability in older adults, methods generally performed similarly in terms of accuracy, area under the receiver operating curve, and specificity; however, BiMM forest with backward elimination had the highest sensitivity.

CONCLUSIONS

This study is novel because it is the first investigation of feature selection for developing random forest prediction models for clustered and longitudinal binary outcomes. Results from the simulation study reveal that BiMM forest with backward elimination has the highest accuracy (performance and identification of correct features) and lowest computation time compared to other feature selection methods in some scenarios and similar performance in other scenarios. Many informatics datasets have clustered and longitudinal outcomes and results from this study suggest that BiMM forest with backward elimination may be beneficial for developing medical prediction models.

Yaşar Ş, Çolak C, Yoloğlu S. Artificial Intelligence-Based Prediction of Covid-19 Severity on the Results of Protein Profiling. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2021;202:105996. [PMID: 33631640 PMCID: PMC7882428 DOI: 10.1016/j.cmpb.2021.105996] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/18/2020] [Accepted: 02/06/2021] [Indexed: 05/21/2023]

Use of Machine Learning to Determine the Information Value of a BMI Screening Program. Am J Prev Med 2021;60:425-433. [PMID: 33483154 PMCID: PMC8610445 DOI: 10.1016/j.amepre.2020.10.016] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 01/28/2020] [Revised: 10/13/2020] [Accepted: 10/14/2020] [Indexed: 12/12/2022]

Suzuki T, Kano S, Suzuki M, Yasukawa S, Mizumachi T, Tsushima N, Hatanaka KC, Hatanaka Y, Matsuno Y, Homma A. Enhanced Angiogenesis in Salivary Duct Carcinoma Ex-Pleomorphic Adenoma. Front Oncol 2021;10:603717. [PMID: 33692941 PMCID: PMC7937931 DOI: 10.3389/fonc.2020.603717] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/07/2020] [Accepted: 12/30/2020] [Indexed: 11/23/2022] Open

Myall AC, Perkins S, Rushton D, David J, Spencer P, Jones AR, Antczak P. An OMICs based meta-analysis to support infection state stratification. Bioinformatics 2021;37:2347-2355. [PMID: 33560295 PMCID: PMC8388022 DOI: 10.1093/bioinformatics/btab089] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/02/2020] [Revised: 01/06/2021] [Accepted: 01/24/2021] [Indexed: 11/13/2022] Open

Abstract

MOTIVATION

A fundamental problem for disease treatment is that while antibiotics are a powerful counter to bacteria, they are ineffective against viruses. Often, bacterial and viral infections are confused due to their similar symptoms and lack of rapid diagnostics. With many clinicians relying primarily on symptoms for diagnosis, overuse and misuse of modern antibiotics are rife, contributing to the growing pool of antibiotic resistance. To ensure an individual receives optimal treatment given their disease state and to reduce over-prescription of antibiotics, the host response can in theory be measured quickly to distinguish between the two states. To establish a predictive biomarker panel of disease state (viral/bacterial/no-infection) we conducted a meta-analysis of human blood infection studies using Machine Learning (ML).

RESULTS

We focused on publicly available gene expression data from two widely used platforms, Affymetrix and Illumina microarrays as they represented a significant proportion of the available data. We were able to develop multi-class models with high accuracies with our best model predicting 93% of bacterial and 89% viral samples correctly. To compare the selected features in each of the different technologies, we reverse engineered the underlying molecular regulatory network and explored the neighbourhood of the selected features. The networks highlighted that although on the gene-level the models differed, they contained genes from the same areas of the network. Specifically, this convergence was to pathways including the Type I interferon Signalling Pathway, Chemotaxis, Apoptotic Processes, and Inflammatory/Innate Response.

AVAILABILITY

Data and code are available on the Gene Expression Omnibus and GitHub.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Yu Z, Fu Y, Ai J, Zhang J, Huang G, Deng Y. Development of predicitve models to distinguish metals from non-metal toxicants, and individual metal from one another. BMC Bioinformatics 2020;21:239. [PMID: 33272211 PMCID: PMC7712572 DOI: 10.1186/s12859-020-3525-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/19/2020] [Accepted: 04/29/2020] [Indexed: 11/29/2022] Open

Hu L, Liu B, Ji J, Li Y. Tree-Based Machine Learning to Identify and Understand Major Determinants for Stroke at the Neighborhood Level. J Am Heart Assoc 2020;9:e016745. [PMID: 33140687 PMCID: PMC7763737 DOI: 10.1161/jaha.120.016745] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 01/05/2023]

Abstract

Background

Stroke is a major cardiovascular disease that causes significant health and economic burden in the United States. Neighborhood community‐based interventions have been shown to be both effective and cost‐effective in preventing cardiovascular disease. There is a dearth of robust studies identifying the key determinants of cardiovascular disease and the underlying effect mechanisms at the neighborhood level. We aim to contribute to the evidence base for neighborhood cardiovascular health research.

Methods and Results

We created a new neighborhood health data set at the census tract level by integrating 4 types of potential predictors, including unhealthy behaviors, prevention measures, sociodemographic factors, and environmental measures from multiple data sources. We used 4 tree‐based machine learning techniques to identify the most critical neighborhood‐level factors in predicting the neighborhood‐level prevalence of stroke, and compared their predictive performance for variable selection. We further quantified the effects of the identified determinants on stroke prevalence using a Bayesian linear regression model. Of the 5 most important predictors identified by our method, higher prevalence of low physical activity, larger share of older adults, higher percentage of non‐Hispanic Black people, and higher ozone levels were associated with higher prevalence of stroke at the neighborhood level. Higher median household income was linked to lower prevalence. The most important interaction term showed an exacerbated adverse effect of aging and low physical activity on the neighborhood‐level prevalence of stroke.

Conclusions

Tree‐based machine learning provides insights into underlying drivers of neighborhood cardiovascular health by discovering the most important determinants from a wide range of factors in an agnostic, data‐driven, and reproducible way. The identified major determinants and the interactive mechanism can be used to prioritize and allocate resources to optimize community‐level interventions for stroke prevention.

S V, A J, R S, Mohan S, Bhattacharya S, Kaluri R, Feng G, Tariq U. Multi-modal prediction of breast cancer using particle swarm optimization with non-dominating sorting. INTERNATIONAL JOURNAL OF DISTRIBUTED SENSOR NETWORKS 2020;16:155014772097150. [DOI: 10.1177/1550147720971505] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 09/15/2023]

Abstract

Cancer is listed as the second leading cause of death across the world, with almost one person out of six dying of cancer. Breast cancer is one of the most common forms of cancer in women and has the second highest mortality rate in the world. Various scientific studies have been conducted to combat this disease, and machine learning approaches have been an extremely popular choice. Particle swarm optimization has been identified as one of the most powerful and efficient techniques for the diagnosis of breast cancer, guiding physicians towards timely and accurate treatment. It is also pertinent to mention that multi-modal prediction methods are used to make decisions depending upon different scenarios and aspects, whereas the non-dominating sorting feature is useful to sort different objects based on differing requirements. The main novelty of this work is that a multi-modal prediction algorithm for breast cancer prediction is proposed. The work encompasses the use of particle swarm optimization, non-dominating sorting and multi-classifier techniques, namely, the k-nearest neighbour method, fast decision tree and kernel density estimation. Finally, Bayes’ theorem is applied to revise the results to achieve optimum accuracy in breast cancer prediction. The proposed model, which combines particle swarm optimization and non-dominating sorting with classifier techniques, helps to select the most significant features relevant to breast cancer prediction. The selected features design the objective of the problem model. The proposed model is implemented on the WBCD and WDBC breast cancer data sets publicly available from the UCI machine learning data repository. The metrics considered are sensitivity, specificity, accuracy and time complexity. The experimental results of the study are evaluated against the state-of-the-art algorithms, namely, genetic algorithm kernel density estimation and particle swarm optimization kernel density estimation, wherein the results justify the superiority of the proposed model.

A Comparative Study of Random Forest and Genetic Engineering Programming for the Prediction of Compressive Strength of High Strength Concrete (HSC). APPLIED SCIENCES-BASEL 2020. [DOI: 10.3390/app10207330] [Citation(s) in RCA: 51] [Impact Index Per Article: 12.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/22/2022]

Zhang S, Shao J, Yu D, Qiu X, Zhang J. MatchMixeR: a cross-platform normalization method for gene expression data integration. Bioinformatics 2020;36:2486-2491. [PMID: 31904810 DOI: 10.1093/bioinformatics/btz974] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2019] [Revised: 09/19/2019] [Accepted: 12/31/2019] [Indexed: 01/18/2023] Open

Ke H, Wu Y, Wang R, Wu X. Creation of a Prognostic Risk Prediction Model for Lung Adenocarcinoma Based on Gene Expression, Methylation, and Clinical Characteristics. Med Sci Monit 2020;26:e925833. [PMID: 33021972 PMCID: PMC7549534 DOI: 10.12659/msm.925833] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/05/2023] Open

Zhang J, Xu D, Hao K, Zhang Y, Chen W, Liu J, Gao R, Wu C, De Marinis Y. FS-GBDT: identification multicancer-risk module via a feature selection algorithm by integrating Fisher score and GBDT. Brief Bioinform 2020;22:5901960. [PMID: 34020547 DOI: 10.1093/bib/bbaa189] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/21/2020] [Revised: 07/03/2020] [Accepted: 07/21/2020] [Indexed: 11/14/2022] Open

Accurate Nonendoscopic Detection of Barrett's Esophagus by Methylated DNA Markers: A Multisite Case Control Study. Am J Gastroenterol 2020;115:1201-1209. [PMID: 32558685 PMCID: PMC7415629 DOI: 10.14309/ajg.0000000000000656] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 12/11/2022]

Azodi CB, Tang J, Shiu SH. Opening the Black Box: Interpretable Machine Learning for Geneticists. Trends Genet 2020;36:442-455. [PMID: 32396837 DOI: 10.1016/j.tig.2020.03.005] [Citation(s) in RCA: 114] [Impact Index Per Article: 28.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/03/2020] [Revised: 03/12/2020] [Accepted: 03/16/2020] [Indexed: 01/16/2023]

Serra A, Fratello M, Cattelani L, Liampa I, Melagraki G, Kohonen P, Nymark P, Federico A, Kinaret PAS, Jagiello K, Ha MK, Choi JS, Sanabria N, Gulumian M, Puzyn T, Yoon TH, Sarimveis H, Grafström R, Afantitis A, Greco D. Transcriptomics in Toxicogenomics, Part III: Data Modelling for Risk Assessment. NANOMATERIALS (BASEL, SWITZERLAND) 2020;10:E708. [PMID: 32276469 PMCID: PMC7221955 DOI: 10.3390/nano10040708] [Citation(s) in RCA: 30] [Impact Index Per Article: 7.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/10/2020] [Revised: 03/25/2020] [Accepted: 03/26/2020] [Indexed: 12/30/2022]

Affiliation(s)

Angela Serra Faculty of Medicine and Health Technology, Tampere University, FI-33014 Tampere, Finland; (A.S.); (M.F.); (L.C.); (A.F.); (P.A.S.K.) BioMediTech Institute, Tampere University, FI-33014 Tampere, Finland
Michele Fratello Faculty of Medicine and Health Technology, Tampere University, FI-33014 Tampere, Finland; (A.S.); (M.F.); (L.C.); (A.F.); (P.A.S.K.) BioMediTech Institute, Tampere University, FI-33014 Tampere, Finland
Luca Cattelani Faculty of Medicine and Health Technology, Tampere University, FI-33014 Tampere, Finland; (A.S.); (M.F.); (L.C.); (A.F.); (P.A.S.K.) BioMediTech Institute, Tampere University, FI-33014 Tampere, Finland
Irene Liampa School of Chemical Engineering, National Technical University of Athens, 157 80 Athens, Greece; (I.L.); (H.S.)
Georgia Melagraki Nanoinformatics Department, NovaMechanics Ltd., Nicosia 1065, Cyprus; (G.M.); (A.A.)
Pekka Kohonen Institute of Environmental Medicine, Karolinska Institutet, 171 77 Stockholm, Sweden; (P.K.); (P.N.); (R.G.) Division of Toxicology, Misvik Biology, 20520 Turku, Finland
Penny Nymark Institute of Environmental Medicine, Karolinska Institutet, 171 77 Stockholm, Sweden; (P.K.); (P.N.); (R.G.) Division of Toxicology, Misvik Biology, 20520 Turku, Finland
Antonio Federico Faculty of Medicine and Health Technology, Tampere University, FI-33014 Tampere, Finland; (A.S.); (M.F.); (L.C.); (A.F.); (P.A.S.K.) BioMediTech Institute, Tampere University, FI-33014 Tampere, Finland
Pia Anneli Sofia Kinaret Faculty of Medicine and Health Technology, Tampere University, FI-33014 Tampere, Finland; (A.S.); (M.F.); (L.C.); (A.F.); (P.A.S.K.) BioMediTech Institute, Tampere University, FI-33014 Tampere, Finland Institute of Biotechnology, University of Helsinki, 00014 Helsinki, Finland
Karolina Jagiello QSAR Lab Ltd., Aleja Grunwaldzka 190/102, 80-266 Gdansk, Poland; (K.J.); (T.P.) University of Gdansk, Faculty of Chemistry, Wita Stwosza 63, 80-308 Gdansk, Poland
My Kieu Ha Center for Next Generation Cytometry, Hanyang University, Seoul 04763, Korea; (M.K.H.); (J.-S.C.); (T.-H.Y.) Department of Chemistry, College of Natural Sciences, Hanyang University, Seoul 04763, Korea Institute of Next Generation Material Design, Hanyang University, Seoul 04763, Korea
Jang-Sik Choi Center for Next Generation Cytometry, Hanyang University, Seoul 04763, Korea; (M.K.H.); (J.-S.C.); (T.-H.Y.) Department of Chemistry, College of Natural Sciences, Hanyang University, Seoul 04763, Korea Institute of Next Generation Material Design, Hanyang University, Seoul 04763, Korea
Natasha Sanabria National Institute for Occupational Health, Johannesburg 30333, South Africa; (N.S.); (M.G.)
Mary Gulumian National Institute for Occupational Health, Johannesburg 30333, South Africa; (N.S.); (M.G.) Haematology and Molecular Medicine Department, School of Pathology, University of the Witwatersrand, Johannesburg 2050, South Africa
Tomasz Puzyn QSAR Lab Ltd., Aleja Grunwaldzka 190/102, 80-266 Gdansk, Poland; (K.J.); (T.P.) University of Gdansk, Faculty of Chemistry, Wita Stwosza 63, 80-308 Gdansk, Poland
Tae-Hyun Yoon Center for Next Generation Cytometry, Hanyang University, Seoul 04763, Korea; (M.K.H.); (J.-S.C.); (T.-H.Y.) Department of Chemistry, College of Natural Sciences, Hanyang University, Seoul 04763, Korea Institute of Next Generation Material Design, Hanyang University, Seoul 04763, Korea
Haralambos Sarimveis School of Chemical Engineering, National Technical University of Athens, 157 80 Athens, Greece; (I.L.); (H.S.)
Roland Grafström Institute of Environmental Medicine, Karolinska Institutet, 171 77 Stockholm, Sweden; (P.K.); (P.N.); (R.G.) Division of Toxicology, Misvik Biology, 20520 Turku, Finland
Antreas Afantitis Nanoinformatics Department, NovaMechanics Ltd., Nicosia 1065, Cyprus; (G.M.); (A.A.)
Dario Greco Faculty of Medicine and Health Technology, Tampere University, FI-33014 Tampere, Finland; (A.S.); (M.F.); (L.C.); (A.F.); (P.A.S.K.) BioMediTech Institute, Tampere University, FI-33014 Tampere, Finland Institute of Biotechnology, University of Helsinki, 00014 Helsinki, Finland

Chiu YJ, Hsieh YH, Huang YH. Improved cell composition deconvolution method of bulk gene expression profiles to quantify subsets of immune cells. BMC Med Genomics 2019;12:169. [PMID: 31856824 PMCID: PMC6923925 DOI: 10.1186/s12920-019-0613-5] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2019] [Accepted: 10/31/2019] [Indexed: 01/07/2023] Open

Abstract

Background

To facilitate the investigation of the pathogenic roles played by various immune cells in complex tissues such as tumors, a few computational methods for deconvoluting bulk gene expression profiles to predict cell composition have been created. However, available methods were usually developed along with a set of reference gene expression profiles consisting of imbalanced replicates across different cell types. Therefore, the objective of this study was to create a new deconvolution method equipped with a new set of reference gene expression profiles that incorporate more microarray replicates of the immune cells that have been frequently implicated in the poor prognosis of cancers, such as T helper cells, regulatory T cells and macrophage M1/M2 cells.

Methods

Our deconvolution method was developed by choosing ε-support vector regression (ε-SVR) as the core algorithm assigned with a loss function subject to the L1-norm penalty. To construct the reference gene expression signature matrix for regression, a subset of differentially expressed genes were chosen from 148 microarray-based gene expression profiles for 9 types of immune cells by using ANOVA and minimizing condition number. Agreement analyses including mean absolute percentage errors and Bland-Altman plots were carried out to compare the performances of our method and CIBERSORT.

Results

In silico cell mixtures, simulated bulk tissues, and real human samples with known immune-cell fractions were used as the test datasets for benchmarking. Our method outperformed CIBERSORT in the benchmarks using in silico breast tissue-immune cell mixtures in the proportions of 30:70 and 50:50, and in the benchmark using 164 human PBMC samples. Our results suggest that the performance of our method was at least comparable to that of a state-of-the-art tool, CIBERSORT.

Conclusions

We developed a new cell composition deconvolution method and the implementation was entirely based on the publicly available R and Python packages. In addition, we compiled a new set of reference gene expression profiles, which might allow for a more robust prediction of the immune cell fractions from the expression profiles of cell mixtures. The source code of our method could be downloaded from https://github.com/holiday01/deconvolution-to-estimate-immune-cell-subsets.

Mihaylov I, Kańduła M, Krachunov M, Vassilev D. A novel framework for horizontal and vertical data integration in cancer studies with application to survival time prediction models. Biol Direct 2019;14:22. [PMID: 31752974 PMCID: PMC6868770 DOI: 10.1186/s13062-019-0249-6] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/23/2018] [Accepted: 09/20/2019] [Indexed: 12/17/2022] Open

Abstract

Background

Recently, high-throughput technologies have been used extensively alongside clinical tests to study various types of cancer. Data generated in such large-scale studies are heterogeneous, of different types and formats. Given the lack of effective integration strategies, novel models are necessary for efficient and operative data integration, where both clinical and molecular information can be effectively joined for storage, access and ease of use. Such models, combined with machine learning methods for accurate prediction of survival time in cancer studies, can yield novel insights into disease development and lead to precise personalized therapies.

Results

We developed an approach for intelligent data integration of two cancer datasets (breast cancer and neuroblastoma) provided in the CAMDA 2018 ‘Cancer Data Integration Challenge’ and compared models for prediction of survival time. We developed a novel semantic network-based data integration framework that utilizes NoSQL databases, where we combined clinical and expression profile data, using both raw data records and external knowledge sources. Utilizing the integrated data, we introduced the Tumor Integrated Clinical Feature (TICF), a new feature for accurate prediction of patient survival time. Finally, we applied and validated several machine learning models for survival time prediction.

Conclusion

We developed a framework for semantic integration of clinical and omics data that can borrow information across multiple cancer studies. By linking data with external domain knowledge sources our approach facilitates enrichment of the studied data by discovery of internal relations. The proposed and validated machine learning models for survival time prediction yielded accurate results.

Reviewers

This article was reviewed by Eran Elhaik, Wenzhong Xiao and Carlos Loucera.

Speiser JL, Miller ME, Tooze J, Ip E. A Comparison of Random Forest Variable Selection Methods for Classification Prediction Modeling. EXPERT SYSTEMS WITH APPLICATIONS 2019;134:93-101. [PMID: 32968335 PMCID: PMC7508310 DOI: 10.1016/j.eswa.2019.05.028] [Citation(s) in RCA: 228] [Impact Index Per Article: 45.6] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/17/2023]

Buzdin A, Sorokin M, Garazha A, Glusker A, Aleshin A, Poddubskaya E, Sekacheva M, Kim E, Gaifullin N, Giese A, Seryakov A, Rumiantsev P, Moshkovskii S, Moiseev A. RNA sequencing for research and diagnostics in clinical oncology. Semin Cancer Biol 2019;60:311-323. [PMID: 31412295 DOI: 10.1016/j.semcancer.2019.07.010] [Citation(s) in RCA: 32] [Impact Index Per Article: 6.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2019] [Accepted: 07/16/2019] [Indexed: 12/26/2022]

Acevedo A, Berthel A, DuBois D, Almon RR, Jusko WJ, Androulakis IP. Pathway-Based Analysis of the Liver Response to Intravenous Methylprednisolone Administration in Rats: Acute Versus Chronic Dosing. GENE REGULATION AND SYSTEMS BIOLOGY 2019;13:1177625019840282. [PMID: 31019365 PMCID: PMC6466473 DOI: 10.1177/1177625019840282] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/14/2019] [Accepted: 03/05/2019] [Indexed: 12/25/2022]

Abstract

Pharmacological time-series data, from comparative dosing studies, are critical to characterizing drug effects. Reconciling the data from multiple studies is inevitably difficult; multiple in vivo high-throughput -omics studies are necessary to capture the global and temporal effects of the drug, but these experiments, though analogous, differ in (microarray or other) platforms, time-scales, and dosing regimens and thus cannot be directly combined or compared. This investigation addresses this reconciliation issue with a meta-analysis technique aimed at assessing the intrinsic activity at the pathway level. The purpose of this is to characterize the dosing effects of methylprednisolone (MPL), a widely used anti-inflammatory and immunosuppressive corticosteroid (CS), within the liver. A multivariate decomposition approach is applied to analyze acute and chronic MPL dosing in male adrenalectomized rats and characterize the dosing-dependent differences in the dynamic response of MPL-responsive signaling and metabolic pathways. We demonstrate how to deconstruct signaling and metabolic pathways into their constituent pathway activities, activities which are scored for intrinsic pathway activity. Dosing-induced changes in the dynamics of pathway activities are compared using a model-based assessment of pathway dynamics, extending the principles of pharmacokinetics/pharmacodynamics (PKPD) to describe pathway activities. The model-based approach enabled us to hypothesize on the likely emergence (or disappearance) of indirect dosing-dependent regulatory interactions, pointing to likely mechanistic implications of dosing of MPL transcriptional regulation. Both acute and chronic MPL administration induced a strong core of activity within pathway families including the following: lipid metabolism, amino acid metabolism, carbohydrate metabolism, metabolism of cofactors and vitamins, regulation of essential organelles, and xenobiotic metabolism pathway families. 
Pathway activities alter between acute and chronic dosing, indicating that MPL response is dosing dependent. Furthermore, because multiple pathway activities are dominant within a single pathway, we observe that pathways cannot be defined by a single response. Instead, pathways are defined by multiple, complex, and temporally related activities corresponding to different subgroups of genes within each pathway.

Zhou XH, Chu XY, Xue G, Xiong JH, Zhang HY. Identifying cancer prognostic modules by module network analysis. BMC Bioinformatics 2019;20:85. [PMID: 30777030 PMCID: PMC6380061 DOI: 10.1186/s12859-019-2674-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/09/2017] [Accepted: 02/08/2019] [Indexed: 02/08/2023] Open

Abstract

Background

The identification of prognostic genes that can distinguish the prognostic risks of cancer patients remains a significant challenge. Previous works have proven that functional gene sets were more reliable for this task than the gene signature. However, few works have considered the cross-talk among functional gene sets, which may result in neglecting important prognostic gene sets for cancer.

Results

Here, we proposed a new method that considers both the interactions among modules and the prognostic correlation of the modules to identify prognostic modules in cancers. First, dense sub-networks in the gene co-expression network of cancer patients were detected. Second, cross-talk between every two modules was identified by a permutation test, thus generating the module network. Third, the prognostic correlation of each module was evaluated by the resampling method. Then, the GeneRank algorithm, which takes the module network and the prognostic correlations of all the modules as input, was applied to prioritize the prognostic modules. Finally, the selected modules were validated by survival analysis in various data sets. Our method was applied in three kinds of cancers, and the results show that our method succeeded in identifying prognostic modules in all the three cancers. In addition, our method outperformed state-of-the-art methods. Furthermore, the selected modules were significantly enriched with known cancer-related genes and drug targets of cancer, which may indicate that the genes involved in the modules may be drug targets for therapy.

Conclusions

We proposed a useful method to identify key modules in cancer prognosis and our prognostic genes may be good candidates for drug targets.

Electronic supplementary material

The online version of this article (10.1186/s12859-019-2674-z) contains supplementary material, which is available to authorized users.

Borisov N, Shabalina I, Tkachev V, Sorokin M, Garazha A, Pulin A, Eremin II, Buzdin A. Shambhala: a platform-agnostic data harmonizer for gene expression data. BMC Bioinformatics 2019;20:66. [PMID: 30727942 PMCID: PMC6366102 DOI: 10.1186/s12859-019-2641-8] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/14/2018] [Accepted: 01/18/2019] [Indexed: 11/10/2022] Open

Darst BF, Malecki KC, Engelman CD. Using recursive feature elimination in random forest to account for correlated variables in high dimensional data. BMC Genet 2018;19:65. [PMID: 30255764 PMCID: PMC6157185 DOI: 10.1186/s12863-018-0633-8] [Citation(s) in RCA: 121] [Impact Index Per Article: 20.2] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open

Yan J, Kaur J. Feature Selection for Website Fingerprinting. PROCEEDINGS ON PRIVACY ENHANCING TECHNOLOGIES 2018. [DOI: 10.1515/popets-2018-0039] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]

Brieuc MSO, Waters CD, Drinan DP, Naish KA. A practical introduction to Random Forest for genetic association studies in ecology and evolution. Mol Ecol Resour 2018;18:755-766. [PMID: 29504715 DOI: 10.1111/1755-0998.12773] [Citation(s) in RCA: 59] [Impact Index Per Article: 9.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/30/2017] [Revised: 02/08/2018] [Accepted: 02/17/2018] [Indexed: 12/25/2022]

Qiu X, Zhang L, Nagaratnam Suganthan P, Amaratunga GA. Oblique random forest ensemble via Least Square Estimation for time series forecasting. Inf Sci (N Y) 2017. [DOI: 10.1016/j.ins.2017.08.060] [Citation(s) in RCA: 36] [Impact Index Per Article: 5.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/19/2022]

Chen B, Gao S, Ji C, Song G. Integrated analysis reveals candidate genes and transcription factors in lung adenocarcinoma. Mol Med Rep 2017;16:8371-8379. [PMID: 28983631 DOI: 10.3892/mmr.2017.7656] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/29/2016] [Accepted: 02/23/2017] [Indexed: 11/06/2022] Open

Abstract

Lung adenocarcinoma is the most common type of non‑small cell lung cancer in Asia. Therefore, it is important to improve understanding of the underlying transcriptional regulatory mechanisms involved. The present study aimed to identify potential candidate genes and transcription factors (TFs) associated with the disease. Four gene expression profiles were downloaded from the Gene Expression Omnibus database, which included 141 lung adenocarcinoma patients and 191 healthy controls. The differentially expressed genes (DEGs) were screened out and functional annotation was performed. In addition, TFs were identified and a global transcriptional regulatory network was constructed. Integrated analysis gave rise to a total of 1,238 DEGs in lung adenocarcinoma when compared with healthy tissues, including 970 upregulated and 268 downregulated DEGs. The six overexpressed outlier genes of ceruloplasmin, heparan sulfate 6‑O‑sulfotransferase 2, transmembrane protease serine 4, anillin actin binding protein, cellular retinoic acid binding protein 2 and cystatin SN may serve important roles in the development of lung adenocarcinoma. In addition, the downregulation of carbonic anhydrase 4 and S100 calcium binding protein A12 may render these effective diagnostic biomarkers. The results of the transcriptional regulatory network demonstrated that the hub nodes were sex determining region Y‑box 10, Spi‑B transcription factor and nuclear receptor subfamily 4 group A member 2. The four TFs, forkhead box D1, E74‑like ETS transcription factor 5, homeobox A5 and kruppel‑like factor 5, may warrant future investigations into their function in disease development. In conclusion, the present study provided for further studies a list of candidate genes and TFs for the detection and treatment of lung adenocarcinoma.

Classification and Biomarker Genes Selection for Cancer Gene Expression Data Using Random Forest. IRANIAN JOURNAL OF PATHOLOGY 2017;12:339-347. [PMID: 29563929 PMCID: PMC5844678] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/08/2016] [Accepted: 05/13/2017] [Indexed: 12/02/2022]

Zhao J, Bodner G, Rewald B. Phenotyping: Using Machine Learning for Improved Pairwise Genotype Classification Based on Root Traits. FRONTIERS IN PLANT SCIENCE 2016;7:1864. [PMID: 27999587 PMCID: PMC5138212 DOI: 10.3389/fpls.2016.01864] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/25/2016] [Accepted: 11/25/2016] [Indexed: 05/29/2023]

Dietrich S, Floegel A, Troll M, Kühn T, Rathmann W, Peters A, Sookthai D, von Bergen M, Kaaks R, Adamski J, Prehn C, Boeing H, Schulze MB, Illig T, Pischon T, Knüppel S, Wang-Sattler R, Drogan D. Random Survival Forest in practice: a method for modelling complex metabolomics data in time to event analysis. Int J Epidemiol 2016;45:1406-1420. [PMID: 27591264 DOI: 10.1093/ije/dyw145] [Citation(s) in RCA: 54] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 05/20/2016] [Indexed: 11/14/2022] Open

Affiliation(s)

Stefan Dietrich Department of Epidemiology, German Institute of Human Nutrition, Nuthetal, Germany
Anna Floegel Department of Epidemiology, German Institute of Human Nutrition, Nuthetal, Germany
Martina Troll Research Unit of Molecular Epidemiology; Institute of Epidemiology II, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
Tilman Kühn Division of Cancer Epidemiology, German Cancer Research Center (DKFZ), Heidelberg, Germany
Wolfgang Rathmann Institute for Biometrics and Epidemiology, Leibniz Center for Diabetes Research at Heinrich Heine University, Germany; German Center for Diabetes Research (DZD), München-Neuherberg, Germany
Anette Peters Institute of Epidemiology II, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany; German Center for Diabetes Research (DZD), München-Neuherberg, Germany; Department of Environmental Health, Harvard School of Public Health, Boston, MA, USA
Disorn Sookthai Division of Cancer Epidemiology, German Cancer Research Center (DKFZ), Heidelberg, Germany
Martin von Bergen Department of Molecular Systems Biology, Helmholtz Centre for Environmental Research (UFZ), Institute of Biochemistry, Faculty of Biosciences, Pharmacy and Psychology, University of Leipzig, Leipzig, Germany and Department of Chemistry and Bioscience, University of Aalborg, Aalborg East, Denmark
Rudolf Kaaks Division of Cancer Epidemiology, German Cancer Research Center (DKFZ), Heidelberg, Germany
Jerzy Adamski German Center for Diabetes Research (DZD), München-Neuherberg, Germany; Institute of Experimental Genetics, Genome Analysis Center, Helmholtz Zentrum München, German Research Center for Environmental Health, München-Neuherberg, Germany; Lehrstuhl für Experimentelle Genetik, Technische Universität München, Freising-Weihenstephan, Germany
Cornelia Prehn Institute of Experimental Genetics, Genome Analysis Center, Helmholtz Zentrum München, German Research Center for Environmental Health, München-Neuherberg, Germany
Heiner Boeing Department of Epidemiology, German Institute of Human Nutrition, Nuthetal, Germany
Matthias B Schulze German Center for Diabetes Research (DZD), München-Neuherberg, Germany; Department of Molecular Epidemiology, German Institute of Human Nutrition, Nuthetal, Germany
Thomas Illig Research Unit of Molecular Epidemiology; Hannover Unified Biobank, and Institute for Human Genetics, Hannover, Germany
Tobias Pischon Department of Epidemiology, German Institute of Human Nutrition, Nuthetal, Germany; Molecular Epidemiology Group, Max Delbrück Center for Molecular Medicine (MDC) Berlin-Buch, Berlin, Germany
Sven Knüppel Department of Epidemiology, German Institute of Human Nutrition, Nuthetal, Germany
Rui Wang-Sattler Research Unit of Molecular Epidemiology; Institute of Epidemiology II, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany; German Center for Diabetes Research (DZD), München-Neuherberg, Germany
Dagmar Drogan Department of Epidemiology, German Institute of Human Nutrition, Nuthetal, Germany

Ma C, Sastry KS, Flore M, Gehani S, Al-Bozom I, Feng Y, Serpedin E, Chouchane L, Chen Y, Huang Y. CrossLink: a novel method for cross-condition classification of cancer subtypes. BMC Genomics 2016;17 Suppl 7:549. [PMID: 27556419 PMCID: PMC5001207 DOI: 10.1186/s12864-016-2903-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/10/2022]

Mapiye DS, Christoffels AG, Gamieldien J. Identification of phenotype-relevant differentially expressed genes in breast cancer demonstrates enhanced quantile discretization protocol's utility in multi-platform microarray data integration. J Bioinform Comput Biol 2016;14:1650022. [PMID: 27411306 DOI: 10.1142/s0219720016500220] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/18/2022]