1. Borisov V, Leemann T, Seßler K, Haug J, Pawelczyk M, Kasneci G. Deep Neural Networks and Tabular Data: A Survey. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:7499-7519. [PMID: 37015381 DOI: 10.1109/tnnls.2022.3229161] [Citation(s) in RCA: 21] [Impact Index Per Article: 21.0] [Indexed: 06/19/2023]
Abstract
Heterogeneous tabular data are the most commonly used form of data and are essential for numerous critical and computationally demanding applications. On homogeneous datasets, deep neural networks have repeatedly shown excellent performance and have therefore been widely adopted. However, their adaptation to tabular data for inference or data generation tasks remains highly challenging. To facilitate further progress in the field, this work provides an overview of state-of-the-art deep learning methods for tabular data. We categorize these methods into three groups: data transformations, specialized architectures, and regularization models. For each of these groups, our work offers a comprehensive overview of the main approaches. Moreover, we discuss deep learning approaches for generating tabular data and also provide an overview of strategies for explaining deep models on tabular data. Thus, our first contribution is to address the main research streams and existing methodologies in the mentioned areas while highlighting relevant challenges and open research questions. Our second contribution is to provide an empirical comparison of traditional machine learning methods with 11 deep learning approaches across five popular real-world tabular datasets of different sizes and with different learning objectives. Our results, which we have made publicly available as competitive benchmarks, indicate that algorithms based on gradient-boosted tree ensembles still mostly outperform deep learning models on supervised learning tasks, suggesting that the research progress on competitive deep learning models for tabular data is stagnating. To the best of our knowledge, this is the first in-depth overview of deep learning approaches for tabular data; as such, this work can serve as a valuable starting point to guide researchers and practitioners interested in deep learning with tabular data.
2. Jia S, Lysenko A, Boroevich KA, Sharma A, Tsunoda T. scDeepInsight: a supervised cell-type identification method for scRNA-seq data with deep learning. Brief Bioinform 2023; 24:bbad266. [PMID: 37523217 PMCID: PMC10516353 DOI: 10.1093/bib/bbad266] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 03/09/2023] [Revised: 06/12/2023] [Accepted: 07/04/2023] [Indexed: 08/01/2023]
Abstract
Annotation of cell-types is a critical step in the analysis of single-cell RNA sequencing (scRNA-seq) data that allows the study of heterogeneity across multiple cell populations. Currently, this is most commonly done using unsupervised clustering algorithms, which project single-cell expression data into a lower dimensional space and then cluster cells based on their distances from each other. However, as these methods do not use reference datasets, they can only achieve a rough classification of cell-types, and it is difficult to improve the recognition accuracy further. To effectively solve this issue, we propose a novel supervised annotation method, scDeepInsight. The scDeepInsight method is capable of performing manifold assignments. It is competent in executing data integration through batch normalization, performing supervised training on the reference dataset, doing outlier detection and annotating cell-types on query datasets. Moreover, it can help identify active genes or marker genes related to cell-types. The training of the scDeepInsight model is performed in a unique way. Tabular scRNA-seq data are first converted to corresponding images through the DeepInsight methodology. DeepInsight can create a trainable image transformer to convert non-image RNA data to images by comprehensively comparing interrelationships among multiple genes. Subsequently, the converted images are fed into convolutional neural networks such as EfficientNet-b3. This enables automatic feature extraction to identify the cell-types of scRNA-seq samples. We benchmarked scDeepInsight with six other mainstream cell annotation methods. The average accuracy rate of scDeepInsight reached 87.5%, which is more than 7% higher compared with the state-of-the-art methods.
Affiliation(s)
- Shangru Jia
  - Laboratory for Medical Science Mathematics, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Japan
- Artem Lysenko
  - Laboratory for Medical Science Mathematics, Department of Biological Sciences, School of Science, The University of Tokyo, Japan
  - Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Japan
- Keith A Boroevich
  - Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Japan
- Alok Sharma
  - Laboratory for Medical Science Mathematics, Department of Biological Sciences, School of Science, The University of Tokyo, Japan
  - Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Japan
  - Institute for Integrated and Intelligent Systems, Griffith University, Australia
- Tatsuhiko Tsunoda
  - Laboratory for Medical Science Mathematics, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Japan
  - Laboratory for Medical Science Mathematics, Department of Biological Sciences, School of Science, The University of Tokyo, Japan
  - Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Japan
3. Dorn M, Grisci BI, Narloch PH, Feltes BC, Avila E, Kahmann A, Alho CS. Comparison of machine learning techniques to handle imbalanced COVID-19 CBC datasets. PeerJ Comput Sci 2021; 7:e670. [PMID: 34458574 PMCID: PMC8372002 DOI: 10.7717/peerj-cs.670] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/22/2021] [Accepted: 07/20/2021] [Indexed: 06/13/2023]
Abstract
The coronavirus pandemic caused by the novel SARS-CoV-2 has significantly impacted human health and the economy, especially in countries that struggle to fund medical testing and treatment, such as Brazil, the third most affected country in the pandemic. In this scenario, machine learning techniques have been heavily employed to analyze different types of medical data and aid decision making, offering a low-cost alternative. Given the urgency of fighting the pandemic, a large number of studies apply machine learning approaches to clinical data, including complete blood count (CBC) tests, which are among the most widely available medical tests. In this work, we review the most commonly employed machine learning classifiers for CBC data, together with popular sampling methods for dealing with class imbalance. Additionally, we describe and critically analyze three publicly available Brazilian COVID-19 CBC datasets and evaluate the performance of eight classifiers and five sampling techniques on the selected datasets. Our work provides a panorama of which classifiers and sampling methods give the best results for different relevant metrics and discusses their impact on future analyses. The metrics and algorithms are introduced in a way that aids newcomers to the field. Finally, the panorama discussed here can significantly benefit the comparison of the results of new ML algorithms.
Affiliation(s)
- Marcio Dorn
  - Institute of Informatics, Federal University of Rio Grande do Sul, Porto Alegre, RS, Brazil
  - Center of Biotechnology, Federal University of Rio Grande do Sul, Porto Alegre, RS, Brazil
  - Forensic Science, National Institute of Science and Technology, Porto Alegre, RS, Brazil
- Bruno Iochins Grisci
  - Institute of Informatics, Federal University of Rio Grande do Sul, Porto Alegre, RS, Brazil
- Pedro Henrique Narloch
  - Institute of Informatics, Federal University of Rio Grande do Sul, Porto Alegre, RS, Brazil
- Bruno César Feltes
  - Institute of Informatics, Federal University of Rio Grande do Sul, Porto Alegre, RS, Brazil
  - Department of Genetics, Federal University of Rio Grande do Sul, Porto Alegre, RS, Brazil
- Eduardo Avila
  - Forensic Science, National Institute of Science and Technology, Porto Alegre, RS, Brazil
  - School of Health and Life Sciences, Pontifical Catholic University of Rio Grande do Sul, Porto Alegre, RS, Brazil
- Alessandro Kahmann
  - Institute of Mathematics, Statistics and Physics, Federal University of Rio Grande, Rio Grande, RS, Brazil
- Clarice Sampaio Alho
  - Forensic Science, National Institute of Science and Technology, Porto Alegre, RS, Brazil
  - School of Health and Life Sciences, Pontifical Catholic University of Rio Grande do Sul, Porto Alegre, RS, Brazil