1. Guo R, Wei J, Sun L, Yu B, Chang G, Liu D, Zhang S, Yao Z, Xu M, Bu L. A survey on advancements in image-text multimodal models: From general techniques to biomedical implementations. Comput Biol Med 2024; 178:108709. [PMID: 38878398] [DOI: 10.1016/j.compbiomed.2024.108709]
Abstract
With the significant advancements of Large Language Models (LLMs) in Natural Language Processing (NLP), the development of image-text multimodal models has garnered widespread attention. Current surveys on image-text multimodal models mainly focus on representative models or application domains, but lack a review of how general-purpose technical models influence the development of domain-specific models, which is crucial for domain researchers. To address this gap, this paper first reviews the technological evolution of image-text multimodal models, from early explorations of feature space to visual-language encoding structures, and then to the latest large model architectures. Next, from the perspective of technological evolution, we explain how the development of general image-text multimodal technologies promotes the progress of multimodal technologies in the biomedical field, as well as the importance and complexity of domain-specific datasets in biomedicine. Then, centered on the tasks of image-text multimodal models, we analyze their common components and challenges. After that, we summarize the architecture, components, and data of general image-text multimodal models, and introduce the applications and improvements of image-text multimodal models in the biomedical field. Finally, we categorize the challenges faced in the development and application of general models into external and intrinsic factors, further refining them into two external factors and five intrinsic factors, and propose targeted solutions, providing guidance for future research directions. For more details and data, please visit our GitHub page: https://github.com/i2vec/A-survey-on-image-text-multimodal-models.
Affiliation(s)
- Ruifeng Guo, Jingxuan Wei, Linzhuang Sun, Bihui Yu, Guiyong Chang, Dawei Liu, Sibo Zhang, Zhengbing Yao, Mingjun Xu, Liping Bu (all authors): Shenyang Institute of Computing Technology, Chinese Academy of Sciences, Shenyang, 110168, China; University of Chinese Academy of Sciences, Beijing, 100049, China.

2. Yamada Y, Colan J, Davila A, Hasegawa Y. Multimodal semi-supervised learning for online recognition of multi-granularity surgical workflows. Int J Comput Assist Radiol Surg 2024; 19:1075-1083. [PMID: 38558289] [PMCID: PMC11178653] [DOI: 10.1007/s11548-024-03101-6]
Abstract
Purpose: Surgical workflow recognition is a challenging task that requires understanding multiple aspects of surgery, such as gestures, phases, and steps. However, most existing methods focus on single-task or single-modal models and rely on costly annotations for training. To address these limitations, we propose a novel semi-supervised learning approach that leverages multimodal data and self-supervision to create meaningful representations for various surgical tasks.
Methods: Our representation learning approach proceeds in two stages. In the first stage, time contrastive learning is used to learn spatiotemporal visual features from video data, without any labels. In the second stage, a multimodal variational autoencoder (VAE) fuses the visual features with kinematic data to obtain a shared representation, which is fed into recurrent neural networks for online recognition.
Results: Our method was evaluated on two datasets, JIGSAWS and MISAW, and achieved comparable or better performance in multi-granularity workflow recognition than fully supervised models specialized for each task. On the JIGSAWS Suturing dataset, it reaches a gesture recognition accuracy of 83.3%. The model is also more efficient in annotation usage, maintaining high performance with only half of the labels. On the MISAW dataset, it achieves 84.0% AD-Accuracy in phase recognition and 56.8% AD-Accuracy in step recognition.
Conclusion: Our multimodal representation exhibits versatility across various surgical tasks and enhances annotation efficiency. This work has significant implications for real-time decision-making systems within the operating room.
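To make the two-stage design concrete, the sketch below shows one plausible shape of the second stage only: per-frame visual and kinematic features are fused by a product-of-experts multimodal VAE into a shared latent, which a GRU consumes causally for per-frame recognition. All dimensions, the fusion rule, and the layer sizes are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: a simplified multimodal VAE whose shared latent feeds a GRU
# for per-frame surgical workflow recognition. Sizes and fusion rule are assumptions.
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    def __init__(self, in_dim, z_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, z_dim)
        self.logvar = nn.Linear(128, z_dim)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

def product_of_experts(mus, logvars):
    # Fuse per-modality Gaussians together with a standard-normal prior expert.
    precisions = [torch.ones_like(mus[0])] + [torch.exp(-lv) for lv in logvars]
    weighted = [torch.zeros_like(mus[0])] + [m * torch.exp(-lv) for m, lv in zip(mus, logvars)]
    prec = torch.stack(precisions).sum(0)
    mu = torch.stack(weighted).sum(0) / prec
    return mu, -torch.log(prec)

class OnlineRecognizer(nn.Module):
    def __init__(self, vis_dim=256, kin_dim=38, z_dim=32, n_classes=10):
        super().__init__()
        self.enc_v = ModalityEncoder(vis_dim, z_dim)
        self.enc_k = ModalityEncoder(kin_dim, z_dim)
        self.gru = nn.GRU(z_dim, 64, batch_first=True)
        self.head = nn.Linear(64, n_classes)

    def forward(self, vis_seq, kin_seq):
        # vis_seq: (B, T, vis_dim) visual features, kin_seq: (B, T, kin_dim) kinematics
        mu_v, lv_v = self.enc_v(vis_seq)
        mu_k, lv_k = self.enc_k(kin_seq)
        mu, lv = product_of_experts([mu_v, mu_k], [lv_v, lv_k])
        z = mu + torch.exp(0.5 * lv) * torch.randn_like(mu)   # reparameterization trick
        h, _ = self.gru(z)                                     # causal over time, online-friendly
        return self.head(h)                                    # per-frame class logits

logits = OnlineRecognizer()(torch.randn(2, 50, 256), torch.randn(2, 50, 38))
print(logits.shape)  # torch.Size([2, 50, 10])
```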
Affiliation(s)
- Yutaro Yamada: Department of Micro-Nano Mechanical Science and Engineering, Nagoya University, Furo-cho, Chikusa-ku, Nagoya, Aichi, 464-8603, Japan.
- Jacinto Colan: Department of Micro-Nano Mechanical Science and Engineering, Nagoya University, Furo-cho, Chikusa-ku, Nagoya, Aichi, 464-8603, Japan.
- Ana Davila: Institutes of Innovation for Future Society, Nagoya University, Furo-cho, Chikusa-ku, Nagoya, Aichi, 464-8601, Japan.
- Yasuhisa Hasegawa: Department of Micro-Nano Mechanical Science and Engineering, Nagoya University, Furo-cho, Chikusa-ku, Nagoya, Aichi, 464-8603, Japan.

3. Ikegawa Y, Fukuma R, Sugano H, Oshino S, Tani N, Tamura K, Iimura Y, Suzuki H, Yamamoto S, Fujita Y, Nishimoto S, Kishima H, Yanagisawa T. Text and image generation from intracranial electroencephalography using an embedding space for text and images. J Neural Eng 2024; 21:036019. [PMID: 38648781] [DOI: 10.1088/1741-2552/ad417a]
Abstract
Objective: Invasive brain-computer interfaces (BCIs) are promising communication devices for severely paralyzed patients. Recent advances in intracranial electroencephalography (iEEG) coupled with natural language processing have enhanced communication speed and accuracy. It should be noted that such a speech BCI uses signals from the motor cortex. However, BCIs based on motor cortical activities may experience signal deterioration in users with motor cortical degenerative diseases such as amyotrophic lateral sclerosis. An alternative approach to using iEEG of the motor cortex is necessary to support patients with such conditions.
Approach: In this study, a multimodal embedding of text and images was used to decode visual semantic information from iEEG signals of the visual cortex to generate text and images. We used contrastive language-image pretraining (CLIP) embedding to represent images presented to 17 patients implanted with electrodes in the occipital and temporal cortices. A CLIP image vector was inferred from the high-γ power of the iEEG signals recorded while viewing the images.
Main results: Text was generated by CLIPCAP from the inferred CLIP vector with better-than-chance accuracy. Then, an image was created from the generated text using Stable Diffusion with significant accuracy.
Significance: The text and images generated from iEEG through the CLIP embedding vector can be used for improved communication.
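As a toy illustration of the decoding step only (regressing a CLIP-style image vector from high-γ features and scoring it against candidate images), the sketch below uses plain ridge regression on synthetic data; the study's actual decoder, the CLIPCAP captioning stage, and the Stable Diffusion generation stage are not reproduced, and all dimensions are assumptions.

```python
# Illustrative sketch only: ridge regression from iEEG high-gamma features to CLIP-style
# image embeddings, evaluated by cosine-similarity identification on held-out trials.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_trials, n_features, clip_dim = 200, 120, 512             # assumed sizes, not the study's

X = rng.standard_normal((n_trials, n_features))             # per-trial high-gamma power features
Y = rng.standard_normal((n_trials, clip_dim))                # CLIP embeddings of the viewed images

model = Ridge(alpha=10.0).fit(X[:150], Y[:150])               # fit on the first 150 trials
Y_hat = model.predict(X[150:])                                # inferred CLIP vectors, held-out trials

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# Identification: does each predicted vector match its own image better than the others?
hits = sum(int(np.argmax([cosine(pred, y) for y in Y[150:]]) == i)
           for i, pred in enumerate(Y_hat))
print("top-1 identification accuracy:", hits / len(Y_hat))
```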
Affiliation(s)
- Yuya Ikegawa: Institute for Advanced Co-Creation Studies, Osaka University, Suita, Japan.
- Ryohei Fukuma: Institute for Advanced Co-Creation Studies, Osaka University, Suita, Japan; Department of Neurosurgery, Graduate School of Medicine, Osaka University, Suita, Japan.
- Hidenori Sugano: Department of Neurosurgery, Juntendo University, Tokyo, Japan.
- Satoru Oshino: Department of Neurosurgery, Graduate School of Medicine, Osaka University, Suita, Japan.
- Naoki Tani: Department of Neurosurgery, Graduate School of Medicine, Osaka University, Suita, Japan.
- Kentaro Tamura: Department of Neurosurgery, Nara Medical University, Kashihara, Japan.
- Yasushi Iimura: Department of Neurosurgery, Juntendo University, Tokyo, Japan.
- Hiroharu Suzuki: Department of Neurosurgery, Juntendo University, Tokyo, Japan.
- Shota Yamamoto: Department of Neurosurgery, Graduate School of Medicine, Osaka University, Suita, Japan.
- Yuya Fujita: Department of Neurosurgery, Graduate School of Medicine, Osaka University, Suita, Japan.
- Shinji Nishimoto: National Institute of Information and Communications Technology (NICT), Center for Information and Neural Networks (CiNet), Suita, Japan; Graduate School of Frontier Biosciences, Osaka University, Suita, Japan.
- Haruhiko Kishima: Department of Neurosurgery, Graduate School of Medicine, Osaka University, Suita, Japan.
- Takufumi Yanagisawa: Institute for Advanced Co-Creation Studies, Osaka University, Suita, Japan; Department of Neurosurgery, Graduate School of Medicine, Osaka University, Suita, Japan.

4. Fang Y, Zhang X, Xu W, Liu G, Zhao J. Bidirectional visual-tactile cross-modal generation using latent feature space flow model. Neural Netw 2024; 172:106088. [PMID: 38159510] [DOI: 10.1016/j.neunet.2023.12.042]
Abstract
Inspired by the bidirectional visual-tactile cross-modal mapping of the human brain, this paper introduces a novel approach to bidirectional mapping between visual and tactile data, an area not fully explored by existing, predominantly unidirectional studies. First, we adopt separate Variational AutoEncoder (VAE) models for visual and tactile data. We then introduce a conditional flow model built on the VAE latent feature space, enabling cross-modal bidirectional mapping between visual and tactile data with a single model. The experimental results show that our method achieves excellent performance in terms of the similarity between the generated and original data (Structural Similarity Index (SSIM) of 0.58 for visual data and 0.80 for tactile data), the classification accuracy on generated data (91.60% for visual data, 88.05% for tactile data), and the zero-shot classification accuracy between generated data and language (44.49% for visual data, 45.03% for tactile data). To the best of our knowledge, the proposed method is the first to use a single model to achieve bidirectional mapping between visual and tactile data. Our model and code will be made public after acceptance of the paper.
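A minimal sketch of the latent-space idea, assuming the two VAEs are already trained and share a latent dimensionality: an invertible affine-coupling flow maps a visual latent to a tactile latent in one direction and back exactly in the other, so a single model serves both directions. Latent size, block count, and the absence of any conditioning input are simplifying assumptions, not the paper's configuration.

```python
# Illustrative sketch only: an invertible affine-coupling flow between the latent spaces
# of two pretrained VAEs, so one model covers both mapping directions.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim=32, hidden=64):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(nn.Linear(self.half, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * self.half))

    def forward(self, z, reverse=False):
        a, b = z[:, :self.half], z[:, self.half:]
        log_s, t = self.net(a).chunk(2, dim=1)
        log_s = torch.tanh(log_s)                        # bounded scales for numerical stability
        b = (b - t) * torch.exp(-log_s) if reverse else b * torch.exp(log_s) + t
        return torch.cat([a, b], dim=1)

class LatentFlow(nn.Module):
    def __init__(self, dim=32, n_blocks=4):
        super().__init__()
        self.blocks = nn.ModuleList(AffineCoupling(dim) for _ in range(n_blocks))
        self.perms = [torch.randperm(dim) for _ in range(n_blocks)]

    def forward(self, z, reverse=False):
        pairs = list(zip(self.blocks, self.perms))
        for block, p in (reversed(pairs) if reverse else pairs):
            if reverse:
                z = block(z, reverse=True)[:, torch.argsort(p)]  # invert coupling, undo permutation
            else:
                z = block(z[:, p])                               # permute, then couple
        return z

flow = LatentFlow()
z_visual = torch.randn(8, 32)                     # latent codes from a (hypothetical) visual VAE
z_tactile_hat = flow(z_visual)                    # visual -> tactile direction
z_visual_rec = flow(z_tactile_hat, reverse=True)  # tactile -> visual direction recovers the input
print((z_visual - z_visual_rec).abs().max())      # ~0: the same model runs in both directions
```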
Affiliation(s)
- Yu Fang: State Key Laboratory of Robotics and System, Harbin Institute of Technology, No. 2, Yikuang Street, Nangang District, Harbin, 150001, Heilongjiang, China.
- Xuehe Zhang: State Key Laboratory of Robotics and System, Harbin Institute of Technology, No. 2, Yikuang Street, Nangang District, Harbin, 150001, Heilongjiang, China.
- Wenqiang Xu: Department of Computer Science and Engineering, Shanghai Jiao Tong University, No. 800 Dongchuan Road, Minhang District, Shanghai, 200240, China.
- Gangfeng Liu: State Key Laboratory of Robotics and System, Harbin Institute of Technology, No. 2, Yikuang Street, Nangang District, Harbin, 150001, Heilongjiang, China.
- Jie Zhao: State Key Laboratory of Robotics and System, Harbin Institute of Technology, No. 2, Yikuang Street, Nangang District, Harbin, 150001, Heilongjiang, China.

5. Sadok S, Leglaive S, Girin L, Alameda-Pineda X, Séguier R. A multimodal dynamical variational autoencoder for audiovisual speech representation learning. Neural Netw 2024; 172:106120. [PMID: 38266474] [DOI: 10.1016/j.neunet.2024.106120]
Abstract
High-dimensional data such as natural images or speech signals exhibit some form of regularity, preventing their dimensions from varying independently. This suggests that there exists a lower dimensional latent representation from which the high-dimensional observed data were generated. Uncovering the hidden explanatory features of complex data is the goal of representation learning, and deep latent variable generative models have emerged as promising unsupervised approaches. In particular, the variational autoencoder (VAE) which is equipped with both a generative and an inference model allows for the analysis, transformation, and generation of various types of data. Over the past few years, the VAE has been extended to deal with data that are either multimodal or dynamical (i.e., sequential). In this paper, we present a multimodal and dynamical VAE (MDVAE) applied to unsupervised audiovisual speech representation learning. The latent space is structured to dissociate the latent dynamical factors that are shared between the modalities from those that are specific to each modality. A static latent variable is also introduced to encode the information that is constant over time within an audiovisual speech sequence. The model is trained in an unsupervised manner on an audiovisual emotional speech dataset, in two stages. In the first stage, a vector quantized VAE (VQ-VAE) is learned independently for each modality, without temporal modeling. The second stage consists in learning the MDVAE model on the intermediate representation of the VQ-VAEs before quantization. The disentanglement between static versus dynamical and modality-specific versus modality-common information occurs during this second training stage. Extensive experiments are conducted to investigate how audiovisual speech latent factors are encoded in the latent space of MDVAE. These experiments include manipulating audiovisual speech, audiovisual facial image denoising, and audiovisual speech emotion recognition. The results show that MDVAE effectively combines the audio and visual information in its latent space. They also show that the learned static representation of audiovisual speech can be used for emotion recognition with few labeled data, and with better accuracy compared with unimodal baselines and a state-of-the-art supervised model based on an audiovisual transformer architecture.
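For orientation, the sketch below mimics only the latent structure described above: a static code w computed once per sequence, a dynamical code z_av shared by both modalities, and modality-specific dynamical codes z_a and z_v, each feeding the corresponding decoder. The toy layers operate on precomputed per-frame features standing in for the pre-quantization VQ-VAE representations, the stochastic sampling and KL terms of the actual VAE objective are omitted, and all sizes are assumptions rather than the MDVAE configuration.

```python
# Illustrative sketch only: the static/shared/modality-specific latent split described in
# the abstract, over precomputed per-frame audio and visual features.
import torch
import torch.nn as nn

class MDVAESketch(nn.Module):
    def __init__(self, a_dim=64, v_dim=64, w_dim=16, zav_dim=8, zm_dim=8):
        super().__init__()
        self.enc_w = nn.GRU(a_dim + v_dim, w_dim, batch_first=True)   # static: one code per sequence
        self.enc_zav = nn.Linear(a_dim + v_dim, zav_dim)               # shared dynamical latent
        self.enc_za = nn.Linear(a_dim, zm_dim)                         # audio-specific dynamical latent
        self.enc_zv = nn.Linear(v_dim, zm_dim)                         # visual-specific dynamical latent
        self.dec_a = nn.Linear(w_dim + zav_dim + zm_dim, a_dim)
        self.dec_v = nn.Linear(w_dim + zav_dim + zm_dim, v_dim)

    def forward(self, a, v):
        # a: (B, T, a_dim) audio features, v: (B, T, v_dim) visual features
        x = torch.cat([a, v], dim=-1)
        _, w_last = self.enc_w(x)                                      # (1, B, w_dim)
        w = w_last[0].unsqueeze(1).expand(-1, a.size(1), -1)           # broadcast static code over time
        z_av, z_a, z_v = self.enc_zav(x), self.enc_za(a), self.enc_zv(v)
        a_rec = self.dec_a(torch.cat([w, z_av, z_a], dim=-1))
        v_rec = self.dec_v(torch.cat([w, z_av, z_v], dim=-1))
        return a_rec, v_rec

a_rec, v_rec = MDVAESketch()(torch.randn(2, 30, 64), torch.randn(2, 30, 64))
print(a_rec.shape, v_rec.shape)  # torch.Size([2, 30, 64]) torch.Size([2, 30, 64])
```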
Affiliation(s)
- Laurent Girin: Univ. Grenoble Alpes, CNRS, Grenoble-INP, GIPSA-lab, France.

6. Hoang NL, Taniguchi T, Hagiwara Y, Taniguchi A. Emergent communication of multimodal deep generative models based on Metropolis-Hastings naming game. Front Robot AI 2024; 10:1290604. [PMID: 38356917] [PMCID: PMC10864618] [DOI: 10.3389/frobt.2023.1290604]
Abstract
Deep generative models (DGMs) are increasingly employed in emergent communication systems; however, their application to multimodal data remains limited. This study proposes a novel model that combines multimodal DGMs with the Metropolis-Hastings (MH) naming game, enabling two agents to focus jointly on a shared subject and develop common vocabularies. The model is shown to handle multimodal data, even when some modalities are missing. Integrating the MH naming game with multimodal variational autoencoders (VAEs) allows agents to form perceptual categories and exchange signs within multimodal contexts. Moreover, fine-tuning the weight ratio to favor a modality that the model could learn and categorize more readily improved communication. Our evaluation of three multimodal approaches, mixture-of-experts (MoE), product-of-experts (PoE), and mixture-of-product-of-experts (MoPoE), suggests an impact on the creation of latent spaces, the internal representations of agents. Our results from experiments with the MNIST + SVHN and Multimodal165 datasets indicate that combining the Gaussian mixture model (GMM), the PoE multimodal VAE, and the MH naming game substantially improved information sharing, knowledge formation, and data reconstruction.
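The core exchange can be illustrated with the Metropolis-Hastings acceptance rule alone: a speaker proposes a sign for a shared observation, and the listener accepts it with probability given by the ratio of its own likelihoods for the proposed and current signs. The sketch below uses toy categorical likelihood tables in place of the multimodal VAE/GMM perceptual models, so the tables, sizes, and update loop are purely illustrative assumptions.

```python
# Illustrative sketch only: the Metropolis-Hastings acceptance step of a naming game with
# toy categorical perceptual models instead of the paper's multimodal VAEs/GMMs.
import numpy as np

rng = np.random.default_rng(1)
n_signs, n_categories = 5, 5

def normalize(x):
    return x / x.sum(axis=-1, keepdims=True)

# Each agent's belief: P(observation category | sign), one row per sign.
lik_speaker = normalize(rng.random((n_signs, n_categories)))
lik_listener = normalize(rng.random((n_signs, n_categories)))

def mh_naming_step(category, current_sign_listener, lik_speaker, lik_listener):
    # Speaker proposes a sign in proportion to how well each sign explains its percept.
    p_prop = normalize(lik_speaker[:, category])
    proposed = rng.choice(n_signs, p=p_prop)
    # Listener accepts with probability min(1, P_listener(cat|proposed) / P_listener(cat|current)).
    ratio = lik_listener[proposed, category] / lik_listener[current_sign_listener, category]
    accept = rng.random() < min(1.0, ratio)
    return (proposed if accept else current_sign_listener), accept

current = 0
for t in range(10):
    current, accepted = mh_naming_step(category=2, current_sign_listener=current,
                                       lik_speaker=lik_speaker, lik_listener=lik_listener)
    print(f"round {t}: sign={current}, accepted={accepted}")
```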
Affiliation(s)
- Nguyen Le Hoang: Graduate School of Information Science and Engineering, Ritsumeikan University, Kusatsu, Shiga, Japan.
- Tadahiro Taniguchi: College of Information Science and Engineering, Ritsumeikan University, Kusatsu, Shiga, Japan.
- Yoshinobu Hagiwara: Research Organization of Science and Technology, Ritsumeikan University, Kusatsu, Shiga, Japan.
- Akira Taniguchi: College of Information Science and Engineering, Ritsumeikan University, Kusatsu, Shiga, Japan.

7. Noda K, Soda T, Yamashita Y. Emergence of number sense through the integration of multimodal information: developmental learning insights from neural network models. Front Neurosci 2024; 18:1330512. [PMID: 38298912] [PMCID: PMC10828047] [DOI: 10.3389/fnins.2024.1330512]
Abstract
Introduction: Associating multimodal information is essential for human cognitive abilities including mathematical skills. Multimodal learning has also attracted attention in the field of machine learning, and it has been suggested that the acquisition of better latent representation plays an important role in enhancing task performance. This study aimed to explore the impact of multimodal learning on representation, and to understand the relationship between multimodal representation and the development of mathematical skills.
Methods: We employed a multimodal deep neural network as the computational model for multimodal associations in the brain. We compared the representations of numerical information, that is, handwritten digits and images containing a variable number of geometric figures, learned through single- and multimodal methods. Next, we evaluated whether these representations were beneficial for downstream arithmetic tasks.
Results: Multimodal training produced better latent representation in terms of clustering quality, which is consistent with previous findings on multimodal learning in deep neural networks. Moreover, the representations learned using multimodal information exhibited superior performance in arithmetic tasks.
Discussion: Our novel findings experimentally demonstrate that changes in acquired latent representations through multimodal association learning are directly related to cognitive functions, including mathematical skills. This supports the possibility that multimodal learning using deep neural network models may offer novel insights into higher cognitive functions.
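As a schematic of the comparison described above, the sketch below builds a shared-latent autoencoder over two numerical modalities and then reuses its frozen latent codes for a downstream arithmetic readout (predicting a sum); the architecture, averaging fusion, and dimensions are assumptions for exposition rather than the study's model.

```python
# Illustrative sketch only: a shared-latent multimodal autoencoder whose frozen latents
# feed a small arithmetic readout, standing in for the study's multimodal DNN.
import torch
import torch.nn as nn

class SharedLatentAE(nn.Module):
    def __init__(self, dim_a=784, dim_b=784, z_dim=16):
        super().__init__()
        self.enc_a = nn.Sequential(nn.Linear(dim_a, 128), nn.ReLU(), nn.Linear(128, z_dim))
        self.enc_b = nn.Sequential(nn.Linear(dim_b, 128), nn.ReLU(), nn.Linear(128, z_dim))
        self.dec_a = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, dim_a))
        self.dec_b = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, dim_b))

    def forward(self, xa, xb):
        z = 0.5 * (self.enc_a(xa) + self.enc_b(xb))      # fuse by averaging modality codes
        return self.dec_a(z), self.dec_b(z), z

ae = SharedLatentAE()
xa, xb = torch.randn(4, 784), torch.randn(4, 784)        # e.g. digit images and dot-pattern images
rec_a, rec_b, z = ae(xa, xb)

# Downstream arithmetic readout on frozen latents: predict the sum of two encoded numbers.
readout = nn.Linear(2 * z.size(1), 1)
z1, z2 = z[:2].detach(), z[2:].detach()                   # pretend these encode the two operands
predicted_sum = readout(torch.cat([z1, z2], dim=-1))
print(predicted_sum.shape)  # torch.Size([2, 1])
```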
Affiliation(s)
- Yuichi Yamashita: Department of Information Medicine, National Institute of Neuroscience, National Center of Neurology and Psychiatry, Kodaira, Japan.

8. Marcinkevičs R, Reis Wolfertstetter P, Klimiene U, Chin-Cheong K, Paschke A, Zerres J, Denzinger M, Niederberger D, Wellmann S, Ozkan E, Knorr C, Vogt JE. Interpretable and intervenable ultrasonography-based machine learning models for pediatric appendicitis. Med Image Anal 2024; 91:103042. [PMID: 38000257] [DOI: 10.1016/j.media.2023.103042]
Abstract
Appendicitis is among the most frequent reasons for pediatric abdominal surgeries. Previous decision support systems for appendicitis have focused on clinical, laboratory, scoring, and computed tomography data and have ignored abdominal ultrasound, despite its noninvasive nature and widespread availability. In this work, we present interpretable machine learning models for predicting the diagnosis, management and severity of suspected appendicitis using ultrasound images. Our approach utilizes concept bottleneck models (CBM) that facilitate interpretation and interaction with high-level concepts understandable to clinicians. Furthermore, we extend CBMs to prediction problems with multiple views and incomplete concept sets. Our models were trained on a dataset comprising 579 pediatric patients with 1709 ultrasound images accompanied by clinical and laboratory data. Results show that our proposed method enables clinicians to utilize a human-understandable and intervenable predictive model without compromising performance or requiring time-consuming image annotation when deployed. For predicting the diagnosis, the extended multiview CBM attained an AUROC of 0.80 and an AUPR of 0.92, performing comparably to similar black-box neural networks trained and tested on the same dataset.
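To illustrate the concept-bottleneck idea with multiple views and clinician intervention, the sketch below maps per-view image features to concept probabilities, pools them over the available views, and classifies from the concepts alone; overwriting a concept before the classifier runs corresponds to an intervention. The feature extractor, concept count, and pooling rule are assumptions, not the paper's configuration.

```python
# Illustrative sketch only: a multiview concept bottleneck model (CBM) with masked view
# pooling and a simple concept-intervention mechanism.
import torch
import torch.nn as nn

class MultiviewCBM(nn.Module):
    def __init__(self, feat_dim=512, n_concepts=8, n_classes=2):
        super().__init__()
        self.concept_head = nn.Linear(feat_dim, n_concepts)   # per-view concept logits
        self.classifier = nn.Linear(n_concepts, n_classes)    # final prediction uses concepts only

    def forward(self, view_feats, view_mask, concept_override=None):
        # view_feats: (B, V, feat_dim), view_mask: (B, V) with 1 for available views
        logits = self.concept_head(view_feats)
        mask = view_mask.unsqueeze(-1)
        pooled = (logits * mask).sum(1) / mask.sum(1).clamp(min=1)   # mean over available views
        concepts = torch.sigmoid(pooled)
        if concept_override is not None:                       # intervention: clinician sets concepts
            keep = torch.isnan(concept_override)               # NaN means "no intervention"
            concepts = torch.where(keep, concepts, concept_override)
        return self.classifier(concepts), concepts

model = MultiviewCBM()
feats = torch.randn(1, 3, 512)                                 # 3 ultrasound views of one patient
mask = torch.tensor([[1.0, 1.0, 0.0]])                         # third view missing
override = torch.full((1, 8), float("nan"))
override[0, 0] = 1.0                                           # clinician asserts concept 0 is present
pred, concepts = model(feats, mask, override)
print(pred.shape, concepts[0, 0].item())                       # torch.Size([1, 2]) 1.0
```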
Affiliation(s)
- Ričards Marcinkevičs: Department of Computer Science, ETH Zurich, Universitätstrasse 6, Zürich, 8092, Switzerland.
- Patricia Reis Wolfertstetter: Department of Pediatric Surgery and Pediatric Orthopedics, Hospital St. Hedwig of the Order of St. John of God, University Children's Hospital Regensburg (KUNO), Steinmetzstrasse 1-3, Regensburg, 93049, Germany; Faculty of Medicine, University of Regensburg, Franz-Josef-Strauss-Allee 11, Regensburg, 93053, Germany.
- Ugne Klimiene: Department of Computer Science, ETH Zurich, Universitätstrasse 6, Zürich, 8092, Switzerland.
- Kieran Chin-Cheong: Department of Computer Science, ETH Zurich, Universitätstrasse 6, Zürich, 8092, Switzerland.
- Alyssia Paschke: Faculty of Medicine, University of Regensburg, Franz-Josef-Strauss-Allee 11, Regensburg, 93053, Germany.
- Julia Zerres: Faculty of Medicine, University of Regensburg, Franz-Josef-Strauss-Allee 11, Regensburg, 93053, Germany.
- Markus Denzinger: Department of Pediatric Surgery and Pediatric Orthopedics, Hospital St. Hedwig of the Order of St. John of God, University Children's Hospital Regensburg (KUNO), Steinmetzstrasse 1-3, Regensburg, 93049, Germany; Faculty of Medicine, University of Regensburg, Franz-Josef-Strauss-Allee 11, Regensburg, 93053, Germany.
- David Niederberger: Department of Computer Science, ETH Zurich, Universitätstrasse 6, Zürich, 8092, Switzerland.
- Sven Wellmann: Faculty of Medicine, University of Regensburg, Franz-Josef-Strauss-Allee 11, Regensburg, 93053, Germany; Division of Neonatology, Hospital St. Hedwig of the Order of St. John of God, University Children's Hospital Regensburg (KUNO), Steinmetzstrasse 1-3, Regensburg, 93049, Germany.
- Ece Ozkan: Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, 43 Vassar Street, Cambridge, 02139, USA.
- Christian Knorr: Department of Pediatric Surgery and Pediatric Orthopedics, Hospital St. Hedwig of the Order of St. John of God, University Children's Hospital Regensburg (KUNO), Steinmetzstrasse 1-3, Regensburg, 93049, Germany.
- Julia E Vogt: Department of Computer Science, ETH Zurich, Universitätstrasse 6, Zürich, 8092, Switzerland.

9. Miyazawa K, Nagai T. Concept formation through multimodal integration using multimodal BERT and VQ-VAE. Adv Robot 2022. [DOI: 10.1080/01691864.2022.2141583]
Affiliation(s)
- Kazuki Miyazawa: Graduate School of Engineering Science, Osaka University, Osaka, Japan.
- Takayuki Nagai: Graduate School of Engineering Science, Osaka University, Osaka, Japan; Artificial Intelligence Exploration Research Center, The University of Electro-Communications, Tokyo, Japan.