1. Guo R, Wei J, Sun L, Yu B, Chang G, Liu D, Zhang S, Yao Z, Xu M, Bu L. A survey on advancements in image-text multimodal models: From general techniques to biomedical implementations. Comput Biol Med 2024; 178:108709. [PMID: 38878398] [DOI: 10.1016/j.compbiomed.2024.108709]
Abstract
With the significant advancements of Large Language Models (LLMs) in Natural Language Processing (NLP), the development of image-text multimodal models has garnered widespread attention. Current surveys on image-text multimodal models mainly focus on representative models or application domains, but lack a review of how general-purpose technical models influence the development of domain-specific models, which is crucial for domain researchers. To address this gap, this paper first reviews the technological evolution of image-text multimodal models, from early explorations of feature space to visual-language encoding structures, and then to the latest large model architectures. Next, from the perspective of technological evolution, we explain how the development of general image-text multimodal technologies promotes the progress of multimodal technologies in the biomedical field, as well as the importance and complexity of domain-specific datasets in biomedicine. Then, centered on the tasks of image-text multimodal models, we analyze their common components and challenges. After that, we summarize the architecture, components, and data of general image-text multimodal models, and introduce the applications and improvements of image-text multimodal models in the biomedical field. Finally, we categorize the challenges faced in the development and application of general models into external and intrinsic factors, further refining them into two external factors and five intrinsic factors, and propose targeted solutions, providing guidance for future research directions. For more details and data, please visit our GitHub page: https://github.com/i2vec/A-survey-on-image-text-multimodal-models.
Affiliation(s)
- Ruifeng Guo, Jingxuan Wei, Linzhuang Sun, Bihui Yu, Guiyong Chang, Dawei Liu, Sibo Zhang, Zhengbing Yao, Mingjun Xu, Liping Bu (all authors): Shenyang Institute of Computing Technology, Chinese Academy of Sciences, Shenyang, 110168, China; University of Chinese Academy of Sciences, Beijing, 100049, China.

2. Yamada Y, Colan J, Davila A, Hasegawa Y. Multimodal semi-supervised learning for online recognition of multi-granularity surgical workflows. Int J Comput Assist Radiol Surg 2024; 19:1075-1083. [PMID: 38558289] [PMCID: PMC11178653] [DOI: 10.1007/s11548-024-03101-6]
Abstract
Purpose: Surgical workflow recognition is a challenging task that requires understanding multiple aspects of surgery, such as gestures, phases, and steps. However, most existing methods focus on single-task or single-modal models and rely on costly annotations for training. To address these limitations, we propose a novel semi-supervised learning approach that leverages multimodal data and self-supervision to create meaningful representations for various surgical tasks.
Methods: Our representation learning approach proceeds in two stages. In the first stage, time contrastive learning is used to learn spatiotemporal visual features from video data, without any labels. In the second stage, a multimodal variational autoencoder (VAE) fuses the visual features with kinematic data to obtain a shared representation, which is fed into recurrent neural networks for online recognition.
Results: Our method was evaluated on two datasets, JIGSAWS and MISAW, and achieved comparable or better performance in multi-granularity workflow recognition than fully supervised models specialized for each task. On the JIGSAWS Suturing dataset, it reaches a gesture recognition accuracy of 83.3%. The model is also more efficient in annotation usage, maintaining high performance with only half of the labels. On the MISAW dataset, it achieves 84.0% AD-Accuracy in phase recognition and 56.8% AD-Accuracy in step recognition.
Conclusion: Our multimodal representation exhibits versatility across various surgical tasks and enhances annotation efficiency. This work has significant implications for real-time decision-making systems within the operating room.
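To make the two-stage design concrete, the sketch below shows one plausible shape of the second stage only: per-frame visual and kinematic features are fused by a product-of-experts multimodal VAE into a shared latent, which a GRU consumes causally for per-frame recognition. All dimensions, the fusion rule, and the layer sizes are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: a simplified multimodal VAE whose shared latent feeds a GRU
# for per-frame surgical workflow recognition. Sizes and fusion rule are assumptions.
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    def __init__(self, in_dim, z_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, z_dim)
        self.logvar = nn.Linear(128, z_dim)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

def product_of_experts(mus, logvars):
    # Fuse per-modality Gaussians together with a standard-normal prior expert.
    precisions = [torch.ones_like(mus[0])] + [torch.exp(-lv) for lv in logvars]
    weighted = [torch.zeros_like(mus[0])] + [m * torch.exp(-lv) for m, lv in zip(mus, logvars)]
    prec = torch.stack(precisions).sum(0)
    mu = torch.stack(weighted).sum(0) / prec
    return mu, -torch.log(prec)

class OnlineRecognizer(nn.Module):
    def __init__(self, vis_dim=256, kin_dim=38, z_dim=32, n_classes=10):
        super().__init__()
        self.enc_v = ModalityEncoder(vis_dim, z_dim)
        self.enc_k = ModalityEncoder(kin_dim, z_dim)
        self.gru = nn.GRU(z_dim, 64, batch_first=True)
        self.head = nn.Linear(64, n_classes)

    def forward(self, vis_seq, kin_seq):
        # vis_seq: (B, T, vis_dim) visual features, kin_seq: (B, T, kin_dim) kinematics
        mu_v, lv_v = self.enc_v(vis_seq)
        mu_k, lv_k = self.enc_k(kin_seq)
        mu, lv = product_of_experts([mu_v, mu_k], [lv_v, lv_k])
        z = mu + torch.exp(0.5 * lv) * torch.randn_like(mu)   # reparameterization trick
        h, _ = self.gru(z)                                     # causal over time, online-friendly
        return self.head(h)                                    # per-frame class logits

logits = OnlineRecognizer()(torch.randn(2, 50, 256), torch.randn(2, 50, 38))
print(logits.shape)  # torch.Size([2, 50, 10])
```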
Affiliation(s)
- Yutaro Yamada: Department of Micro-Nano Mechanical Science and Engineering, Nagoya University, Furo-cho, Chikusa-ku, Nagoya, Aichi, 464-8603, Japan.
- Jacinto Colan: Department of Micro-Nano Mechanical Science and Engineering, Nagoya University, Furo-cho, Chikusa-ku, Nagoya, Aichi, 464-8603, Japan.
- Ana Davila: Institutes of Innovation for Future Society, Nagoya University, Furo-cho, Chikusa-ku, Nagoya, Aichi, 464-8601, Japan.
- Yasuhisa Hasegawa: Department of Micro-Nano Mechanical Science and Engineering, Nagoya University, Furo-cho, Chikusa-ku, Nagoya, Aichi, 464-8603, Japan.

3. Ikegawa Y, Fukuma R, Sugano H, Oshino S, Tani N, Tamura K, Iimura Y, Suzuki H, Yamamoto S, Fujita Y, Nishimoto S, Kishima H, Yanagisawa T. Text and image generation from intracranial electroencephalography using an embedding space for text and images. J Neural Eng 2024; 21:036019. [PMID: 38648781] [DOI: 10.1088/1741-2552/ad417a]
Abstract
Objective: Invasive brain-computer interfaces (BCIs) are promising communication devices for severely paralyzed patients. Recent advances in intracranial electroencephalography (iEEG) coupled with natural language processing have enhanced communication speed and accuracy. It should be noted that such a speech BCI uses signals from the motor cortex. However, BCIs based on motor cortical activities may experience signal deterioration in users with motor cortical degenerative diseases such as amyotrophic lateral sclerosis. An alternative approach to using iEEG of the motor cortex is necessary to support patients with such conditions.
Approach: In this study, a multimodal embedding of text and images was used to decode visual semantic information from iEEG signals of the visual cortex to generate text and images. We used contrastive language-image pretraining (CLIP) embedding to represent images presented to 17 patients implanted with electrodes in the occipital and temporal cortices. A CLIP image vector was inferred from the high-γ power of the iEEG signals recorded while viewing the images.
Main results: Text was generated by CLIPCAP from the inferred CLIP vector with better-than-chance accuracy. Then, an image was created from the generated text using Stable Diffusion with significant accuracy.
Significance: The text and images generated from iEEG through the CLIP embedding vector can be used for improved communication.
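As a toy illustration of the decoding step only (regressing a CLIP-style image vector from high-γ features and scoring it against candidate images), the sketch below uses plain ridge regression on synthetic data; the study's actual decoder, the CLIPCAP captioning stage, and the Stable Diffusion generation stage are not reproduced, and all dimensions are assumptions.

```python
# Illustrative sketch only: ridge regression from iEEG high-gamma features to CLIP-style
# image embeddings, evaluated by cosine-similarity identification on held-out trials.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_trials, n_features, clip_dim = 200, 120, 512             # assumed sizes, not the study's

X = rng.standard_normal((n_trials, n_features))             # per-trial high-gamma power features
Y = rng.standard_normal((n_trials, clip_dim))                # CLIP embeddings of the viewed images

model = Ridge(alpha=10.0).fit(X[:150], Y[:150])               # fit on the first 150 trials
Y_hat = model.predict(X[150:])                                # inferred CLIP vectors, held-out trials

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# Identification: does each predicted vector match its own image better than the others?
hits = sum(int(np.argmax([cosine(pred, y) for y in Y[150:]]) == i)
           for i, pred in enumerate(Y_hat))
print("top-1 identification accuracy:", hits / len(Y_hat))
```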
Affiliation(s)
- Yuya Ikegawa: Institute for Advanced Co-Creation Studies, Osaka University, Suita, Japan.
- Ryohei Fukuma: Institute for Advanced Co-Creation Studies, Osaka University, Suita, Japan; Department of Neurosurgery, Graduate School of Medicine, Osaka University, Suita, Japan.
- Hidenori Sugano: Department of Neurosurgery, Juntendo University, Tokyo, Japan.
- Satoru Oshino: Department of Neurosurgery, Graduate School of Medicine, Osaka University, Suita, Japan.
- Naoki Tani: Department of Neurosurgery, Graduate School of Medicine, Osaka University, Suita, Japan.
- Kentaro Tamura: Department of Neurosurgery, Nara Medical University, Kashihara, Japan.
- Yasushi Iimura: Department of Neurosurgery, Juntendo University, Tokyo, Japan.
- Hiroharu Suzuki: Department of Neurosurgery, Juntendo University, Tokyo, Japan.
- Shota Yamamoto: Department of Neurosurgery, Graduate School of Medicine, Osaka University, Suita, Japan.
- Yuya Fujita: Department of Neurosurgery, Graduate School of Medicine, Osaka University, Suita, Japan.
- Shinji Nishimoto: National Institute of Information and Communications Technology (NICT), Center for Information and Neural Networks (CiNet), Suita, Japan; Graduate School of Frontier Biosciences, Osaka University, Suita, Japan.
- Haruhiko Kishima: Department of Neurosurgery, Graduate School of Medicine, Osaka University, Suita, Japan.
- Takufumi Yanagisawa: Institute for Advanced Co-Creation Studies, Osaka University, Suita, Japan; Department of Neurosurgery, Graduate School of Medicine, Osaka University, Suita, Japan.

4. Fang Y, Zhang X, Xu W, Liu G, Zhao J. Bidirectional visual-tactile cross-modal generation using latent feature space flow model. Neural Netw 2024; 172:106088. [PMID: 38159510] [DOI: 10.1016/j.neunet.2023.12.042]
Abstract
Inspired by the bidirectional visual-tactile cross-modal mapping of the human brain, this paper introduces a novel approach to bidirectional mapping between visual and tactile data, an area not fully explored by existing, predominantly unidirectional studies. First, we adopt separate Variational AutoEncoder (VAE) models for visual and tactile data. We then introduce a conditional flow model built on the VAE latent feature space, enabling cross-modal bidirectional mapping between visual and tactile data with a single model. The experimental results show that our method achieves excellent performance in terms of the similarity between the generated and original data (Structural Similarity Index (SSIM) of 0.58 for visual data and 0.80 for tactile data), the classification accuracy on generated data (91.60% for visual data, 88.05% for tactile data), and the zero-shot classification accuracy between generated data and language (44.49% for visual data, 45.03% for tactile data). To the best of our knowledge, the proposed method is the first to use a single model to achieve bidirectional mapping between visual and tactile data. Our model and code will be made public after acceptance of the paper.
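A minimal sketch of the latent-space idea, assuming the two VAEs are already trained and share a latent dimensionality: an invertible affine-coupling flow maps a visual latent to a tactile latent in one direction and back exactly in the other, so a single model serves both directions. Latent size, block count, and the absence of any conditioning input are simplifying assumptions, not the paper's configuration.

```python
# Illustrative sketch only: an invertible affine-coupling flow between the latent spaces
# of two pretrained VAEs, so one model covers both mapping directions.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim=32, hidden=64):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(nn.Linear(self.half, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * self.half))

    def forward(self, z, reverse=False):
        a, b = z[:, :self.half], z[:, self.half:]
        log_s, t = self.net(a).chunk(2, dim=1)
        log_s = torch.tanh(log_s)                        # bounded scales for numerical stability
        b = (b - t) * torch.exp(-log_s) if reverse else b * torch.exp(log_s) + t
        return torch.cat([a, b], dim=1)

class LatentFlow(nn.Module):
    def __init__(self, dim=32, n_blocks=4):
        super().__init__()
        self.blocks = nn.ModuleList(AffineCoupling(dim) for _ in range(n_blocks))
        self.perms = [torch.randperm(dim) for _ in range(n_blocks)]

    def forward(self, z, reverse=False):
        pairs = list(zip(self.blocks, self.perms))
        for block, p in (reversed(pairs) if reverse else pairs):
            if reverse:
                z = block(z, reverse=True)[:, torch.argsort(p)]  # invert coupling, undo permutation
            else:
                z = block(z[:, p])                               # permute, then couple
        return z

flow = LatentFlow()
z_visual = torch.randn(8, 32)                     # latent codes from a (hypothetical) visual VAE
z_tactile_hat = flow(z_visual)                    # visual -> tactile direction
z_visual_rec = flow(z_tactile_hat, reverse=True)  # tactile -> visual direction recovers the input
print((z_visual - z_visual_rec).abs().max())      # ~0: the same model runs in both directions
```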
Affiliation(s)
- Yu Fang: State Key Laboratory of Robotics and System, Harbin Institute of Technology, No. 2, Yikuang Street, Nangang District, Harbin, 150001, Heilongjiang, China.
- Xuehe Zhang: State Key Laboratory of Robotics and System, Harbin Institute of Technology, No. 2, Yikuang Street, Nangang District, Harbin, 150001, Heilongjiang, China.
- Wenqiang Xu: Department of Computer Science and Engineering, Shanghai Jiao Tong University, No. 800 Dongchuan Road, Minhang District, Shanghai, 200240, China.
- Gangfeng Liu: State Key Laboratory of Robotics and System, Harbin Institute of Technology, No. 2, Yikuang Street, Nangang District, Harbin, 150001, Heilongjiang, China.
- Jie Zhao: State Key Laboratory of Robotics and System, Harbin Institute of Technology, No. 2, Yikuang Street, Nangang District, Harbin, 150001, Heilongjiang, China.

5. Sadok S, Leglaive S, Girin L, Alameda-Pineda X, Séguier R. A multimodal dynamical variational autoencoder for audiovisual speech representation learning. Neural Netw 2024; 172:106120. [PMID: 38266474] [DOI: 10.1016/j.neunet.2024.106120]
Abstract
High-dimensional data such as natural images or speech signals exhibit some form of regularity, preventing their dimensions from varying independently. This suggests that there exists a lower dimensional latent representation from which the high-dimensional observed data were generated. Uncovering the hidden explanatory features of complex data is the goal of representation learning, and deep latent variable generative models have emerged as promising unsupervised approaches. In particular, the variational autoencoder (VAE) which is equipped with both a generative and an inference model allows for the analysis, transformation, and generation of various types of data. Over the past few years, the VAE has been extended to deal with data that are either multimodal or dynamical (i.e., sequential). In this paper, we present a multimodal and dynamical VAE (MDVAE) applied to unsupervised audiovisual speech representation learning. The latent space is structured to dissociate the latent dynamical factors that are shared between the modalities from those that are specific to each modality. A static latent variable is also introduced to encode the information that is constant over time within an audiovisual speech sequence. The model is trained in an unsupervised manner on an audiovisual emotional speech dataset, in two stages. In the first stage, a vector quantized VAE (VQ-VAE) is learned independently for each modality, without temporal modeling. The second stage consists in learning the MDVAE model on the intermediate representation of the VQ-VAEs before quantization. The disentanglement between static versus dynamical and modality-specific versus modality-common information occurs during this second training stage. Extensive experiments are conducted to investigate how audiovisual speech latent factors are encoded in the latent space of MDVAE. These experiments include manipulating audiovisual speech, audiovisual facial image denoising, and audiovisual speech emotion recognition. The results show that MDVAE effectively combines the audio and visual information in its latent space. They also show that the learned static representation of audiovisual speech can be used for emotion recognition with few labeled data, and with better accuracy compared with unimodal baselines and a state-of-the-art supervised model based on an audiovisual transformer architecture.
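For orientation, the sketch below mimics only the latent structure described above: a static code w computed once per sequence, a dynamical code z_av shared by both modalities, and modality-specific dynamical codes z_a and z_v, each feeding the corresponding decoder. The toy layers operate on precomputed per-frame features standing in for the pre-quantization VQ-VAE representations, the stochastic sampling and KL terms of the actual VAE objective are omitted, and all sizes are assumptions rather than the MDVAE configuration.

```python
# Illustrative sketch only: the static/shared/modality-specific latent split described in
# the abstract, over precomputed per-frame audio and visual features.
import torch
import torch.nn as nn

class MDVAESketch(nn.Module):
    def __init__(self, a_dim=64, v_dim=64, w_dim=16, zav_dim=8, zm_dim=8):
        super().__init__()
        self.enc_w = nn.GRU(a_dim + v_dim, w_dim, batch_first=True)   # static: one code per sequence
        self.enc_zav = nn.Linear(a_dim + v_dim, zav_dim)               # shared dynamical latent
        self.enc_za = nn.Linear(a_dim, zm_dim)                         # audio-specific dynamical latent
        self.enc_zv = nn.Linear(v_dim, zm_dim)                         # visual-specific dynamical latent
        self.dec_a = nn.Linear(w_dim + zav_dim + zm_dim, a_dim)
        self.dec_v = nn.Linear(w_dim + zav_dim + zm_dim, v_dim)

    def forward(self, a, v):
        # a: (B, T, a_dim) audio features, v: (B, T, v_dim) visual features
        x = torch.cat([a, v], dim=-1)
        _, w_last = self.enc_w(x)                                      # (1, B, w_dim)
        w = w_last[0].unsqueeze(1).expand(-1, a.size(1), -1)           # broadcast static code over time
        z_av, z_a, z_v = self.enc_zav(x), self.enc_za(a), self.enc_zv(v)
        a_rec = self.dec_a(torch.cat([w, z_av, z_a], dim=-1))
        v_rec = self.dec_v(torch.cat([w, z_av, z_v], dim=-1))
        return a_rec, v_rec

a_rec, v_rec = MDVAESketch()(torch.randn(2, 30, 64), torch.randn(2, 30, 64))
print(a_rec.shape, v_rec.shape)  # torch.Size([2, 30, 64]) torch.Size([2, 30, 64])
```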
Affiliation(s)
- Laurent Girin: Univ. Grenoble Alpes, CNRS, Grenoble-INP, GIPSA-lab, France.

6. Hoang NL, Taniguchi T, Hagiwara Y, Taniguchi A. Emergent communication of multimodal deep generative models based on Metropolis-Hastings naming game. Front Robot AI 2024; 10:1290604. [PMID: 38356917] [PMCID: PMC10864618] [DOI: 10.3389/frobt.2023.1290604]
Abstract
Deep generative models (DGMs) are increasingly employed in emergent communication systems; however, their application to multimodal data remains limited. This study proposes a novel model that combines multimodal DGMs with the Metropolis-Hastings (MH) naming game, enabling two agents to focus jointly on a shared subject and develop common vocabularies. The model is shown to handle multimodal data, even when some modalities are missing. Integrating the MH naming game with multimodal variational autoencoders (VAEs) allows agents to form perceptual categories and exchange signs within multimodal contexts. Moreover, fine-tuning the weight ratio to favor a modality that the model could learn and categorize more readily improved communication. Our evaluation of three multimodal approaches, mixture-of-experts (MoE), product-of-experts (PoE), and mixture-of-product-of-experts (MoPoE), suggests an impact on the creation of latent spaces, the internal representations of agents. Our results from experiments with the MNIST + SVHN and Multimodal165 datasets indicate that combining the Gaussian mixture model (GMM), the PoE multimodal VAE, and the MH naming game substantially improved information sharing, knowledge formation, and data reconstruction.
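The core exchange can be illustrated with the Metropolis-Hastings acceptance rule alone: a speaker proposes a sign for a shared observation, and the listener accepts it with probability given by the ratio of its own likelihoods for the proposed and current signs. The sketch below uses toy categorical likelihood tables in place of the multimodal VAE/GMM perceptual models, so the tables, sizes, and update loop are purely illustrative assumptions.

```python
# Illustrative sketch only: the Metropolis-Hastings acceptance step of a naming game with
# toy categorical perceptual models instead of the paper's multimodal VAEs/GMMs.
import numpy as np

rng = np.random.default_rng(1)
n_signs, n_categories = 5, 5

def normalize(x):
    return x / x.sum(axis=-1, keepdims=True)

# Each agent's belief: P(observation category | sign), one row per sign.
lik_speaker = normalize(rng.random((n_signs, n_categories)))
lik_listener = normalize(rng.random((n_signs, n_categories)))

def mh_naming_step(category, current_sign_listener, lik_speaker, lik_listener):
    # Speaker proposes a sign in proportion to how well each sign explains its percept.
    p_prop = normalize(lik_speaker[:, category])
    proposed = rng.choice(n_signs, p=p_prop)
    # Listener accepts with probability min(1, P_listener(cat|proposed) / P_listener(cat|current)).
    ratio = lik_listener[proposed, category] / lik_listener[current_sign_listener, category]
    accept = rng.random() < min(1.0, ratio)
    return (proposed if accept else current_sign_listener), accept

current = 0
for t in range(10):
    current, accepted = mh_naming_step(category=2, current_sign_listener=current,
                                       lik_speaker=lik_speaker, lik_listener=lik_listener)
    print(f"round {t}: sign={current}, accepted={accepted}")
```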
Affiliation(s)
- Nguyen Le Hoang: Graduate School of Information Science and Engineering, Ritsumeikan University, Kusatsu, Shiga, Japan.
- Tadahiro Taniguchi: College of Information Science and Engineering, Ritsumeikan University, Kusatsu, Shiga, Japan.
- Yoshinobu Hagiwara: Research Organization of Science and Technology, Ritsumeikan University, Kusatsu, Shiga, Japan.
- Akira Taniguchi: College of Information Science and Engineering, Ritsumeikan University, Kusatsu, Shiga, Japan.

7. Noda K, Soda T, Yamashita Y. Emergence of number sense through the integration of multimodal information: developmental learning insights from neural network models. Front Neurosci 2024; 18:1330512. [PMID: 38298912] [PMCID: PMC10828047] [DOI: 10.3389/fnins.2024.1330512]
Abstract
Introduction: Associating multimodal information is essential for human cognitive abilities including mathematical skills. Multimodal learning has also attracted attention in the field of machine learning, and it has been suggested that the acquisition of better latent representation plays an important role in enhancing task performance. This study aimed to explore the impact of multimodal learning on representation, and to understand the relationship between multimodal representation and the development of mathematical skills.
Methods: We employed a multimodal deep neural network as the computational model for multimodal associations in the brain. We compared the representations of numerical information, that is, handwritten digits and images containing a variable number of geometric figures, learned through single- and multimodal methods. Next, we evaluated whether these representations were beneficial for downstream arithmetic tasks.
Results: Multimodal training produced better latent representation in terms of clustering quality, which is consistent with previous findings on multimodal learning in deep neural networks. Moreover, the representations learned using multimodal information exhibited superior performance in arithmetic tasks.
Discussion: Our novel findings experimentally demonstrate that changes in acquired latent representations through multimodal association learning are directly related to cognitive functions, including mathematical skills. This supports the possibility that multimodal learning using deep neural network models may offer novel insights into higher cognitive functions.
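As a schematic of the comparison described above, the sketch below builds a shared-latent autoencoder over two numerical modalities and then reuses its frozen latent codes for a downstream arithmetic readout (predicting a sum); the architecture, averaging fusion, and dimensions are assumptions for exposition rather than the study's model.

```python
# Illustrative sketch only: a shared-latent multimodal autoencoder whose frozen latents
# feed a small arithmetic readout, standing in for the study's multimodal DNN.
import torch
import torch.nn as nn

class SharedLatentAE(nn.Module):
    def __init__(self, dim_a=784, dim_b=784, z_dim=16):
        super().__init__()
        self.enc_a = nn.Sequential(nn.Linear(dim_a, 128), nn.ReLU(), nn.Linear(128, z_dim))
        self.enc_b = nn.Sequential(nn.Linear(dim_b, 128), nn.ReLU(), nn.Linear(128, z_dim))
        self.dec_a = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, dim_a))
        self.dec_b = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, dim_b))

    def forward(self, xa, xb):
        z = 0.5 * (self.enc_a(xa) + self.enc_b(xb))      # fuse by averaging modality codes
        return self.dec_a(z), self.dec_b(z), z

ae = SharedLatentAE()
xa, xb = torch.randn(4, 784), torch.randn(4, 784)        # e.g. digit images and dot-pattern images
rec_a, rec_b, z = ae(xa, xb)

# Downstream arithmetic readout on frozen latents: predict the sum of two encoded numbers.
readout = nn.Linear(2 * z.size(1), 1)
z1, z2 = z[:2].detach(), z[2:].detach()                   # pretend these encode the two operands
predicted_sum = readout(torch.cat([z1, z2], dim=-1))
print(predicted_sum.shape)  # torch.Size([2, 1])
```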
Affiliation(s)
- Yuichi Yamashita: Department of Information Medicine, National Institute of Neuroscience, National Center of Neurology and Psychiatry, Kodaira, Japan.

8. Marcinkevičs R, Reis Wolfertstetter P, Klimiene U, Chin-Cheong K, Paschke A, Zerres J, Denzinger M, Niederberger D, Wellmann S, Ozkan E, Knorr C, Vogt JE. Interpretable and intervenable ultrasonography-based machine learning models for pediatric appendicitis. Med Image Anal 2024; 91:103042. [PMID: 38000257] [DOI: 10.1016/j.media.2023.103042]
Abstract
Appendicitis is among the most frequent reasons for pediatric abdominal surgeries. Previous decision support systems for appendicitis have focused on clinical, laboratory, scoring, and computed tomography data and have ignored abdominal ultrasound, despite its noninvasive nature and widespread availability. In this work, we present interpretable machine learning models for predicting the diagnosis, management and severity of suspected appendicitis using ultrasound images. Our approach utilizes concept bottleneck models (CBM) that facilitate interpretation and interaction with high-level concepts understandable to clinicians. Furthermore, we extend CBMs to prediction problems with multiple views and incomplete concept sets. Our models were trained on a dataset comprising 579 pediatric patients with 1709 ultrasound images accompanied by clinical and laboratory data. Results show that our proposed method enables clinicians to utilize a human-understandable and intervenable predictive model without compromising performance or requiring time-consuming image annotation when deployed. For predicting the diagnosis, the extended multiview CBM attained an AUROC of 0.80 and an AUPR of 0.92, performing comparably to similar black-box neural networks trained and tested on the same dataset.
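To illustrate the concept-bottleneck idea with multiple views and clinician intervention, the sketch below maps per-view image features to concept probabilities, pools them over the available views, and classifies from the concepts alone; overwriting a concept before the classifier runs corresponds to an intervention. The feature extractor, concept count, and pooling rule are assumptions, not the paper's configuration.

```python
# Illustrative sketch only: a multiview concept bottleneck model (CBM) with masked view
# pooling and a simple concept-intervention mechanism.
import torch
import torch.nn as nn

class MultiviewCBM(nn.Module):
    def __init__(self, feat_dim=512, n_concepts=8, n_classes=2):
        super().__init__()
        self.concept_head = nn.Linear(feat_dim, n_concepts)   # per-view concept logits
        self.classifier = nn.Linear(n_concepts, n_classes)    # final prediction uses concepts only

    def forward(self, view_feats, view_mask, concept_override=None):
        # view_feats: (B, V, feat_dim), view_mask: (B, V) with 1 for available views
        logits = self.concept_head(view_feats)
        mask = view_mask.unsqueeze(-1)
        pooled = (logits * mask).sum(1) / mask.sum(1).clamp(min=1)   # mean over available views
        concepts = torch.sigmoid(pooled)
        if concept_override is not None:                       # intervention: clinician sets concepts
            keep = torch.isnan(concept_override)               # NaN means "no intervention"
            concepts = torch.where(keep, concepts, concept_override)
        return self.classifier(concepts), concepts

model = MultiviewCBM()
feats = torch.randn(1, 3, 512)                                 # 3 ultrasound views of one patient
mask = torch.tensor([[1.0, 1.0, 0.0]])                         # third view missing
override = torch.full((1, 8), float("nan"))
override[0, 0] = 1.0                                           # clinician asserts concept 0 is present
pred, concepts = model(feats, mask, override)
print(pred.shape, concepts[0, 0].item())                       # torch.Size([1, 2]) 1.0
```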
Affiliation(s)
- Ričards Marcinkevičs: Department of Computer Science, ETH Zurich, Universitätstrasse 6, Zürich, 8092, Switzerland.
- Patricia Reis Wolfertstetter: Department of Pediatric Surgery and Pediatric Orthopedics, Hospital St. Hedwig of the Order of St. John of God, University Children's Hospital Regensburg (KUNO), Steinmetzstrasse 1-3, Regensburg, 93049, Germany; Faculty of Medicine, University of Regensburg, Franz-Josef-Strauss-Allee 11, Regensburg, 93053, Germany.
- Ugne Klimiene: Department of Computer Science, ETH Zurich, Universitätstrasse 6, Zürich, 8092, Switzerland.
- Kieran Chin-Cheong: Department of Computer Science, ETH Zurich, Universitätstrasse 6, Zürich, 8092, Switzerland.
- Alyssia Paschke: Faculty of Medicine, University of Regensburg, Franz-Josef-Strauss-Allee 11, Regensburg, 93053, Germany.
- Julia Zerres: Faculty of Medicine, University of Regensburg, Franz-Josef-Strauss-Allee 11, Regensburg, 93053, Germany.
- Markus Denzinger: Department of Pediatric Surgery and Pediatric Orthopedics, Hospital St. Hedwig of the Order of St. John of God, University Children's Hospital Regensburg (KUNO), Steinmetzstrasse 1-3, Regensburg, 93049, Germany; Faculty of Medicine, University of Regensburg, Franz-Josef-Strauss-Allee 11, Regensburg, 93053, Germany.
- David Niederberger: Department of Computer Science, ETH Zurich, Universitätstrasse 6, Zürich, 8092, Switzerland.
- Sven Wellmann: Faculty of Medicine, University of Regensburg, Franz-Josef-Strauss-Allee 11, Regensburg, 93053, Germany; Division of Neonatology, Hospital St. Hedwig of the Order of St. John of God, University Children's Hospital Regensburg (KUNO), Steinmetzstrasse 1-3, Regensburg, 93049, Germany.
- Ece Ozkan: Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, 43 Vassar Street, Cambridge, 02139, USA.
- Christian Knorr: Department of Pediatric Surgery and Pediatric Orthopedics, Hospital St. Hedwig of the Order of St. John of God, University Children's Hospital Regensburg (KUNO), Steinmetzstrasse 1-3, Regensburg, 93049, Germany.
- Julia E Vogt: Department of Computer Science, ETH Zurich, Universitätstrasse 6, Zürich, 8092, Switzerland.

9. Miyazawa K, Nagai T. Concept formation through multimodal integration using multimodal BERT and VQ-VAE. Adv Robot 2022. [DOI: 10.1080/01691864.2022.2141583]
Affiliation(s)
- Kazuki Miyazawa: Graduate School of Engineering Science, Osaka University, Osaka, Japan.
- Takayuki Nagai: Graduate School of Engineering Science, Osaka University, Osaka, Japan; Artificial Intelligence Exploration Research Center, The University of Electro-Communications, Tokyo, Japan.