1. Hariri M, Aydın A, Sıbıç O, Somuncu E, Yılmaz S, Sönmez S, Avşar E. LesionScanNet: dual-path convolutional neural network for acute appendicitis diagnosis. Health Inf Sci Syst 2025; 13:3. [PMID: 39654693] [PMCID: PMC11625030] [DOI: 10.1007/s13755-024-00321-7]
Abstract
Acute appendicitis is an abrupt inflammation of the appendix that causes symptoms such as abdominal pain, vomiting, and fever. Computed tomography (CT) is a useful tool for the accurate diagnosis of acute appendicitis; however, diagnosis from CT is challenging due to factors such as the anatomical structure of the colon and the localization of the appendix in CT images. In this paper, a novel convolutional neural network for the computer-aided detection of acute appendicitis, named LesionScanNet, is proposed. For this purpose, a dataset of 2400 CT scan images was collected by the Department of General Surgery at Kanuni Sultan Süleyman Research and Training Hospital, Istanbul, Turkey. LesionScanNet is a lightweight model with 765 K parameters and includes multiple DualKernel blocks, where each block contains convolution, expansion, and separable convolution layers together with skip connections. Each DualKernel block processes its input along two paths, one using 3 × 3 filters and the other 1 × 1 filters. The LesionScanNet model achieved an accuracy of 99% on the test set, exceeding the performance of the benchmark deep learning models. In addition, the generalization ability of LesionScanNet was demonstrated on a chest X-ray dataset for pneumonia and COVID-19 detection. In conclusion, LesionScanNet is a lightweight and robust network that achieves superior performance with a smaller number of parameters, and its use can be extended to other medical application domains.
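As a rough illustration of the dual-path idea described above, the sketch below implements a DualKernel-style block in PyTorch: a 3 × 3 separable-convolution path with channel expansion runs alongside a 1 × 1 path, and a skip connection joins them. The layer ordering, expansion factor, and fusion by addition are assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class DualKernelBlock(nn.Module):
    """Sketch of a dual-path block: 3x3 separable path + 1x1 path + skip."""
    def __init__(self, channels, expansion=2):
        super().__init__()
        mid = channels * expansion
        # Path A: 1x1 expansion followed by a 3x3 depthwise-separable convolution
        self.path3x3 = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, groups=mid, bias=False),  # depthwise
            nn.Conv2d(mid, channels, 1, bias=False),                    # pointwise
            nn.BatchNorm2d(channels),
        )
        # Path B: plain 1x1 convolution
        self.path1x1 = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # Fuse both paths and add the skip connection
        return self.act(self.path3x3(x) + self.path1x1(x) + x)

x = torch.randn(1, 32, 224, 224)
print(DualKernelBlock(32)(x).shape)  # torch.Size([1, 32, 224, 224])
```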
Affiliation(s)
- Muhab Hariri
- Electrical and Electronics Engineering Department, Çukurova University, 01330 Adana, Turkey
- Ahmet Aydın
- Biomedical Engineering Department, Çukurova University, 01330 Adana, Turkey
- Osman Sıbıç
- General Surgery Department, Derik State Hospital, 47800 Mardin, Turkey
- Erkan Somuncu
- General Surgery Department, Kanuni Sultan Suleyman Research and Training Hospital, 34303 Istanbul, Turkey
- Serhan Yılmaz
- General Surgery Department, Bilkent City Hospital, 06800 Ankara, Turkey
- Süleyman Sönmez
- Interventional Radiology Department, Kanuni Sultan Suleyman Research and Training Hospital, 34303 Istanbul, Turkey
- Ercan Avşar
- Section for Fisheries Technology, Institute of Aquatic Resources, DTU Aqua, Technical University of Denmark, 9850 Hirtshals, Denmark
2. Li Q. Visual image reconstructed without semantics from human brain activity using linear image decoders and nonlinear noise suppression. Cogn Neurodyn 2025; 19:20. [PMID: 39801914] [PMCID: PMC11718044] [DOI: 10.1007/s11571-024-10184-z]
Abstract
In recent years, substantial strides have been made in visual image reconstruction, particularly in generating high-quality visual representations from human brain activity while considering semantic information. This advancement not only enables the recreation of visual content but also provides valuable insights into the intricate processes occurring within high-order functional brain regions, contributing to a deeper understanding of brain function. However, incorporating fused semantics when reconstructing visual images from brain activity amounts to semantic-to-image guided reconstruction; it may ignore the underlying neural computational mechanisms and thus does not represent true reconstruction from brain activity. In response to this limitation, our study introduces a novel approach that combines linear mapping with nonlinear noise suppression to reconstruct visual images perceived by subjects based on their brain activity patterns. The primary challenge associated with linear mapping lies in its susceptibility to noise interference. To address this issue, we leverage a flexible denoising deep convolutional neural network to suppress the noise left by the linear mapping. Our investigation encompasses linear mapping as well as the training of shallow and deep autoencoder denoising networks, including a pre-trained, state-of-the-art denoising neural network. The outcome of our study reveals that combining linear image decoding with nonlinear noise reduction significantly enhances the quality of images reconstructed from human brain activity. This suggests that our methodology holds promise for decoding intricate perceptual experiences directly from brain activity patterns without semantic information. Moreover, the model has strong neural explanatory power because it shares structural and functional similarities with the visual brain.
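A minimal sketch of the two-stage pipeline described above: a ridge-regularized linear decoder maps voxel activity to pixels, and a nonlinear denoiser is then applied to the noisy linear reconstruction. The toy data shapes and the placeholder denoiser are assumptions; the paper works with real fMRI recordings and denoising CNNs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_voxels, n_pixels = 200, 500, 28 * 28
X = rng.standard_normal((n_trials, n_voxels))   # brain activity per trial
Y = rng.standard_normal((n_trials, n_pixels))   # seen images (flattened)

# Ridge-regularized linear decoder: W = (X^T X + lambda I)^-1 X^T Y
lam = 10.0
W = np.linalg.solve(X.T @ X + lam * np.eye(n_voxels), X.T @ Y)
Y_hat = X @ W                                   # noisy linear reconstruction

def denoise(img_flat):
    # Placeholder for the nonlinear noise-suppression stage; in the paper
    # this is a denoising CNN applied to the linear reconstruction.
    return img_flat.reshape(28, 28)

recon = denoise(Y_hat[0])                       # final reconstructed image
```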
Affiliation(s)
- Qiang Li
- Image Processing Laboratory, University of Valencia, Valencia, Spain
- Tri-Institutional Center for Translational Research in Neuroimaging and Data Science (TReNDS), Georgia State University, Georgia Institute of Technology, Emory University, Atlanta, GA, USA
3. Tsang CC, Zhao C, Liu Y, Lin KPK, Tang JYM, Cheng KO, Chow FWN, Yao W, Chan KF, Poon SNL, Wong KYC, Zhou L, Mak OTN, Lee JCY, Zhao S, Ngan AHY, Wu AKL, Fung KSC, Que TL, Teng JLL, Schnieders D, Yiu SM, Lau SKP, Woo PCY. Automatic identification of clinically important Aspergillus species by artificial intelligence-based image recognition: proof-of-concept study. Emerg Microbes Infect 2025; 14:2434573. [PMID: 39585232] [PMCID: PMC11632928] [DOI: 10.1080/22221751.2024.2434573]
Abstract
While morphological examination is the most widely used method for Aspergillus identification in clinical laboratories, PCR sequencing and MALDI-TOF MS are emerging technologies in better-resourced laboratories. However, these approaches require mycological expertise, molecular biologists, and/or expensive equipment. Recently, artificial intelligence (AI), especially image recognition, has been increasingly employed in medicine for fast and automated disease diagnosis. We explored the potential utility of AI in identifying Aspergillus species. In this proof-of-concept study, using 2813, 2814 and 1240 images from four clinically important Aspergillus species for training, validation and testing, respectively, the performance and accuracy of automatic Aspergillus identification from colonial images by three different convolutional neural networks were evaluated. Results demonstrated that ResNet-18 outperformed Inception-v3 and DenseNet-121 and was the algorithm of choice because it made the fewest misidentifications (n = 8) and possessed the highest testing accuracy (99.35%). Images showing more distinctive morphological features were more accurately identified. AI-based image recognition using colonial images is a promising technology for Aspergillus identification. Given its short turnaround time, minimal demand for expertise, low reagent/equipment costs and user-friendliness, it has the potential to serve as a routine laboratory diagnostic tool once the database is further expanded.
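For orientation, a sketch of the kind of transfer-learning setup the study evaluates: ResNet-18 with its final layer replaced for four Aspergillus species. The optimizer, learning rate, and input pipeline are assumptions; only the architecture choice and the four-class setting come from the abstract.

```python
import torch
import torch.nn as nn
from torchvision import models

# ImageNet-pretrained ResNet-18, re-headed for four colony classes
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 4)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch of colony images
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 4, (8,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```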
Affiliation(s)
- Chi-Ching Tsang
- School of Medical and Health Sciences, Tung Wah College, Homantin, Hong Kong
- Department of Microbiology, School of Clinical Medicine, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong
- Chenyang Zhao
- Department of Microbiology, School of Clinical Medicine, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong
- Yueh Liu
- Doctoral Program in Translational Medicine and Department of Life Sciences, National Chung Hsing University, Taichung, Taiwan
- Ken P. K. Lin
- Department of Microbiology, School of Clinical Medicine, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong
- James Y. M. Tang
- Department of Microbiology, School of Clinical Medicine, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong
- Kar-On Cheng
- Department of Microbiology, School of Clinical Medicine, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong
- Franklin W. N. Chow
- Department of Microbiology, School of Clinical Medicine, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong
- Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Hunghom, Hong Kong
- Weiming Yao
- Department of Microbiology, School of Clinical Medicine, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong
- Ka-Fai Chan
- Department of Microbiology, School of Clinical Medicine, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong
- Sharon N. L. Poon
- Department of Microbiology, School of Clinical Medicine, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong
- Kelly Y. C. Wong
- Department of Microbiology, School of Clinical Medicine, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong
- Lianyi Zhou
- Department of Microbiology, School of Clinical Medicine, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong
- Oscar T. N. Mak
- Department of Microbiology, School of Clinical Medicine, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong
- Jeremy C. Y. Lee
- Department of Microbiology, School of Clinical Medicine, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong
- Suhui Zhao
- Department of Microbiology, School of Clinical Medicine, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong
- Antonio H. Y. Ngan
- Department of Microbiology, School of Clinical Medicine, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong
- Alan K. L. Wu
- Department of Clinical Pathology, Pamela Youde Nethersole Eastern Hospital, Chai Wan, Hong Kong
- Kitty S. C. Fung
- Department of Pathology, United Christian Hospital, Kwun Tong, Hong Kong
- Tak-Lun Que
- Department of Clinical Pathology, Tuen Mun Hospital, Tuen Mun, Hong Kong
- Jade L. L. Teng
- Faculty of Dentistry, The University of Hong Kong, Sai Ying Pun, Hong Kong
- Dirk Schnieders
- Department of Computer Science, Faculty of Engineering, The University of Hong Kong, Pokfulam, Hong Kong
- Siu-Ming Yiu
- Department of Computer Science, Faculty of Engineering, The University of Hong Kong, Pokfulam, Hong Kong
- Susanna K. P. Lau
- Department of Microbiology, School of Clinical Medicine, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong
- Patrick C. Y. Woo
- Department of Microbiology, School of Clinical Medicine, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong
- Doctoral Program in Translational Medicine and Department of Life Sciences, National Chung Hsing University, Taichung, Taiwan
- The iEGG and Animal Biotechnology Research Center, National Chung Hsing University, Taichung, Taiwan
4. Xu J, Cao R, Luo P, Mu D. Break Adhesion: Triple adaptive-parsing for weakly supervised instance segmentation. Neural Netw 2025; 186:107215. [PMID: 39951880] [DOI: 10.1016/j.neunet.2025.107215]
Abstract
Weakly supervised instance segmentation (WSIS) aims to precisely identify individual instances from weakly supervised semantic segmentation. Existing WSIS techniques primarily employ a unified, fixed threshold to identify all peaks in semantic maps, which may lead to missed or false detections because instances of the same category can have diverse visual characteristics. Moreover, previous methods apply a fixed augmentation strategy to broadly propagate peak cues to contributing regions, resulting in instance adhesion. To eliminate these manually fixed parsing patterns, we propose a triple adaptive-parsing network. Specifically, an adaptive Peak Perception Module (PPM) uses the average feature response as a learning basis to infer the optimal threshold. Simultaneously, we propose a Shrinkage Loss (SL) to minimize outlier responses that deviate from the mean. Finally, by eliminating uncertain adhesion, our method obtains Reliable Inter-instance Relationships (RIR), enhancing the representation of instances. Extensive experiments on the Pascal VOC and COCO datasets show that the proposed method improves accuracy by 2.1% and 4.3%, respectively, achieving state-of-the-art performance and significantly improving instance segmentation. The code is available at https://github.com/Elaineok/TAP.
Affiliation(s)
- Jingting Xu
- School of Automation, Northwestern Polytechnical University, Xi'an, 710129, China
- Rui Cao
- School of Computer Science and Technology, Northwest University, Xi'an, 710127, China
- Peng Luo
- School of Automation, Northwestern Polytechnical University, Xi'an, 710129, China
- Dejun Mu
- School of Automation, Northwestern Polytechnical University, Xi'an, 710129, China; Research & Development Institute of Northwestern Polytechnical University, Shenzhen, 518057, China
5. Karamimanesh M, Abiri E, Shahsavari M, Hassanli K, van Schaik A, Eshraghian J. Spiking neural networks on FPGA: A survey of methodologies and recent advancements. Neural Netw 2025; 186:107256. [PMID: 39965527] [DOI: 10.1016/j.neunet.2025.107256]
Abstract
By mimicking the information-processing structure of the biological brain, spiking neural networks (SNNs) exhibit significantly reduced power consumption compared to conventional systems. Consequently, these networks have garnered heightened attention and spurred extensive research in recent years, with various structures proposed to achieve low power consumption, high speed, and improved recognition ability. However, researchers are still in the early stages of developing more efficient neural networks that more closely resemble the biological brain. This research requires suitable hardware with appropriate capabilities, and the field-programmable gate array (FPGA) is a highly qualified candidate compared to existing hardware such as the central processing unit (CPU) and graphics processing unit (GPU). With brain-like parallel processing capabilities, lower latency and power consumption, and higher throughput, FPGAs are well suited to supporting the development of spiking neural networks. In this review, we aim to ease researchers' path to further developing this field by collecting and examining recent works and the challenges that hinder the implementation of these networks on FPGA.
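As a concrete anchor for the surveyed designs, the sketch below shows a discrete-time leaky integrate-and-fire (LIF) update, the basic neuron model most FPGA implementations realize in hardware. The update form and parameter values are generic textbook choices, not tied to any single surveyed work.

```python
import numpy as np

def lif_step(v, i_in, beta=0.9, v_th=1.0):
    """One time step: leak, integrate input, spike and reset on threshold."""
    v = beta * v + i_in                     # leaky integration
    spike = (v >= v_th).astype(float)       # binary, event-driven output
    v = v * (1.0 - spike)                   # hard reset after a spike
    return v, spike

v = np.zeros(4)                             # membrane potentials of 4 neurons
for t in range(10):
    v, s = lif_step(v, i_in=np.random.rand(4) * 0.3)
    print(t, s)                             # spike trains over time
```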
Affiliation(s)
- Mehrzad Karamimanesh
- Department of Electrical Engineering, Shiraz University of Technology, Shiraz, Iran
- Ebrahim Abiri
- Department of Electrical Engineering, Shiraz University of Technology, Shiraz, Iran
- Mahyar Shahsavari
- AI Department, Donders Institute for Brain Cognition and Behaviour, Radboud University, Nijmegen, The Netherlands
- Kourosh Hassanli
- Department of Electrical Engineering, Shiraz University of Technology, Shiraz, Iran
- André van Schaik
- The MARCS Institute, International Centre for Neuromorphic Systems, Western Sydney University, Australia
- Jason Eshraghian
- Department of Electrical Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
6. Wang XZ, Yang DH, Yan ZP, You XD, Yin XY, Chen Y, Wang T, Wu HL, Yu RQ. Ultrafast on-site adulteration detection and quantification in Asian black truffle using smartphone-based computer vision. Talanta 2025; 288:127743. [PMID: 39965382] [DOI: 10.1016/j.talanta.2025.127743]
Abstract
Asian black truffle Tuber sinense (BT) is a premium edible fungus with medicinal value, but it is prone to adulteration. This study aims to develop a fast, non-destructive, automatic, and intelligent method for identifying BT. A novel lightweight convolutional neural network incorporating knowledge distillation (FastBTNet) improves model efficiency on smartphones while maintaining high performance. The well-trained model, coupled with a fast object-location technique, was further employed for the absolute quantification of adulteration in BT. Results showed that FastBTNet achieved 99.0 % classification accuracy, an 8.5 % root mean squared error in predicting adulteration levels, and a runtime of 5.3 s for predicting 1024 samples. Additionally, Grad-CAM was used to investigate the model's recognition mechanism, and the strategy received a perfect score in the greenness assessment. These methods were deployed in a smartphone app, "Truffle Identifier," which enables ultrafast on-site identification of a batch of samples and assists in predicting adulteration levels.
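The abstract does not spell out the distillation objective, so the sketch below shows a conventional knowledge-distillation loss of the kind a lightweight student such as FastBTNet could be trained with: a temperature-softened KL term against teacher logits plus a hard-label term. The temperature, weighting, and class count are assumed defaults.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft term: student matches temperature-softened teacher distribution
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                      # standard T^2 gradient rescaling
    # Hard term: ordinary supervised cross-entropy
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(8, 3)   # student logits, 3 illustrative adulteration classes
t = torch.randn(8, 3)   # teacher logits
y = torch.randint(0, 3, (8,))
print(distillation_loss(s, t, y))
```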
Affiliation(s)
- Xiao-Zhi Wang
- State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan University, Changsha, 410082, China
- De-Huan Yang
- State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan University, Changsha, 410082, China
- Zhan-Peng Yan
- College of Artificial Intelligence, Changsha NanFang Professional College, Changsha, 410208, China
- Xu-Dong You
- State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan University, Changsha, 410082, China
- Xiao-Yue Yin
- State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan University, Changsha, 410082, China
- Yao Chen
- State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan University, Changsha, 410082, China; Hunan Key Lab of Biomedical Materials and Devices, College of Life Sciences and Chemistry, Hunan University of Technology, Zhuzhou, 412007, China
- Tong Wang
- State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan University, Changsha, 410082, China
- Hai-Long Wu
- State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan University, Changsha, 410082, China
- Ru-Qin Yu
- State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan University, Changsha, 410082, China
7. Li X, Li L, Jiang Y, Wang H, Qiao X, Feng T, Luo H, Zhao Y. Vision-Language Models in medical image analysis: From simple fusion to general large models. Information Fusion 2025; 118:102995. [DOI: 10.1016/j.inffus.2025.102995]
8. Genç İY, Gürfidan R, Yiğit T. Quality prediction of seabream Sparus aurata by deep learning algorithms and explainable artificial intelligence. Food Chem 2025; 474:143150. [PMID: 39923522] [DOI: 10.1016/j.foodchem.2025.143150]
Abstract
In this study, a Convolutional Neural Network (CNN) and the DenseNet121, Inception V3 and ResNet50 architectures were used to determine quality changes in sea bream stored under refrigerator conditions using eye and gill images. The sea bream were categorized into three freshness categories (fresh, moderate and spoiled) and analysed with the machine learning algorithms. According to the confusion-matrix values, the prediction performance of the model reached 100 %, with the lowest value of 98.42 % obtained for the spoiled class with the eye parameter. The outputs of the machine learning algorithms were analysed with Explainable Artificial Intelligence (XAI) methods (Grad-CAM and LIME). The study concluded that the CNN and DenseNet121 models, developed along with Grad-CAM and LIME, constitute a non-destructive method that can be used to determine the freshness of sea bream under refrigerator conditions.
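A minimal Grad-CAM sketch of the kind used here to visualize which image regions (eye or gill) drive a freshness prediction. The backbone, target layer, and untrained weights are placeholders; any CNN with a spatial feature map works the same way.

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet50(weights=None).eval()   # untrained, to keep the sketch download-free
feats, grads = {}, {}
layer = model.layer4                           # last convolutional stage
layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

x = torch.randn(1, 3, 224, 224)                # stand-in for a fish-eye image
score = model(x)[0].max()                      # top class score
score.backward()

w = grads["a"].mean(dim=(2, 3), keepdim=True)  # channel weights from gradients
cam = F.relu((w * feats["a"]).sum(dim=1))      # weighted sum of feature maps
cam = F.interpolate(cam[None], size=(224, 224), mode="bilinear")[0, 0]
print(cam.shape)                               # heat map over the input image
```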
Affiliation(s)
- İsmail Yüksel Genç
- Department of Fishing and Processing Technology, Eğirdir Faculty of Fisheries, Isparta University of Applied Sciences, Isparta, Turkey
- Remzi Gürfidan
- Computer Programming, Yalvaç Vocational School of Technical Sciences, Isparta University of Applied Sciences, Isparta, Turkey
- Tuncay Yiğit
- Department of Computer Engineering, Faculty of Engineering and Natural Sciences, Süleyman Demirel University, Isparta, Turkey
9. Brožová A, Šmídl V, Tichý O, Evangeliou N. Spatial-temporal source term estimation using deep neural network prior and its application to Chernobyl wildfires. J Hazard Mater 2025; 488:137510. [PMID: 39922073] [DOI: 10.1016/j.jhazmat.2025.137510]
Abstract
The source term of atmospheric emissions of hazardous materials is a crucial aspect of the analysis of unintended releases. Motivated by wildfires in regions contaminated by radioactivity, the focus is placed on airborne transmission of material described in five dimensions: spatial location given by longitude and latitude in an area with potentially many sources, time profile, height above ground level, and the size of the particles carrying the material. Since the atmospheric inverse problem is typically ill-posed and the number of measurements is usually too low to estimate the whole 5D tensor, some prior information is necessary. For the first time in this domain, a method based on the deep image prior, which uses the structure of a deep neural network to regularize the inversion, is proposed. The network is initialized randomly, without the need to train it on any dataset first. In tandem with variational optimization, this approach not only introduces smoothness in the spatial estimate of the emissions but also reduces the number of unknowns by enforcing a prior covariance structure in the source term. The strengths of this method are demonstrated on the case of 137Cs emissions during the Chernobyl wildfires in 2020.
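A toy sketch of the deep-image-prior idea described above: a randomly initialized convolutional network reparameterizes the unknown source field, and only its weights are optimized against a linear measurement model y = Hs. The tiny network, field size, and random operator are illustrative assumptions, not the paper's setup.

```python
import torch
import torch.nn as nn

H = torch.randn(50, 64 * 64)             # source-receptor matrix (stand-in)
y = torch.randn(50)                       # observed concentrations (stand-in)

net = nn.Sequential(                      # untrained CNN prior over the 2D field
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1), nn.Softplus(),   # non-negative emissions
)
z = torch.randn(1, 8, 64, 64)             # fixed random input code

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(200):
    s = net(z).reshape(-1)                # candidate source term
    loss = ((H @ s - y) ** 2).sum()       # data misfit under the transport model
    opt.zero_grad(); loss.backward(); opt.step()
```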
Affiliation(s)
- Antonie Brožová
- Institute of Information Theory and Automation, Czech Academy of Sciences, Pod Vodárenskou věží 4, Prague 18200, Czech Republic; Department of Mathematics, Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University in Prague, Trojanova 13, Prague 11200, Czech Republic
- Václav Šmídl
- Institute of Information Theory and Automation, Czech Academy of Sciences, Pod Vodárenskou věží 4, Prague 18200, Czech Republic
- Ondřej Tichý
- Institute of Information Theory and Automation, Czech Academy of Sciences, Pod Vodárenskou věží 4, Prague 18200, Czech Republic
- Nikolaos Evangeliou
- NILU, Department of Atmospheric & Climate Research (ATMOS), PO Box 100, Kjeller 2027, Norway
10. He A, Wu Y, Wang Z, Li T, Fu H. DVPT: Dynamic Visual Prompt Tuning of large pre-trained models for medical image analysis. Neural Netw 2025; 185:107168. [PMID: 39827840] [DOI: 10.1016/j.neunet.2025.107168]
Abstract
Pre-training and fine-tuning have become popular due to the rich representations embedded in large pre-trained models, which can be leveraged for downstream medical tasks. However, existing methods typically either fine-tune all parameters or only task-specific layers of pre-trained models, overlooking the variability of input medical images. As a result, these approaches may lack efficiency or effectiveness. In this study, our goal is to explore parameter-efficient fine-tuning (PEFT) for medical image analysis. To address this challenge, we introduce a novel method called Dynamic Visual Prompt Tuning (DVPT). It can extract knowledge beneficial to downstream tasks from large models with only a few trainable parameters. First, the frozen features are transformed by a lightweight bottleneck layer to learn the domain-specific distribution of downstream medical tasks. Then, a few learnable visual prompts are employed as dynamic queries that conduct cross-attention with the transformed features, aiming to acquire sample-specific features. This DVPT module can be shared across different Transformer layers, further reducing the number of trainable parameters. We conduct extensive experiments with various pre-trained models on medical classification and segmentation tasks. We find that this PEFT method not only efficiently adapts pre-trained models to the medical domain but also enhances data efficiency with limited labeled data. For example, with only 0.5% additional trainable parameters, our method not only outperforms state-of-the-art PEFT methods but also surpasses full fine-tuning by more than 2.20% in Kappa score on the medical classification task. It can save up to 60% of the labeled data and 99% of the storage cost of ViT-B/16.
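A rough sketch of the DVPT mechanism as described: frozen features pass through a lightweight bottleneck, and a few learnable prompts act as queries in cross-attention to extract sample-specific features. The dimensions, single attention head, and initialization scale are assumptions; in the paper the module is shared across Transformer layers.

```python
import torch
import torch.nn as nn

class DVPTModule(nn.Module):
    def __init__(self, dim=768, bottleneck=64, num_prompts=4):
        super().__init__()
        self.bottleneck = nn.Sequential(      # domain-specific transform
            nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim),
        )
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)

    def forward(self, frozen_tokens):         # (B, N, dim) from a frozen ViT
        x = self.bottleneck(frozen_tokens)
        q = self.prompts.unsqueeze(0).expand(frozen_tokens.size(0), -1, -1)
        out, _ = self.attn(q, x, x)           # prompts query the features
        return out                            # (B, num_prompts, dim)

tokens = torch.randn(2, 197, 768)             # frozen ViT-B/16 token features
print(DVPTModule()(tokens).shape)             # torch.Size([2, 4, 768])
```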
Affiliation(s)
- Along He
- College of Computer Science, Tianjin Key Laboratory of Network and Data Security Technology, Nankai University, Tianjin, 300350, China
- Yanlin Wu
- College of Computer Science, Tianjin Key Laboratory of Network and Data Security Technology, Nankai University, Tianjin, 300350, China
- Zhihong Wang
- College of Computer Science, Tianjin Key Laboratory of Network and Data Security Technology, Nankai University, Tianjin, 300350, China
- Tao Li
- College of Computer Science, Tianjin Key Laboratory of Network and Data Security Technology, Nankai University, Tianjin, 300350, China
- Huazhu Fu
- Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), 138632, Singapore
11. Liu M, Tang J, Chen Y, Li H, Qi J, Li S, Wang K, Gan J, Wang Y, Chen H. Spiking-PhysFormer: Camera-based remote photoplethysmography with parallel spike-driven transformer. Neural Netw 2025; 185:107128. [PMID: 39817982] [DOI: 10.1016/j.neunet.2025.107128]
Abstract
Artificial neural networks (ANNs) can help camera-based remote photoplethysmography (rPPG) measure cardiac activity and physiological signals from facial videos, such as pulse wave, heart rate and respiration rate, with better accuracy. However, most existing ANN-based methods require substantial computing resources, which poses challenges for effective deployment on mobile devices. Spiking neural networks (SNNs), on the other hand, hold immense potential for energy-efficient deep learning owing to their binary and event-driven architecture. To the best of our knowledge, we are the first to introduce SNNs into the realm of rPPG, proposing a hybrid neural network (HNN) model, the Spiking-PhysFormer, aimed at reducing power consumption. Specifically, the proposed Spiking-PhysFormer consists of an ANN-based patch embedding block, SNN-based transformer blocks, and an ANN-based predictor head. First, to simplify the transformer block while preserving its capacity to aggregate local and global spatio-temporal features, we design a parallel spike transformer block to replace sequential sub-blocks. Additionally, we propose a simplified spiking self-attention mechanism that omits the value parameter without compromising the model's performance. Experiments conducted on four datasets (PURE, UBFC-rPPG, UBFC-Phys, and MMPD) demonstrate that the proposed model achieves a 10.1% reduction in power consumption compared to PhysFormer. In addition, the power consumption of the transformer block is reduced by a factor of 12.2 while maintaining performance comparable to PhysFormer and other ANN-based models.
Affiliation(s)
- Yongli Chen
- Beijing Smartchip Microelectronics Technology Co., Ltd, Beijing, China
- Siwei Li
- Tsinghua University, Beijing, China
- Jie Gan
- Beijing Smartchip Microelectronics Technology Co., Ltd, Beijing, China
- Yuntao Wang
- Tsinghua University, Beijing, China; National Key Laboratory of Human Factors Engineering, Beijing, China
- Hong Chen
- Tsinghua University, Beijing, China
12. Jiang Z, Tang N, Sun J, Zhan Y. Combining various training and adaptation algorithms for ensemble few-shot classification. Neural Netw 2025; 185:107211. [PMID: 39889377] [DOI: 10.1016/j.neunet.2025.107211]
Abstract
To mitigate the shortage of labeled data, Few-Shot Classification (FSC) methods train deep neural networks (DNNs) on a base dataset with sufficient labeled data and then adapt them to target tasks using a few labeled examples. Despite notable progress, a single FSC model remains prone to high variance and low confidence. As a result, ensemble FSC has garnered increasing attention. However, the limited labeled data and the high computational cost associated with DNNs present significant challenges for ensemble FSC methods. This paper presents a novel ensemble method that generates multiple FSC models by combining various training and adaptation algorithms. Because training phases are reused, the proposed method significantly reduces the learning cost while generating base models with greater diversity. To further minimize reliance on labeled data, we provide each model with pseudo-labeled data selected by the majority vote of the other models. Compared with self-training-style methods, this "one-vs-others" learning strategy effectively reduces pseudo-label noise and confirmation bias. Finally, we conduct extensive experiments on the miniImageNet, tieredImageNet and CUB datasets. The experimental results demonstrate that our method outperforms other state-of-the-art FSC methods. In particular, our method achieves the greatest improvement in the performance of the base models. The source code and related models are available at https://github.com/tn1999tn/Ensemble-FSC/tree/master.
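A toy sketch of the "one-vs-others" pseudo-labeling rule described above: each ensemble member receives pseudo-labels only where the majority vote of the *other* members agrees, which limits confirmation bias. The vote arrays and agreement threshold are illustrative assumptions.

```python
import numpy as np

preds = np.array([                 # predictions of 4 ensemble members
    [0, 1, 2, 1, 0],               # on 5 unlabeled samples
    [0, 1, 2, 2, 0],
    [0, 2, 2, 1, 1],
    [0, 1, 2, 1, 0],
])

def pseudo_labels_for(member, preds, min_agree=2):
    """Pseudo-labels for one member from the majority vote of the others."""
    others = np.delete(preds, member, axis=0)
    keep, labels = [], []
    for j in range(preds.shape[1]):
        vals, counts = np.unique(others[:, j], return_counts=True)
        if counts.max() >= min_agree:          # enough agreement among others
            keep.append(j)
            labels.append(vals[counts.argmax()])
    return np.array(keep), np.array(labels)

idx, lab = pseudo_labels_for(0, preds)
print(idx, lab)   # samples and pseudo-labels member 0 would train on
```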
Affiliation(s)
- Zhen Jiang
- School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, China
- Na Tang
- School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, China
- Jianlong Sun
- School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, China
- Yongzhao Zhan
- School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, China
13. Hernández-Cámara P, Vila-Tomás J, Laparra V, Malo J. Dissecting the effectiveness of deep features as metric of perceptual image quality. Neural Netw 2025; 185:107189. [PMID: 39874824] [DOI: 10.1016/j.neunet.2025.107189]
Abstract
There is an open debate on the role of artificial networks in understanding the visual brain. Internal representations of images in artificial networks develop human-like properties. In particular, evaluating distortions using differences between internal features correlates with human perception of distortion. However, the origins of this correlation are not well understood. Here, we dissect the different factors involved in the emergence of human-like behavior: function, architecture, and environment. To do so, we evaluate the aforementioned human-network correlation at different depths of 46 pre-trained model configurations that include no psycho-visual information. The results show that most of the models correlate better with human opinion than SSIM (a de facto standard in subjective image quality). Moreover, some models are better than state-of-the-art networks specifically tuned for the application (LPIPS, DISTS). Regarding function, supervised classification leads to networks that correlate better with humans than the explored models for self-supervised and unsupervised tasks. However, we found that better performance on the task does not imply more human-like behavior. Regarding architecture, simpler models correlate better with humans than very deep networks, and the highest correlation is generally not achieved in the last layer. Finally, regarding environment, training with large natural datasets leads to higher correlations than training on smaller databases with restricted content, as expected. We also found that the best classification models are not the best at predicting human distances. In the general debate about understanding human vision, our empirical findings imply that explanations should not focus on a single abstraction level: function, architecture, and environment are all relevant.
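A minimal sketch of the kind of deep-feature distortion metric the paper dissects: the distance between two images is the mean squared difference of intermediate activations of a pre-trained classifier. The backbone, layer choice, and absence of feature normalization are simplifying assumptions (LPIPS-style metrics add learned per-channel weights and unit-normalized features).

```python
import torch
from torchvision import models
from torchvision.models.feature_extraction import create_feature_extractor

backbone = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
# Tap an intermediate convolutional layer of the frozen classifier
extractor = create_feature_extractor(backbone, return_nodes={"features.15": "feat"})

def deep_feature_distance(img_a, img_b):
    with torch.no_grad():
        fa = extractor(img_a)["feat"]
        fb = extractor(img_b)["feat"]
    return ((fa - fb) ** 2).mean().item()   # feature-space MSE as "distortion"

a = torch.randn(1, 3, 224, 224)
print(deep_feature_distance(a, a + 0.1 * torch.randn_like(a)))
```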
Affiliation(s)
- Jorge Vila-Tomás
- Image Processing Lab., Universitat de València, 46980 Paterna, Spain
- Valero Laparra
- Image Processing Lab., Universitat de València, 46980 Paterna, Spain
- Jesús Malo
- Image Processing Lab., Universitat de València, 46980 Paterna, Spain
14. Bao J, Zhang J, Zhang C, Bao L. DCTCNet: Sequency discrete cosine transform convolution network for visual recognition. Neural Netw 2025; 185:107143. [PMID: 39847941] [DOI: 10.1016/j.neunet.2025.107143]
Abstract
The discrete cosine transform (DCT) has been widely used in computer vision tasks due to its high compression ratio and high-quality visual presentation. However, conventional DCT is usually affected by the size of the transform region and suffers from blocking effects. Therefore, eliminating blocking effects so the transform can efficiently serve vision tasks is significant and challenging. In this paper, we introduce the All Phase Sequency DCT (APSeDCT) into convolutional networks to extract multi-frequency information from deep features. Because APSeDCT is equivalent to a convolution operation, we construct a corresponding convolution module, called APSeDCT Convolution (APSeDCTConv), that has transferability similar to vanilla convolution. We then propose an augmented convolutional operator called MultiConv built on APSeDCTConv. By replacing the last three bottleneck blocks of ResNet with MultiConv, our approach not only reduces computational cost and the number of parameters but also exhibits strong performance on classification, object detection and instance segmentation tasks. Extensive experiments show that APSeDCTConv augmentation leads to consistent performance improvements in image classification on ImageNet across various models and scales, including ResNet, Res2Net and ResNeXt, and achieves 0.5%-1.1% and 0.4%-0.7% AP improvements for object detection and instance segmentation, respectively, on the COCO benchmark compared to the baseline.
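To make the DCT-as-convolution equivalence concrete, the sketch below builds the k × k 2D DCT-II basis as fixed convolution kernels, so each output channel carries one frequency band. The kernel size and stride-1 sliding window are illustrative; APSeDCT's all-phase design adds more structure on top of this basic equivalence.

```python
import math
import torch
import torch.nn.functional as F

def dct_conv_weights(k=4):
    """DCT-II basis as conv kernels: one output channel per (u, v) frequency."""
    w = torch.zeros(k * k, 1, k, k)
    for u in range(k):
        for v in range(k):
            cu = math.sqrt(1 / k) if u == 0 else math.sqrt(2 / k)
            cv = math.sqrt(1 / k) if v == 0 else math.sqrt(2 / k)
            for x in range(k):
                for y in range(k):
                    w[u * k + v, 0, x, y] = (
                        cu * cv
                        * math.cos((2 * x + 1) * u * math.pi / (2 * k))
                        * math.cos((2 * y + 1) * v * math.pi / (2 * k))
                    )
    return w

img = torch.randn(1, 1, 32, 32)
bands = F.conv2d(img, dct_conv_weights())   # sliding-window DCT, no blocking grid
print(bands.shape)                          # torch.Size([1, 16, 29, 29])
```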
Affiliation(s)
- Jiayong Bao
- School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an 710049, China
- Jiangshe Zhang
- School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an 710049, China
- Chunxia Zhang
- School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an 710049, China
- Lili Bao
- School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an 710049, China
15. Fan J, Chen M, Gu Z, Yang J, Wu H, Wu J. SSIM over MSE: A new perspective for video anomaly detection. Neural Netw 2025; 185:107115. [PMID: 39855001] [DOI: 10.1016/j.neunet.2024.107115]
Abstract
Video anomaly detection plays a crucial role in ensuring public safety. Its goal is to detect abnormal patterns in video frames. Most existing models distinguish anomalies based on the Mean Squared Error (MSE), which aligns poorly with human perception, resulting in discrepancies between model-detected anomalies and those recognized by humans. Unlike the Human Visual System (HVS), these models are trained to prioritize texture over shape, which leads to poor interpretability and limited performance. To address these limitations, we propose to optimize video anomaly detection models from the perspective of human visual relevance. The optimization infrastructure includes a novel Structural Similarity Index (SSIM)-based loss, a novel SSIM-based anomaly score, and a spatial-temporal enhancement block in 3D convolution (STE-3D). The SSIM loss helps the model emphasize shape information in videos rather than texture. The SSIM-based anomaly score evaluates video frames in a way that aligns more closely with human visual perception. STE-3D improves the model's capacity to capture spatial-temporal features and compensates for the SSIM loss's limitations in capturing temporal features. STE-3D is lightweight and integrates seamlessly into existing video anomaly detection models based on 3D convolution. Extensive experiments and ablation studies were conducted on four challenging video anomaly detection benchmarks, i.e., UCSD Ped1, UCSD Ped2, CUHK Avenue, and ShanghaiTech. The experimental results validate the efficacy of the proposed approaches in improving video anomaly detection performance.
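A minimal single-scale SSIM sketch of the kind the paper builds on, usable both as a training loss (1 − SSIM) and as an anomaly score between a predicted and an observed frame. The uniform averaging window and the standard constants are common defaults; the paper's exact windowing scheme is an assumption here.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, window=11, C1=0.01 ** 2, C2=0.03 ** 2):
    """Single-scale SSIM with a uniform window; inputs in [0, 1], NCHW."""
    pad = window // 2
    mu_x = F.avg_pool2d(x, window, 1, pad)
    mu_y = F.avg_pool2d(y, window, 1, pad)
    sigma_x = F.avg_pool2d(x * x, window, 1, pad) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, window, 1, pad) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, window, 1, pad) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return (num / den).mean()

pred = torch.rand(1, 1, 64, 64)        # reconstructed/predicted frame
target = torch.rand(1, 1, 64, 64)      # observed frame
loss = 1 - ssim(pred, target)          # SSIM-based training loss
anomaly_score = (1 - ssim(pred, target)).item()   # higher = more anomalous
```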
Affiliation(s)
- Jin Fan
- Department of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, 310018, Zhejiang, China; Zhejiang Provincial Key Laboratory of Internet in Discrete Industries, Hangzhou Dianzi University, Hangzhou, 310018, Zhejiang, China; Research and Development Center of Transport Industry of New Generation of Artificial Intelligence Technology, Hangzhou, 310018, Zhejiang, China
- Miao Chen
- Department of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, 310018, Zhejiang, China
- Zhangyu Gu
- Department of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, 310018, Zhejiang, China
- Jiajun Yang
- Department of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, 310018, Zhejiang, China
- Huifeng Wu
- Department of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, 310018, Zhejiang, China; Zhejiang Provincial Key Laboratory of Internet in Discrete Industries, Hangzhou Dianzi University, Hangzhou, 310018, Zhejiang, China
- Jia Wu
- Department of Computing, Macquarie University, Sydney, New South Wales, Australia
16. Li J, Mo W, Song F, Sun C, Qiang W, Su B, Zheng C. Supporting vision-language model few-shot inference with confounder-pruned knowledge prompt. Neural Netw 2025; 185:107173. [PMID: 39855003] [DOI: 10.1016/j.neunet.2025.107173]
Abstract
Vision-language models are pre-trained by aligning image-text pairs in a common space to handle open-set visual concepts. Recent works adopt fixed or learnable prompts, i.e., classification weights synthesized from natural language descriptions of task-relevant categories, to reduce the gap between the pre-training and inference phases. However, how and which prompts improve inference performance remains unclear. In this paper, we explicitly clarify the importance of incorporating semantic information into prompts, whereas existing prompting methods generate prompts without sufficiently exploring the semantic information of textual labels. Manually constructing prompts with rich semantics requires domain expertise and is extremely time-consuming. To cope with this issue, we propose a knowledge-aware prompt learning method, namely Confounder-pruned Knowledge Prompt (CPKP), which retrieves an ontology knowledge graph by treating the textual label as a query to extract task-relevant semantic information. CPKP further introduces a double-tier confounder-pruning procedure to refine the derived semantic information. Adhering to the individual causal effect principle, graph-tier confounders are gradually identified and phased out. Feature-tier confounders are eliminated by following the maximum entropy principle of information theory. Empirically, the evaluations demonstrate the effectiveness of CPKP in few-shot inference; e.g., with only two shots, CPKP outperforms the manual-prompt method by 4.64% and the learnable-prompt method by 1.09% on average.
Affiliation(s)
- Jiangmeng Li
- National Key Laboratory of Space Integrated Information System, Institute of Software, Chinese Academy of Sciences, Beijing, China
- Wenyi Mo
- Renmin University of China, Beijing, China
- Fei Song
- National Key Laboratory of Space Integrated Information System, Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
- Chuxiong Sun
- National Key Laboratory of Space Integrated Information System, Institute of Software, Chinese Academy of Sciences, Beijing, China
- Wenwen Qiang
- National Key Laboratory of Space Integrated Information System, Institute of Software, Chinese Academy of Sciences, Beijing, China
- Bing Su
- Renmin University of China, Beijing, China
- Changwen Zheng
- National Key Laboratory of Space Integrated Information System, Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
17. Babé A, Cuingnet R, Scuturici M, Miguet S. Generalization abilities of foundation models in waste classification. Waste Manag 2025; 198:187-197. [PMID: 40054101] [DOI: 10.1016/j.wasman.2025.02.032]
Abstract
Industrial waste classification systems based on computer vision require strong generalization across locations and time periods in order to be deployed. This study investigates the potential of foundation models, known for their adaptability to a wide range of tasks and promising generalization capabilities, to serve as the basis for such systems. To evaluate the generalization performance of foundation models, we use five waste classification datasets spanning various domains, training the models on one dataset and testing them on all others. Additionally, we explore various training procedures to optimize foundation-model adaptation to this specific domain. Our findings reveal that foundation models exhibit superior generalization abilities compared to standard models and that good generalization performance correlates with the model size and the size of the pretraining dataset. Furthermore, we demonstrate that elaborate classifier heads are not necessary for extracting discriminative features from foundation models. Both standard fine-tuning and Parameter-Efficient Fine-Tuning (PEFT) improve generalization performance, with PEFT being particularly effective for larger models. Simple data augmentation techniques were found to be ineffective. Overall, applying foundation models to industrial waste classification yields very promising results.
Affiliation(s)
- Aloïs Babé
- Université Lumière Lyon 2, CNRS, Ecole Centrale de Lyon, INSA Lyon, Université Claude Bernard Lyon 1, LIRIS, UMR 5205, Bron 69676, France; Veolia Scientific & Technical Expertise Department, Maisons-Laffitte 78600, France
- Rémi Cuingnet
- Veolia Scientific & Technical Expertise Department, Maisons-Laffitte 78600, France
- Mihaela Scuturici
- Université Lumière Lyon 2, CNRS, Ecole Centrale de Lyon, INSA Lyon, Université Claude Bernard Lyon 1, LIRIS, UMR 5205, Bron 69676, France
- Serge Miguet
- Université Lumière Lyon 2, CNRS, Ecole Centrale de Lyon, INSA Lyon, Université Claude Bernard Lyon 1, LIRIS, UMR 5205, Bron 69676, France
18. Lin L, Chen Y, Ma Z, Lei M. Quantitative ultrasonic characterization of fractal-based pore distribution homogeneity with variable observation scales in heterogeneous medium. Ultrasonics 2025; 149:107596. [PMID: 39929091] [DOI: 10.1016/j.ultras.2025.107596]
Abstract
Characterizing pore distribution homogeneity in heterogeneous media is difficult due to the lack of a quantitative description of homogeneity, and the degree of homogeneity is closely related to the measurement method and observation scale. In this paper, a quantitative ultrasonic characterization strategy based on fractal theory, which takes into account the principle of matching the observation scale to the acoustic beam size, is proposed. Ultrasonic signals containing information about a heterogeneous seal coating are acquired by the water-immersion ultrasonic pulse-echo method to characterize pore distribution homogeneity. The fractal dimension D and the multifractal spectrum symmetry B are used to parameterize the pore distribution homogeneity of microscopic images within the acoustic beam size. By establishing simulation models combined with experimental microscopic images, the effects of pore number and size distribution on the ultrasonic attenuation coefficient α are analyzed. Furthermore, the relationships between the attenuation coefficient and the two fractal parameters are established to quantitatively characterize pore distribution homogeneity at porosities of 1 %-6 % and scales ranging from several to tens of microns. Finally, the correlation coefficient R and root mean square error RMSE of the attenuation coefficient varying with the two fractal parameters are compared at observation scales of 3 mm, 2 mm, 1 mm, and 0.5 mm. Matching the observation scale to the acoustic beam size proves crucial for quantitative ultrasonic characterization of fractal-based pore distribution homogeneity in heterogeneous media; under the testing conditions in this research, the observation scale should be equal to or larger than the acoustic beam size, i.e., ≥ 2 mm.
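As a concrete example of the fractal parameterization, the sketch below estimates the fractal dimension D of a binarized pore map by box counting. The synthetic porosity map and box sizes are illustrative assumptions; real inputs would be binarized microscopic images cropped to the acoustic beam size.

```python
import numpy as np

def box_counting_dimension(binary_img, sizes=(2, 4, 8, 16, 32)):
    """Estimate fractal dimension D from the slope of log N(s) vs log(1/s)."""
    counts = []
    h, w = binary_img.shape
    for s in sizes:
        # Count boxes of side s that contain at least one pore pixel
        n = sum(
            binary_img[i:i + s, j:j + s].any()
            for i in range(0, h, s)
            for j in range(0, w, s)
        )
        counts.append(n)
    slope = np.polyfit(np.log(1.0 / np.array(sizes)), np.log(counts), 1)[0]
    return slope

pores = np.random.rand(128, 128) < 0.03   # toy pore map, ~3 % porosity
print(box_counting_dimension(pores))
```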
Affiliation(s)
- Li Lin
- NDT & E Laboratory, Dalian University of Technology, Dalian 116024, China
- Yijia Chen
- NDT & E Laboratory, Dalian University of Technology, Dalian 116024, China
- Zhiyuan Ma
- NDT & E Laboratory, Dalian University of Technology, Dalian 116024, China
- Mingkai Lei
- School of Materials Science and Engineering, Dalian University of Technology, Dalian 116024, China
19. Lan Z, Li Z, Yan C, Xiang X, Tang D, Wu M, Chen Z. RMKD: Relaxed matching knowledge distillation for short-length SSVEP-based brain-computer interfaces. Neural Netw 2025; 185:107133. [PMID: 39862529] [DOI: 10.1016/j.neunet.2025.107133]
Abstract
Accurate decoding of electroencephalogram (EEG) signals in the shortest possible time is essential for realizing a high-performance brain-computer interface (BCI) system based on the steady-state visual evoked potential (SSVEP). However, the degradation in decoding performance for short-length EEG signals is often unavoidable due to the reduced information they carry, which hinders the deployment of BCI systems in real-world applications. In this paper, we propose a relaxed matching knowledge distillation (RMKD) method that transfers both feature-level and logit-level knowledge in a relaxed manner to improve the decoding performance for short-length EEG signals. Specifically, long-length and short-length EEG signals are decoded into frequency representations by the teacher and student models, respectively. At the feature level, frequency-masked generation distillation is designed to improve the representation ability of student features by forcing randomly masked student features to regenerate the full teacher features. At the logit level, non-target class knowledge distillation and inter-class relation distillation are combined to mitigate loss conflicts by imitating the distribution over non-target classes and preserving the inter-class relations in the prediction vectors of the teacher and student models. We conduct comprehensive experiments on two public SSVEP datasets in the subject-independent scenario with six different signal lengths. The extensive experimental results demonstrate that the proposed RMKD method significantly improves the decoding performance of short-length EEG signals in SSVEP-based BCI systems.
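The non-target class distillation mentioned above can be illustrated with a sketch in the spirit of decoupled knowledge distillation: the KL term is computed over the non-target classes only, so matching the target-class probability does not dominate the loss. The temperature and masking scheme follow that literature; the paper's relaxed matching rules add further components not shown here.

```python
import torch
import torch.nn.functional as F

def non_target_kd(student_logits, teacher_logits, target, T=2.0):
    """KL between teacher and student distributions over non-target classes."""
    n, c = student_logits.shape
    mask = F.one_hot(target, c).bool()
    s = student_logits[~mask].view(n, c - 1)   # drop each row's target logit
    t = teacher_logits[~mask].view(n, c - 1)
    return F.kl_div(
        F.log_softmax(s / T, dim=1),
        F.softmax(t / T, dim=1),
        reduction="batchmean",
    ) * (T * T)

s = torch.randn(4, 40)                 # student logits (e.g., 40 SSVEP targets)
t = torch.randn(4, 40)                 # teacher logits from long-length EEG
y = torch.randint(0, 40, (4,))
print(non_target_kd(s, t, y))
```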
Affiliation(s)
- Zhen Lan
- College of Intelligence Science and Technology, National University of Defense Technology, Changsha, 410073, China; Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR), 138632, Singapore
- Zixing Li
- College of Intelligence Science and Technology, National University of Defense Technology, Changsha, 410073, China
- Chao Yan
- College of Automation Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, 211106, China
- Xiaojia Xiang
- College of Intelligence Science and Technology, National University of Defense Technology, Changsha, 410073, China
- Dengqing Tang
- College of Intelligence Science and Technology, National University of Defense Technology, Changsha, 410073, China
- Min Wu
- Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR), 138632, Singapore
- Zhenghua Chen
- Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR), 138632, Singapore
20. Lou M, Ying H, Liu X, Zhou HY, Zhang Y, Yu Y. SDR-Former: A Siamese Dual-Resolution Transformer for liver lesion classification using 3D multi-phase imaging. Neural Netw 2025; 185:107228. [PMID: 39908910] [DOI: 10.1016/j.neunet.2025.107228]
Abstract
Automated classification of liver lesions in multi-phase CT and MR scans is of clinical significance but challenging. This study proposes a novel Siamese Dual-Resolution Transformer (SDR-Former) framework, specifically designed for liver lesion classification in 3D multi-phase CT and MR imaging with varying phase counts. The proposed SDR-Former utilizes a streamlined Siamese Neural Network (SNN) to process multi-phase imaging inputs, providing robust feature representations while maintaining computational efficiency. The weight-sharing feature of the SNN is further enriched by a hybrid Dual-Resolution Transformer (DR-Former), comprising a 3D Convolutional Neural Network (CNN) and a tailored 3D Transformer for processing high- and low-resolution images, respectively. This hybrid sub-architecture excels at capturing detailed local features and understanding global contextual information, thereby boosting the SNN's feature extraction capabilities. Additionally, a novel Adaptive Phase Selection Module (APSM) is introduced, promoting phase-specific intercommunication and dynamically adjusting each phase's influence on the diagnostic outcome. The proposed SDR-Former framework has been validated through comprehensive experiments on two clinically collected datasets: a 3-phase CT dataset and an 8-phase MR dataset. The experimental results affirm the efficacy of the proposed framework. To support the scientific community, we are releasing our extensive multi-phase MR dataset for liver lesion analysis to the public. This pioneering dataset, the first publicly available multi-phase MR dataset in this field, also underpins the MICCAI LLD-MMRI Challenge. The dataset is available at: https://github.com/LMMMEng/LLD-MMRI-Dataset.
Affiliation(s)
- Meng Lou
- School of Computing and Data Science, The University of Hong Kong, Hong Kong SAR, China; AI Lab, Deepwise Healthcare, Beijing, China
- Hanning Ying
- Department of General Surgery, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
- Hong-Yu Zhou
- School of Computing and Data Science, The University of Hong Kong, Hong Kong SAR, China; Department of Biomedical Informatics, Harvard Medical School, Boston, USA
- Yuqin Zhang
- Department of Radiology, The Affiliated LiHuiLi Hospital of Ningbo University, Ningbo, Zhejiang, China
- Yizhou Yu
- School of Computing and Data Science, The University of Hong Kong, Hong Kong SAR, China
21. Zhao W, Li W, Tian Y, Hu E, Liu W, Zhang B, Zhang W, Yang H. S3H: Long-tailed classification via spatial constraint sampling, scalable network, and hybrid task. Neural Netw 2025; 185:107247. [PMID: 39938439] [DOI: 10.1016/j.neunet.2025.107247]
Abstract
Long-tailed classification is a significant yet challenging vision task that aims to make clear decision boundaries by integrating semantic consistency and texture characteristics. Unlike prior methods, we design spatial constraint sampling and a scalable network to bolster the extraction of well-balanced features during training. Simultaneously, we propose a hybrid task to optimize models, which integrates single-model classification and cross-model contrastive learning so that the two complement each other in capturing comprehensive features. Concretely, the sampling strategy meticulously furnishes the model with spatially constrained samples, encouraging the model to integrate high-level semantic and low-level texture representative features. The scalable network and hybrid task enable the features learned by the model to be dynamically adjusted and kept consistent with the true data distribution. These designs effectively dismantle the constraints associated with multi-stage optimization, thereby ushering in innovative possibilities for the end-to-end training of long-tailed classification tasks. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the CIFAR10-LT, CIFAR100-LT, ImageNet-LT, and iNaturalist 2018 datasets. The codes and model weights will be available at https://github.com/WilyZhao8/S3H.
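The hybrid task reads as a two-term objective. The sketch below is a generic stand-in, not the paper's loss: an InfoNCE-style contrastive term is added to the usual cross-entropy, and the temperature tau and weight lam are assumed for illustration:

```python
# A generic two-term stand-in for the hybrid task, not the paper's loss:
# cross-entropy classification plus an InfoNCE-style cross-model
# contrastive term.
import torch
import torch.nn.functional as F

def hybrid_loss(logits, labels, z_a, z_b, tau: float = 0.1, lam: float = 0.5):
    # z_a, z_b: L2-normalized embeddings of the same batch from two models.
    ce = F.cross_entropy(logits, labels)    # single-model classification
    sim = z_a @ z_b.t() / tau               # (B, B) cross-model similarity
    targets = torch.arange(len(z_a))        # positives on the diagonal
    nce = F.cross_entropy(sim, targets)     # contrastive term
    return ce + lam * nce

B, C, D = 8, 10, 128
logits, labels = torch.randn(B, C), torch.randint(0, C, (B,))
z_a = F.normalize(torch.randn(B, D), dim=1)
z_b = F.normalize(torch.randn(B, D), dim=1)
print(hybrid_loss(logits, labels, z_a, z_b).item())
```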
Collapse
Affiliation(s)
- Wenyi Zhao
- School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, 100876, China.
| | - Wei Li
- School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, 100876, China.
| | - Yongqin Tian
- School of Information Engineering, Henan Institute of Science and Technology, Xinxiang, 453003, China.
| | - Enwen Hu
- School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, 100876, China.
| | - Wentao Liu
- School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, 100876, China.
| | - Bin Zhang
- School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, 100876, China.
| | - Weidong Zhang
- School of Information Engineering, Henan Institute of Science and Technology, Xinxiang, 453003, China.
| | - Huihua Yang
- School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, 100876, China.
| |
Collapse
|
22
|
Wei W, Ye Y, Chen G, Zhao Y, Yang X, Zhang L, Zhang Y. SAR remote sensing image segmentation based on feature enhancement. Neural Netw 2025; 185:107190. [PMID: 39884178 DOI: 10.1016/j.neunet.2025.107190] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/02/2024] [Revised: 12/10/2024] [Accepted: 01/16/2025] [Indexed: 02/01/2025]
Abstract
Synthetic aperture radar (SAR) images are crucial in remote sensing due to their ability to capture high-quality images regardless of environmental conditions. Though SAR image segmentation has been studied for years, the following aspects still limit further improvement. (1) Due to the unique imaging mechanism of SAR, the influence of speckle noise cannot be avoided. (2) High-resolution SAR remote sensing images contain complex surface features, and the intersection of multiple targets makes boundary information unclear. To address these problems, we propose a SAR remote sensing image segmentation method based on feature enhancement. Specifically, we first apply a wavelet transform to the original SAR remote sensing image and use an encoder-decoder network to learn structural features; this enhances feature expression and mitigates the impact of speckle noise. Second, we design a post-processing refinement module consisting of a small cascaded encoder-decoder, which refines the segmentation results and makes boundary information clearer. Finally, to further enhance the segmentation results, we incorporate a self-distillation module into the encoder; this strengthens hierarchical interaction in the encoder, enabling the shallow layers to learn semantic information better for segmentation. Experiments on two SAR image segmentation datasets demonstrate the effectiveness of the proposed method.
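The wavelet-based enhancement step can be illustrated in a few lines. This sketch, an assumption of how the decomposition might feed the network rather than the paper's code, uses PyWavelets to turn a single-channel SAR tile into four subband channels for an encoder-decoder:

```python
# A sketch of wavelet-based input enhancement (assumed details, not the
# paper's implementation): decompose a SAR tile with a 2-D discrete
# wavelet transform and stack the subbands as network input channels.
import numpy as np
import pywt

def wavelet_channels(img: np.ndarray, wavelet: str = "haar") -> np.ndarray:
    """Return (4, H/2, W/2): the approximation and three detail subbands."""
    cA, (cH, cV, cD) = pywt.dwt2(img, wavelet)
    return np.stack([cA, cH, cV, cD], axis=0)

sar = np.random.rand(256, 256).astype(np.float32)  # stand-in SAR tile
print(wavelet_channels(sar).shape)                 # (4, 128, 128)
```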
Collapse
Affiliation(s)
- Wei Wei
- School of Computer Science, Northwestern Polytechnical University, Xi'an 710129, China.
| | - Yanyu Ye
- School of Computer Science, Northwestern Polytechnical University, Xi'an 710129, China
| | - Guochao Chen
- School of Computer Science, Northwestern Polytechnical University, Xi'an 710129, China
| | - Yuming Zhao
- School of Computer Science, Northwestern Polytechnical University, Xi'an 710129, China
| | - Xin Yang
- School of Computer Science, Northwestern Polytechnical University, Xi'an 710129, China
| | - Lei Zhang
- School of Computer Science, Northwestern Polytechnical University, Xi'an 710129, China.
| | - Yanning Zhang
- School of Computer Science, Northwestern Polytechnical University, Xi'an 710129, China
| |
Collapse
|
23
|
Zu K, Zhang H, Zhang L, Lu J, Xu C, Chen H, Zheng Y. EMBANet: A flexible efficient multi-branch attention network. Neural Netw 2025; 185:107248. [PMID: 39951863 DOI: 10.1016/j.neunet.2025.107248] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2024] [Revised: 01/09/2025] [Accepted: 02/02/2025] [Indexed: 02/17/2025]
Abstract
Recent advances in the design of convolutional neural networks have shown that performance can be enhanced by improving the ability to represent multi-scale features. However, most existing methods either focus on designing more sophisticated attention modules, which leads to higher computational costs, or fail to effectively establish long-range channel dependencies, or neglect the extraction and utilization of structural information. This work introduces a novel module, the Multi-Branch Concatenation (MBC), designed to process input tensors and extract multi-scale feature maps. The MBC module introduces new degrees of freedom (DoF) in the design of attention networks by allowing for flexible adjustments to the types of transformation operators and the number of branches. This study considers two key transformation operators: multiplexing and splitting, both of which facilitate a more granular representation of multi-scale features and enhance the receptive field range. By integrating the MBC with an attention module, a Multi-Branch Attention (MBA) module is developed to capture channel-wise interactions within feature maps, thereby establishing long-range channel dependencies. Replacing the 3 × 3 convolutions in the bottleneck blocks of ResNet with the proposed MBA yields a new block, the Efficient Multi-Branch Attention (EMBA), which can be seamlessly integrated into state-of-the-art backbone CNN models. Furthermore, a new backbone network, named EMBANet, is constructed by stacking EMBA blocks. The proposed EMBANet has been thoroughly evaluated across various computer vision tasks, including classification, detection, and segmentation, consistently demonstrating superior performance compared to popular backbones.
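To make the multi-branch idea concrete, here is a minimal PyTorch sketch. It is an illustrative reading of the MBC-plus-attention design rather than the released EMBANet code; the kernel sizes, branch widths, and SE-style attention are assumptions:

```python
# Illustrative reading of the MBC-plus-attention design, not the
# released EMBANet code: parallel branches with different kernel sizes
# are concatenated, then reweighted by SE-style channel attention.
import torch
import torch.nn as nn

class MultiBranchAttention(nn.Module):
    def __init__(self, channels: int, kernel_sizes=(3, 5, 7, 9)):
        super().__init__()
        c = channels // len(kernel_sizes)          # per-branch output width
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, c, k, padding=k // 2) for k in kernel_sizes]
        )
        self.se = nn.Sequential(                   # channel attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        y = torch.cat([b(x) for b in self.branches], dim=1)  # multi-scale concat
        return y * self.se(y)                      # long-range channel reweighting

x = torch.randn(1, 64, 32, 32)
print(MultiBranchAttention(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```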
Collapse
Affiliation(s)
- Keke Zu
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China; Shenzhen Key Laboratory of Advanced Machine Learning and Applications, Shenzhen University, Shenzhen, China.
| | - Hu Zhang
- Shenzhen Key Laboratory of Advanced Machine Learning and Applications, Shenzhen University, Shenzhen, China.
| | - Lei Zhang
- Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China.
| | - Jian Lu
- Shenzhen Key Laboratory of Advanced Machine Learning and Applications, Shenzhen University, Shenzhen, China.
| | - Chen Xu
- Shenzhen Key Laboratory of Advanced Machine Learning and Applications, Shenzhen University, Shenzhen, China.
| | - Hongyang Chen
- Research Center for Graph Computing, Zhejiang Lab, Hangzhou, China.
| | - Yu Zheng
- JD Intelligent Cities Research, Beijing, China.
| |
Collapse
|
24
|
Zhan Q, Zeng XJ, Wang Q. Reducing bias in source-free unsupervised domain adaptation for regression. Neural Netw 2025; 185:107161. [PMID: 39862532 DOI: 10.1016/j.neunet.2025.107161] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/18/2024] [Revised: 11/26/2024] [Accepted: 01/10/2025] [Indexed: 01/27/2025]
Abstract
Due to data privacy and storage concerns, Source-Free Unsupervised Domain Adaptation (SFUDA) focuses on improving an unlabelled target domain by leveraging a pre-trained source model without access to source data. While existing studies attempt to train target models by mitigating biases induced by noisy pseudo labels, they often lack theoretical guarantees for fully reducing biases and have predominantly addressed classification tasks rather than regression ones. To address these gaps, our analysis delves into the generalisation error bound of the target model, aiming to understand the intrinsic limitations of pseudo-label-based SFUDA methods. Theoretical results reveal that biases influencing generalisation error extend beyond the commonly highlighted label inconsistency bias, which denotes the mismatch between pseudo labels and ground truths, and the feature-label mapping bias, which represents the difference between the proxy target regressor and the real target regressor. Equally significant is the feature misalignment bias, indicating the misalignment between the estimated and real target feature distributions. This factor is frequently neglected or not explicitly addressed in current studies. Additionally, the label inconsistency bias can be unbounded in regression due to the continuous label space, further complicating SFUDA for regression tasks. Guided by these theoretical insights, we propose a Bias-Reduced Regression (BRR) method for SFUDA in regression. This method incorporates Feature Distribution Alignment (FDA) to reduce the feature misalignment bias, Hybrid Reliability Evaluation (HRE) to reduce the feature-label mapping bias and pseudo label updating to mitigate the label inconsistency bias. Experiments demonstrate the superior performance of the proposed BRR, and the effectiveness of FDA and HRE in reducing biases for regression tasks in SFUDA.
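As a rough illustration of the feature misalignment bias, the sketch below penalizes the gap between target batch statistics and stored source feature statistics. The moment-matching form, and the idea of reading source statistics from the frozen model's normalization layers, are assumptions rather than the paper's exact FDA:

```python
# Rough moment-matching sketch of feature distribution alignment, not
# the paper's exact FDA: penalize the gap between target batch
# statistics and source feature statistics (e.g., as recoverable from a
# frozen source model's normalization layers; an assumption here).
import torch

def feature_alignment_loss(feat_t: torch.Tensor,
                           mu_s: torch.Tensor,
                           var_s: torch.Tensor) -> torch.Tensor:
    mu_t = feat_t.mean(dim=0)                    # target batch mean
    var_t = feat_t.var(dim=0, unbiased=False)    # target batch variance
    return ((mu_t - mu_s) ** 2).mean() + ((var_t - var_s) ** 2).mean()

feat_t = torch.randn(32, 128)                    # target-domain features
mu_s, var_s = torch.zeros(128), torch.ones(128)  # stored source statistics
print(feature_alignment_loss(feat_t, mu_s, var_s).item())
```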
Collapse
Affiliation(s)
- Qianshan Zhan
- Department of Computer Science, University of Manchester, Oxford Rd, Manchester, M13 9PL, United Kingdom.
| | - Xiao-Jun Zeng
- Department of Computer Science, University of Manchester, Oxford Rd, Manchester, M13 9PL, United Kingdom.
| | - Qian Wang
- Luca Healthcare R&D, Shanghai, 200000, China.
| |
Collapse
|
25
|
Xiao Z, He B, Chen Z, Peng R, Zeng Q. SDRD-Net: A Symmetric Dual-branch Residual Dense Network for OCT and US Image Fusion. ULTRASOUND IN MEDICINE & BIOLOGY 2025; 51:884-895. [PMID: 39956705 DOI: 10.1016/j.ultrasmedbio.2025.02.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/06/2024] [Revised: 01/07/2025] [Accepted: 02/01/2025] [Indexed: 02/18/2025]
Abstract
Ultrasound (US) imaging has the advantages of no radiation, high penetration, and real-time acquisition, while optical coherence tomography (OCT) offers high resolution. Fusing endometrial images from OCT and US combines the advantages of the two modalities to obtain more complete information on endometrial thickness. To better integrate these multimodal images, we propose, to our knowledge for the first time, a Symmetric Dual-branch Residual Dense network (SDRD-Net) for OCT and US endometrial image fusion. First, Multi-scale Residual Dense Blocks (MRDB) extract shallow features from the different modalities. Then, the Base Transformer Module (BTM) and Detail Extraction Module (DEM) extract primary and advanced features. Finally, the primary and advanced features are decomposed and recombined through the Feature Fusion Module (FMM), which outputs the fused image. We conducted experiments on both private and public datasets, encompassing IVF and MIF tasks, and achieved commendable results.
Collapse
Affiliation(s)
- Zhang Xiao
- College of Mechanical Engineering, University of South China, Hengyang, Hunan, China; Key Laboratory of Medical Imaging Precision Theranostics and Radiation Protection, University of South China, College of Hunan Province, Changsha, Hunan, China
| | - Bin He
- College of Mechanical Engineering, University of South China, Hengyang, Hunan, China
| | - Zhiyi Chen
- Key Laboratory of Medical Imaging Precision Theranostics and Radiation Protection, University of South China, College of Hunan Province, Changsha, Hunan, China; Institution of Medical Imaging, Hengyang Medical School, University of South China, Hengyang, Hunan, China; Department of Medical Imaging, The Affiliated Changsha Central Hospital, Hengyang Medical School, University of South China, Changsha, Hunan, China.
| | - Rushu Peng
- College of Mechanical Engineering, University of South China, Hengyang, Hunan, China
| | - Qinghao Zeng
- College of Mechanical Engineering, University of South China, Hengyang, Hunan, China
| |
Collapse
|
26
|
Zhang Y, Li J, Ji Q, Li K, Liu L, Zheng C, Qiang W. Intervening on few-shot object detection based on the front-door criterion. Neural Netw 2025; 185:107251. [PMID: 39946764 DOI: 10.1016/j.neunet.2025.107251] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/26/2024] [Revised: 01/02/2025] [Accepted: 02/02/2025] [Indexed: 03/09/2025]
Abstract
Most few-shot object detection methods aim to utilize generalizable knowledge learned from base categories to identify instances of novel categories. The fundamental assumption of these approaches is that the model can acquire sufficient transferable knowledge through the learning of base categories. However, our motivating experiments reveal that the model overfits the data of the base categories. To analyze the impact of this phenomenon on detection from a causal perspective, we develop a Structural Causal Model involving two key variables: causal generative factors and spurious generative factors. Both variables are derived from the base categories. Generative factors are latent variables or features that are used to control image generation. Causal generative factors are general generative factors that directly influence the generation process, while spurious generative factors are specific to certain categories, namely the base categories in the problem we are analyzing. We recognize that the essence of few-shot object detection methods lies in modeling the statistical dependence between novel object instances and their corresponding categories determined by the causal generative factors, while the set of spurious generative factors serves as a confounder in the modeling process. To mitigate the misleading impact of the spurious generative factors, we propose the Front-door Regulator guided by the front-door criterion. The Front-door Regulator consists of two plug-and-play regularization terms, namely Semantic Grouping and Semantic Decoupling. We substantiate the effectiveness of our proposed method through experiments conducted on multiple benchmark datasets.
Collapse
Affiliation(s)
- Yanan Zhang
- University of Chinese Academy of Sciences, Beijing, China; National Key Laboratory of Space Integrated Information System, Institute of Software, Chinese Academy of Sciences, Beijing, China
| | - Jiangmeng Li
- National Key Laboratory of Space Integrated Information System, Institute of Software, Chinese Academy of Sciences, Beijing, China
| | - Qirui Ji
- University of Chinese Academy of Sciences, Beijing, China; National Key Laboratory of Space Integrated Information System, Institute of Software, Chinese Academy of Sciences, Beijing, China
| | - Kai Li
- National Key Laboratory of Space Integrated Information System, Institute of Software, Chinese Academy of Sciences, Beijing, China
| | - Lixiang Liu
- University of Chinese Academy of Sciences, Beijing, China; National Key Laboratory of Space Integrated Information System, Institute of Software, Chinese Academy of Sciences, Beijing, China
| | - Changwen Zheng
- University of Chinese Academy of Sciences, Beijing, China; National Key Laboratory of Space Integrated Information System, Institute of Software, Chinese Academy of Sciences, Beijing, China
| | - Wenwen Qiang
- National Key Laboratory of Space Integrated Information System, Institute of Software, Chinese Academy of Sciences, Beijing, China.
| |
Collapse
|
27
|
Gao L, Liu K, Guan L. A discriminative multi-modal adaptation neural network model for video action recognition. Neural Netw 2025; 185:107114. [PMID: 39827837 DOI: 10.1016/j.neunet.2024.107114] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/13/2024] [Revised: 12/10/2024] [Accepted: 12/28/2024] [Indexed: 01/22/2025]
Abstract
Research on video-based understanding and learning has attracted widespread interest and has been adopted in various real applications, such as e-healthcare, action recognition, and affective computing. Amongst them, video-based action recognition is one of the most representative examples. With the advancement of multi-sensory technology, action recognition using multi-modal data has recently drawn wide attention. However, the research community faces new challenges in effectively exploring and utilizing the discriminative and complementary information across different modalities. Although score-level fusion approaches have been popularly employed for multi-modal action recognition, they simply add the scores derived separately from different modalities without proper consideration of cross-modality semantics amongst multiple input data sources, invariably causing sub-optimal performance. To address this issue, this paper presents a two-stream heterogeneous network to extract and jointly process complementary features derived from RGB and skeleton modalities, respectively. Then, a discriminative multi-modal adaptation neural network model (DMANNM) is proposed and applied to the heterogeneous network, by integrating statistical machine learning (SML) principles with convolutional neural network (CNN) architecture. In addition, to achieve high recognition accuracy with the generated multi-modal structure, an effective nonlinear classification algorithm is presented in this work. Leveraging the joint strength of SML and CNN architecture, the proposed model forms an adaptive platform for handling datasets of different scales. To demonstrate the effectiveness and the generic nature of the proposed model, we conducted experiments on four popular video-based action recognition datasets of different scales: NTU RGB+D, NTU RGB+D 120, Northwestern-UCLA (N-UCLA), and SYSU. The experimental results show the superiority of the proposed method over the state-of-the-art methods compared.
Collapse
Affiliation(s)
- Lei Gao
- Department of Electrical, Computer and Biomedical Engineering, Toronto Metropolitan University, Toronto, Canada
| | - Kai Liu
- Department of Electrical, Computer and Biomedical Engineering, Toronto Metropolitan University, Toronto, Canada.
| | - Ling Guan
- Department of Electrical, Computer and Biomedical Engineering, Toronto Metropolitan University, Toronto, Canada
| |
Collapse
|
28
|
Yang H, Xu Y, Liu X. DKiS: Decay weight invertible image steganography with private key. Neural Netw 2025; 185:107148. [PMID: 39827833 DOI: 10.1016/j.neunet.2025.107148] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/13/2024] [Revised: 12/12/2024] [Accepted: 01/10/2025] [Indexed: 01/22/2025]
Abstract
Image steganography is the practice of concealing information within another image. In this paper, we propose decay weight invertible image steganography with private key (DKiS). This model introduces two major advancements into current invertible image steganography: (1) Decay Weight Mechanism: for the first time, we introduce a decay weight mechanism to regulate the transfer of non-essential or 'garbage' information from the secret to the host pipeline. This effectively filters out irrelevant data, enhancing the performance of image steganography. (2) Preset Private Key Integration: we incorporate a preset private key into high-capacity image steganography for the first time, strengthening the security of hidden information. Access to the concealed data requires the corresponding preset private key, effectively addressing security challenges when the model becomes publicly known or is subject to attack. Experimental results demonstrate the effectiveness of our model, highlighting its robustness and practical applicability in real-world scenarios. The code for this model is publicly accessible at https://github.com/yanghangAI/DKiS, and a practical demonstration can be found at http://yanghang.site/hidekey/.
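Both mechanisms can be sketched in a few lines. In the toy coupling step below, the secret-to-host transfer is scaled by a decay weight, and a channel permutation is derived from a preset private key; the subnetwork, tensor shapes, and the permutation-as-key construction are illustrative assumptions, not the DKiS architecture:

```python
# Toy sketch, not the DKiS architecture: an invertible coupling step
# whose secret-to-host transfer is scaled by a decay weight, plus a
# channel permutation derived from a preset private key.
import torch
import torch.nn as nn

class DecayCoupling(nn.Module):
    def __init__(self, channels: int, decay: float):
        super().__init__()
        self.decay = decay                                    # w in (0, 1]
        self.f = nn.Conv2d(channels, channels, 3, padding=1)  # toy subnetwork

    def forward(self, host, secret):
        return host + self.decay * self.f(secret), secret

    def inverse(self, host, secret):
        return host - self.decay * self.f(secret), secret

def key_permutation(channels: int, key: int) -> torch.Tensor:
    g = torch.Generator().manual_seed(key)      # derive permutation from key
    return torch.randperm(channels, generator=g)

host, secret = torch.randn(1, 8, 16, 16), torch.randn(1, 8, 16, 16)
secret = secret[:, key_permutation(8, key=1234)]  # key-dependent shuffle
step = DecayCoupling(8, decay=0.8)
h2, s2 = step(host, secret)
h0, _ = step.inverse(h2, s2)
print(torch.allclose(h0, host, atol=1e-6))        # True: the step is invertible
```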
Collapse
Affiliation(s)
- Hang Yang
- College of Science, China Agricultural University, Beijing 100083, China
| | - Yitian Xu
- College of Science, China Agricultural University, Beijing 100083, China.
| | - Xuhua Liu
- College of Science, China Agricultural University, Beijing 100083, China.
| |
Collapse
|
29
|
Dou Z, Ren H, Ma Y, Gao Y, Huang G, Ma X. One-step Multi-view Spectral Clustering with Subspaces Fusion on Grassmann manifold. Neurocomputing 2025; 626:129568. [DOI: 10.1016/j.neucom.2025.129568] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/02/2025]
|
30
|
Cai X, Chen MS, Wang CD, Zhang H. Motif-aware curriculum learning for node classification. Neural Netw 2025; 184:107089. [PMID: 39756117 DOI: 10.1016/j.neunet.2024.107089] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/25/2024] [Revised: 12/19/2024] [Accepted: 12/21/2024] [Indexed: 01/07/2025]
Abstract
Node classification, which seeks to predict the categories of unlabeled nodes, is a crucial task in graph learning. Graph Neural Networks (GNNs) are currently among the most popular methods for node classification. However, conventional GNNs assign equal importance to all training nodes, which can reduce accuracy and robustness due to the influence of complex node information. In light of the potential benefits of curriculum learning, some studies have proposed incorporating curriculum learning into GNNs, so that node information can be acquired in an orderly manner. Nevertheless, existing curriculum learning-based node classification methods fail to consider subgraph structural information. To address this issue, we propose a novel approach, Motif-aware Curriculum Learning for Node Classification (MACL). It emphasizes the role of motif structures within graphs to fully utilize subgraph information and measure the quality of nodes, supporting an organized learning process for GNNs. Specifically, we design a motif-aware difficulty measurer to evaluate the difficulty of training nodes from different perspectives. Furthermore, we implement a training scheduler to introduce appropriate training nodes to the GNNs at suitable times. We conduct extensive experiments on five representative datasets. The results show that incorporating MACL into GNNs can improve accuracy.
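The training scheduler can be pictured as a pacing function over difficulty-sorted nodes. The sketch below is a generic curriculum pacing rule, not MACL's scheduler; the starting fraction and linear growth are assumptions:

```python
# Generic curriculum pacing rule, not MACL's scheduler: start from the
# easiest fraction of nodes and grow linearly to the full training set.
import numpy as np

def pacing_schedule(difficulty: np.ndarray, epoch: int, total_epochs: int,
                    frac0: float = 0.25) -> np.ndarray:
    """Return indices of the training nodes visible at `epoch`."""
    frac = min(1.0, frac0 + (1.0 - frac0) * epoch / max(1, total_epochs - 1))
    order = np.argsort(difficulty)          # easy -> hard
    return order[:max(1, int(frac * len(order)))]

diff = np.random.rand(100)                  # e.g. motif-based difficulty scores
print(len(pacing_schedule(diff, epoch=0, total_epochs=50)))   # 25
print(len(pacing_schedule(diff, epoch=49, total_epochs=50)))  # 100
```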
Collapse
Affiliation(s)
- Xiaosha Cai
- School of Mathematics (Zhuhai), Sun Yat-sen University, Zhuhai 519082, China.
| | - Man-Sheng Chen
- School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou, Guangdong 510275, China.
| | - Chang-Dong Wang
- School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou, Guangdong 510275, China; Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou, 510006, China.
| | - Haizhang Zhang
- School of Mathematics (Zhuhai), Sun Yat-sen University, Zhuhai 519082, China.
| |
Collapse
|
31
|
Zhang P, Liu Y, Lai S, Li H, Jin L. Privacy-Preserving Biometric Verification With Handwritten Random Digit String. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2025; 47:3049-3066. [PMID: 40031072 DOI: 10.1109/tpami.2025.3529022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/05/2025]
Abstract
Handwriting verification has stood as a steadfast identity authentication method for decades. However, this technique risks potential privacy breaches due to the inclusion of personal information in handwritten biometrics such as signatures. To address this concern, we propose using the Random Digit String (RDS) for privacy-preserving handwriting verification. This approach allows users to authenticate themselves by writing an arbitrary digit sequence, effectively ensuring privacy protection. To evaluate the effectiveness of RDS, we construct a new HRDS4BV dataset composed of online naturally handwritten RDS. Unlike conventional handwriting, RDS encompasses unconstrained and variable content, posing significant challenges for modeling consistent personal writing style. To surmount this, we propose the Pattern Attentive VErification Network (PAVENet), along with a Discriminative Pattern Mining (DPM) module. DPM adaptively enhances the recognition of consistent and discriminative writing patterns, thus refining handwriting style representation. Through comprehensive evaluations, we scrutinize the applicability of online RDS verification and showcase a pronounced outperformance of our model over existing methods. Furthermore, we discover a noteworthy forgery phenomenon that deviates from prior findings and discuss its positive impact in countering malicious impostor attacks. More broadly, our work underscores the feasibility of privacy-preserving biometric verification and propels the prospects of its broader acceptance and application.
Collapse
|
32
|
Liu Y, Huang M, Yan H, Deng L, Wu W, Lu H, Shen C, Jin L, Bai X. VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-Domain Generalization. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2025; 47:2957-2972. [PMID: 40031074 DOI: 10.1109/tpami.2025.3528950] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/05/2025]
Abstract
Text spotting, a task involving the extraction of textual information from image or video sequences, faces challenges in cross-domain adaptation, such as image-to-image and image-to-video generalization. In this paper, we introduce a new method, termed VimTS, which enhances the generalization ability of the model by achieving better synergy among different tasks. Specifically, we propose a Prompt Queries Generation Module and a Tasks-aware Adapter to effectively convert the original single-task model into a multi-task model suitable for both image and video scenarios with minimal additional parameters. The Prompt Queries Generation Module facilitates explicit interaction between different tasks, while the Tasks-aware Adapter helps the model dynamically learn suitable features for each task. Additionally, to further enable the model to learn temporal information at a lower cost, we propose a synthetic video text dataset (VTD-368k) by leveraging the Content Deformation Fields (CoDeF) algorithm. Notably, our method outperforms the state-of-the-art method by an average of 2.6% on six cross-domain benchmarks such as TT-to-IC15, CTW1500-to-TT, and TT-to-CTW1500. For video-level cross-domain adaptation, our method even surpasses the previous end-to-end video spotting method on ICDAR2015 video and DSText v2 by an average of 5.5% on the MOTA metric, using only image-level data. We further demonstrate that existing Large Multimodal Models exhibit limitations in cross-domain scene text spotting, in contrast to our VimTS model, which requires significantly fewer parameters and less data.
Collapse
|
33
|
Liao X, Wei X, Zhou M, Wong HS, Kwong S. Image Quality Assessment: Exploring Joint Degradation Effect of Deep Network Features via Kernel Representation Similarity Analysis. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2025; 47:2799-2815. [PMID: 40031058 DOI: 10.1109/tpami.2025.3527004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/05/2025]
Abstract
Typically, deep network-based full-reference image quality assessment (FR-IQA) models compare deep features from reference and distorted images pairwise, overlooking correlations among features from the same source. We propose a dual-branch framework to capture the joint degradation effect among deep network features. The first branch uses kernel representation similarity analysis (KRSA), which compares feature self-similarity matrices via the mean absolute error (MAE). The second branch conducts pairwise comparisons via the MAE, and a training-free logarithmic summation of both branches derives the final score. Our approach contributes in three ways. First, integrating the KRSA with pairwise comparisons enhances the model's perceptual awareness. Second, our approach is adaptable to diverse network architectures. Third, our approach can guide perceptual image enhancement. Extensive experiments on 10 datasets validate our method's efficacy, demonstrating that perceptual deformation widely exists in diverse IQA scenarios and that measuring the joint degradation effect can discern appealing content deformations.
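The dual-branch score admits a compact toy version. In the sketch below (an interpretation with assumed details, not the authors' code), the KRSA branch compares cosine-kernel self-similarity matrices via the MAE, the second branch compares features pairwise, and a training-free logarithmic summation combines the two:

```python
# Compact toy version of the dual-branch score (assumed details, not
# the authors' code): KRSA branch over self-similarity matrices, a
# pairwise branch, and a training-free log-sum combination.
import torch
import torch.nn.functional as F

def krsa_branch(feat_ref: torch.Tensor, feat_dst: torch.Tensor) -> torch.Tensor:
    def gram(f):                         # f: (N, D) features of one image
        f = F.normalize(f, dim=1)
        return f @ f.t()                 # (N, N) self-similarity matrix
    return (gram(feat_ref) - gram(feat_dst)).abs().mean()

def pairwise_branch(feat_ref, feat_dst):
    return (feat_ref - feat_dst).abs().mean()

def quality_score(feat_ref, feat_dst, eps: float = 1e-8):
    return torch.log(krsa_branch(feat_ref, feat_dst) + eps) + \
           torch.log(pairwise_branch(feat_ref, feat_dst) + eps)

r, d = torch.randn(64, 512), torch.randn(64, 512)
print(quality_score(r, d).item())        # lower means less degradation
```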
Collapse
|
34
|
Ning Z, Yang B, Wang Y, Shi Z, Yu J, Wu G. Dual-path neural network extracts tumor microenvironment information from whole slide images to predict molecular typing and prognosis of Glioma. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2025; 261:108580. [PMID: 39809091 DOI: 10.1016/j.cmpb.2024.108580] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/29/2024] [Revised: 12/28/2024] [Accepted: 12/29/2024] [Indexed: 01/16/2025]
Abstract
BACKGROUND AND OBJECTIVE Utilizing AI to mine tumor microenvironment information in whole slide images (WSIs) for glioma molecular subtyping and prognosis prediction is significant for treatment. Existing weakly-supervised learning frameworks based on multi-instance learning have potential in WSI analysis, but the large number of patches from WSIs challenges the effective extraction of key local patch information and neighboring patch microenvironment information. Therefore, this paper aims to develop an automatic neural network that effectively extracts tumor microenvironment information from WSIs to predict molecular typing and prognosis of glioma. METHODS In this paper, we propose a dual-path pathology analysis (DPPA) framework to enhance the analysis ability of WSIs for glioma diagnosis. First, to mitigate the impact of redundant patches and enhance the integration of salient patch information within a multi-instance learning context, we propose a two-stage attention-based dynamic multi-instance learning network. In the network, two-stage attention and dynamic random sampling are designed to integrate diverse image patch information from pivotal regions adaptively. Second, to unearth the wealth of spatial context inherent in WSIs, we build a spatial relationship information quantification module. This module captures the spatial distribution of patches that encompass a variety of tissue structures, shedding light on the tumor microenvironment. RESULTS A large number of experiments on three datasets, two in-house and one public, totaling 1,795 WSIs, demonstrate the encouraging performance of the DPPA, with mean areas under the curve of 0.94, 0.85, and 0.88 in predicting Isocitrate Dehydrogenase 1, Telomerase Reverse Transcriptase, and 1p/19q status, respectively, and a mean C-index of 0.82 in prognosis prediction. The proposed model can also stratify tumors within existing tumor subgroups into good and poor prognosis groups, with P < 0.05 on the log-rank test. CONCLUSIONS The results of multi-center experiments demonstrate that the proposed DPPA surpasses the state-of-the-art models across multiple metrics. Ablation experiments and survival analysis further validate the outstanding analytical ability of this model. Meanwhile, the interpretability analyses strongly confirm the reliability and validity of the model. All source codes are released at: https://github.com/nzehang97/DPPA.
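The attention-based multi-instance aggregation at the core of such WSI pipelines can be sketched briefly. The module below follows the familiar attention-MIL pooling pattern of Ilse et al. as a stand-in for the paper's two-stage attention design; the layer sizes and names are assumptions:

```python
# Attention-based MIL pooling in the style of Ilse et al., shown as a
# stand-in for the paper's two-stage attention design.
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    def __init__(self, dim: int, hidden: int = 128, n_classes: int = 2):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 1))
        self.head = nn.Linear(dim, n_classes)

    def forward(self, patches: torch.Tensor):
        # patches: (num_patches, dim) embeddings from one slide
        a = torch.softmax(self.attn(patches), dim=0)  # patch importance
        slide_feat = (a * patches).sum(dim=0)         # attention-weighted bag
        return self.head(slide_feat), a

logits, attn = AttentionMIL(512)(torch.randn(1000, 512))
print(logits.shape, attn.shape)  # torch.Size([2]) torch.Size([1000, 1])
```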
Collapse
Affiliation(s)
- Zehang Ning
- School of Information Science and Technology, Fudan University, Shanghai, 200433, China; Key Laboratory of Medical Imaging, Computing and Computer Assisted Intervention, Shanghai, 200433, China
| | - Bojie Yang
- Department of Neurosurgery, Huashan Hospital, Fudan University, Shanghai, 200433, China; AI Lab of Department of Neurosurgery, Huashan Hospital, Fudan University, Shanghai, 200433, China
| | - Yuanyuan Wang
- School of Information Science and Technology, Fudan University, Shanghai, 200433, China; Key Laboratory of Medical Imaging, Computing and Computer Assisted Intervention, Shanghai, 200433, China; AI Lab of Department of Neurosurgery, Huashan Hospital, Fudan University, Shanghai, 200433, China
| | - Zhifeng Shi
- Department of Neurosurgery, Huashan Hospital, Fudan University, Shanghai, 200433, China; AI Lab of Department of Neurosurgery, Huashan Hospital, Fudan University, Shanghai, 200433, China.
| | - Jinhua Yu
- School of Information Science and Technology, Fudan University, Shanghai, 200433, China; Key Laboratory of Medical Imaging, Computing and Computer Assisted Intervention, Shanghai, 200433, China; AI Lab of Department of Neurosurgery, Huashan Hospital, Fudan University, Shanghai, 200433, China.
| | - Guoqing Wu
- School of Information Science and Technology, Fudan University, Shanghai, 200433, China; Key Laboratory of Medical Imaging, Computing and Computer Assisted Intervention, Shanghai, 200433, China.
| |
Collapse
|
35
|
Zhang Z, Zhang J, Mai W. VPT: Video portraits transformer for realistic talking face generation. Neural Netw 2025; 184:107122. [PMID: 39799718 DOI: 10.1016/j.neunet.2025.107122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/13/2024] [Revised: 12/09/2024] [Accepted: 01/02/2025] [Indexed: 01/15/2025]
Abstract
Talking face generation is a promising approach within various domains, such as digital assistants, video editing, and virtual video conferences. Previous works on audio-driven talking faces focused primarily on the synchronization between audio and video. However, existing methods still have certain limitations in synthesizing photo-realistic video with high identity preservation, audiovisual synchronization, and facial details like blink movements. To solve these problems, a novel talking face generation framework, termed video portraits transformer (VPT), with controllable blink movements is proposed and applied. It separates the process of video generation into two stages, i.e., an audio-to-landmark stage and a landmark-to-face stage. In the audio-to-landmark stage, a transformer encoder serves as the generator, predicting whole facial landmarks from the given audio and a continuous eye aspect ratio (EAR) signal. During the landmark-to-face stage, the video-to-video (vid-to-vid) network is employed to transfer landmarks into realistic talking face videos. Moreover, to imitate real blink movements during inference, a transformer-based spontaneous blink generation module is devised to generate the EAR sequence. Extensive experiments demonstrate that the VPT method can produce photo-realistic videos of talking faces with natural blink movements, and the spontaneous blink generation module can generate blink movements close to the real blink duration distribution and frequency.
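The EAR signal that drives the blink control is a standard landmark-based quantity. For reference, the snippet below computes the usual definition of Soukupová and Čech (2016) over six eye landmarks; whether VPT uses exactly this landmark layout is an assumption on our part:

```python
# Standard eye aspect ratio over six eye landmarks p1..p6:
# EAR = (|p2-p6| + |p3-p5|) / (2 |p1-p4|); small values indicate a blink.
import numpy as np

def eye_aspect_ratio(p: np.ndarray) -> float:
    v1 = np.linalg.norm(p[1] - p[5])   # first vertical distance
    v2 = np.linalg.norm(p[2] - p[4])   # second vertical distance
    h = np.linalg.norm(p[0] - p[3])    # horizontal distance
    return float((v1 + v2) / (2.0 * h))

open_eye = np.array([[0, 0], [1, 1], [2, 1], [3, 0], [2, -1], [1, -1]], float)
print(round(eye_aspect_ratio(open_eye), 3))  # 0.667; near 0 when the eye closes
```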
Collapse
Affiliation(s)
- Zhijun Zhang
- School of Automation Science and Engineering, South China University of Technology, China; Key Laboratory of Autonomous Systems and Network Control, Ministry of Education, China; Jiangxi Thousand Talents Plan, Nanchang University, China; College of Computer Science and Engineering, Jishou University, China; Guangdong Artificial Intelligence and Digital Economy Laboratory (Pazhou Lab), China; Shaanxi Provincial Key Laboratory of Industrial Automation, School of Mechanical Engineering, Shaanxi University of Technology, Hanzhong, China; School of Information Science and Engineering, Changsha Normal University, Changsha, China; School of Automation Science and Engineering, and also with the Institute of Artificial Intelligence and Automation, Guangdong University of Petrochemical Technology, Maoming, China; Key Laboratory of Large-Model Embodied-Intelligent Humanoid Robot (2024KSYS004), China; The Institute for Super Robotics (Huangpu), Guangzhou, China.
| | - Jian Zhang
- School of Automation Science and Engineering, South China University of Technology, China; The Institute for Super Robotics (Huangpu), Guangzhou, China.
| | - Weijian Mai
- School of Automation Science and Engineering, South China University of Technology, China.
| |
Collapse
|
36
|
Yue Z, Shi M. Enhancing space-time video super-resolution via spatial-temporal feature interaction. Neural Netw 2025; 184:107033. [PMID: 39705772 DOI: 10.1016/j.neunet.2024.107033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/21/2023] [Revised: 08/05/2024] [Accepted: 12/06/2024] [Indexed: 12/23/2024]
Abstract
The target of space-time video super-resolution (STVSR) is to increase both the frame rate (also referred to as the temporal resolution) and the spatial resolution of a given video. Recent approaches solve STVSR using end-to-end deep neural networks. A popular solution is to first increase the frame rate of the video, then perform feature refinement among different frame features, and finally increase the spatial resolutions of these features. The temporal correlation among features of different frames is carefully exploited in this process. The spatial correlation among features of different resolutions, though equally important, has received far less attention. In this paper, we propose a spatial-temporal feature interaction network to enhance STVSR by exploiting both spatial and temporal correlations among features of different frames and spatial resolutions. Specifically, the spatial-temporal frame interpolation module is introduced to interpolate low- and high-resolution intermediate frame features simultaneously and interactively. The spatial-temporal local and global refinement modules are respectively deployed afterwards to exploit the spatial-temporal correlation among different features for their refinement. Finally, a novel motion consistency loss is employed to enhance the motion continuity among reconstructed frames. We conduct experiments on three standard benchmarks, Vid4, Vimeo-90K and Adobe240, and the results demonstrate that our method improves the state-of-the-art methods by a considerable margin. Our codes will be available at https://github.com/yuezijie/STINet-Space-time-Video-Super-resolution.
Collapse
Affiliation(s)
- Zijie Yue
- College of Electronic and Information Engineering, Tongji University, China
| | - Miaojing Shi
- College of Electronic and Information Engineering, Tongji University, China; Shanghai Institute of Intelligent Science and Technology, Tongji University, China.
| |
Collapse
|
37
|
Adler TJ, Nölke JH, Reinke A, Tizabi MD, Gruber S, Trofimova D, Ardizzone L, Jaeger PF, Buettner F, Köthe U, Maier-Hein L. Application-driven validation of posteriors in inverse problems. Med Image Anal 2025; 101:103474. [PMID: 39892221 DOI: 10.1016/j.media.2025.103474] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2024] [Revised: 01/13/2025] [Accepted: 01/15/2025] [Indexed: 02/03/2025]
Abstract
Current deep learning-based solutions for image analysis tasks are commonly incapable of handling problems for which multiple plausible solutions exist. In response, posterior-based methods such as conditional Diffusion Models and Invertible Neural Networks have emerged; however, their translation is hampered by a lack of research on adequate validation. In other words, the way progress is measured often does not reflect the needs of the driving practical application. Closing this gap in the literature, we present the first systematic framework for the application-driven validation of posterior-based methods in inverse problems. As a methodological novelty, it adopts key principles from the field of object detection validation, which has a long history of addressing the question of how to locate and match multiple object instances in an image. Treating modes as instances enables us to perform mode-centric validation, using well-interpretable metrics from the application perspective. We demonstrate the value of our framework through instantiations for a synthetic toy example and two medical vision use cases: pose estimation in surgery and imaging-based quantification of functional tissue parameters for diagnostics. Our framework offers key advantages over common approaches to posterior validation in all three examples and could thus revolutionize performance assessment in inverse problems.
Collapse
Affiliation(s)
- Tim J Adler
- German Cancer Research Center (DKFZ) Heidelberg, Division of Intelligent Medical Systems (IMSY), Heidelberg, Germany
| | - Jan-Hinrich Nölke
- German Cancer Research Center (DKFZ) Heidelberg, Division of Intelligent Medical Systems (IMSY), Heidelberg, Germany; Faculty of Mathematics and Computer Science, Heidelberg University, Heidelberg, Germany.
| | - Annika Reinke
- German Cancer Research Center (DKFZ) Heidelberg, Division of Intelligent Medical Systems (IMSY), Heidelberg, Germany; German Cancer Research Center (DKFZ) Heidelberg, HI Helmholtz Imaging, Heidelberg, Germany
| | - Minu Dietlinde Tizabi
- German Cancer Research Center (DKFZ) Heidelberg, Division of Intelligent Medical Systems (IMSY), Heidelberg, Germany; National Center for Tumor Diseases (NCT), NCT Heidelberg, a partnership between DKFZ and University Medical Center Heidelberg, Heidelberg, Germany
| | - Sebastian Gruber
- German Cancer Research Center (DKFZ) Heidelberg, Division of Intelligent Medical Systems (IMSY), Heidelberg, Germany
| | - Dasha Trofimova
- German Cancer Research Center (DKFZ) Heidelberg, Division of Intelligent Medical Systems (IMSY), Heidelberg, Germany
| | - Lynton Ardizzone
- Visual Learning Lab, Interdisciplinary Center for Scientific Computing (IWR), Heidelberg, Germany
| | - Paul F Jaeger
- German Cancer Research Center (DKFZ) Heidelberg, HI Helmholtz Imaging, Heidelberg, Germany; German Cancer Research Center (DKFZ) Heidelberg, Interactive Machine Learning Group, Heidelberg, Germany
| | - Florian Buettner
- Department of Informatics, Goethe University Frankfurt, Frankfurt, Germany; Department of Medicine, Goethe University Frankfurt, Frankfurt, Germany; German Cancer Consortium (DKTK), partner site Frankfurt, a partnership between DKFZ and UCT Frankfurt-Marburg, Frankfurt, Germany; German Cancer Research Center (DKFZ), Heidelberg, Germany; Frankfurt Cancer Institute, Frankfurt, Germany
| | - Ullrich Köthe
- Visual Learning Lab, Interdisciplinary Center for Scientific Computing (IWR), Heidelberg, Germany
| | - Lena Maier-Hein
- German Cancer Research Center (DKFZ) Heidelberg, Division of Intelligent Medical Systems (IMSY), Heidelberg, Germany; Faculty of Mathematics and Computer Science, Heidelberg University, Heidelberg, Germany; German Cancer Research Center (DKFZ) Heidelberg, HI Helmholtz Imaging, Heidelberg, Germany; National Center for Tumor Diseases (NCT), NCT Heidelberg, a partnership between DKFZ and University Medical Center Heidelberg, Heidelberg, Germany.
| |
Collapse
|
38
|
Ji J, Feng S. Anchors Crash Tensor: Efficient and Scalable Tensorial Multi-View Subspace Clustering. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2025; 47:2660-2675. [PMID: 40031059 DOI: 10.1109/tpami.2025.3526790] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/05/2025]
Abstract
Tensorial Multi-view Clustering (TMC), a prominent approach in multi-view clustering, leverages low-rank tensor learning to capture high-order correlation among views for consistent clustering structure identification. Despite its promising performance, TMC algorithms face three key challenges: (1) a severe computational burden makes it difficult for TMC methods to handle large-scale datasets; (2) the convex surrogate of the tensor rank introduces an estimation bias; and (3) consistency and complementarity are not explicitly balanced. Being aware of these, we propose a basic framework, Efficient and Scalable Tensorial Multi-View Subspace Clustering (ESTMC), for large-scale multi-view clustering. ESTMC integrates anchor representation learning and non-convex function-based low-rank tensor learning with a Generalized Non-convex Tensor Rank (GNTR) into a unified objective function, which enhances the efficiency of the existing subspace-based TMC framework. Furthermore, a novel model, ESTMC-C, with the proposed Enhanced Tensor Rank (ETR), Consistent Geometric Regularization (CGR), and Tensorial Exclusive Regularization (TER), is extended to balance the learning of consistency and complementarity among views, delivering divisible representations for the clustering task. Efficient iterative optimization algorithms are designed to solve the proposed ESTMC and ESTMC-C, which enjoy economical time complexity and exhibit theoretical convergence. Extensive experimental results on various datasets demonstrate the superiority of the proposed algorithms as compared to state-of-the-art methods.
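The anchor representation behind the efficiency gain can be sketched simply: each of the n samples is described by its similarities to m << n sampled anchors, so downstream tensor learning scales with m rather than n. The Gaussian similarity and uniform anchor sampling below are illustrative choices, not the paper's optimization:

```python
# Simple sketch of anchor representation (illustrative choices, not the
# paper's optimization): describe n samples by similarities to m anchors.
import numpy as np

def anchor_representation(X: np.ndarray, m: int, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    anchors = X[rng.choice(len(X), size=m, replace=False)]     # (m, d)
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)  # (n, m)
    Z = np.exp(-d2 / d2.mean())                                # Gaussian sim
    return Z / Z.sum(axis=1, keepdims=True)                    # row-stochastic

X = np.random.rand(1000, 20)                 # one view with 1000 samples
print(anchor_representation(X, m=50).shape)  # (1000, 50), not (1000, 1000)
```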
Collapse
|
39
|
Sun L, Gehrig D, Sakaridis C, Gehrig M, Liang J, Sun P, Xu Z, Wang K, Van Gool L, Scaramuzza D. A Unified Framework for Event-Based Frame Interpolation With Ad-Hoc Deblurring in the Wild. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2025; 47:2265-2279. [PMID: 40030478 DOI: 10.1109/tpami.2024.3510690] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/05/2025]
Abstract
Effective video frame interpolation hinges on the adept handling of motion in the input scene. Prior work acknowledges asynchronous event information for this, but often overlooks whether motion induces blur in the video, limiting its scope to sharp frame interpolation. We instead propose a unified framework for event-based frame interpolation that performs deblurring ad-hoc and thus works on both sharp and blurry input videos. Our model consists of a bidirectional recurrent network that incorporates the temporal dimension of interpolation and fuses information from the input frames and the events adaptively based on their temporal proximity. To enhance the generalization from synthetic data to real event cameras, we integrate a self-supervised framework with the proposed model, improving performance on real-world datasets in the wild. At the dataset level, we introduce a novel real-world high-resolution dataset with events and color videos, named HighREV, which provides a challenging evaluation setting for the examined task. Extensive experiments show that our network consistently outperforms previous state-of-the-art methods on frame interpolation, single image deblurring, and the joint task of both. Experiments on domain transfer reveal that self-supervised training effectively mitigates the performance degradation observed when transitioning from synthetic data to real-world data. Code and datasets are available at https://github.com/AHupuJR/REFID.
Collapse
|
40
|
Lin M, Liu J, Zhang C, Zhao Z, He C, Yu L. Non-Uniform Exposure Imaging via Neuromorphic Shutter Control. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2025; 47:2770-2784. [PMID: 40031061 DOI: 10.1109/tpami.2025.3526280] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/05/2025]
Abstract
By leveraging the blur-noise trade-off, imaging with non-uniform exposures largely extends the image acquisition flexibility in harsh environments. However, the limitation of conventional cameras in perceiving intra-frame dynamic information prevents existing methods from being implemented in the real-world frame acquisition for real-time adaptive camera shutter control. To address this challenge, we propose a novel Neuromorphic Shutter Control (NSC) system to avoid motion blur and alleviate instant noise, where the extremely low latency of events is leveraged to monitor the real-time motion and facilitate the scene-adaptive exposure. Furthermore, to stabilize the inconsistent Signal-to-Noise Ratio (SNR) caused by the non-uniform exposure times, we propose an event-based image denoising network within a self-supervised learning paradigm, i.e., SEID, exploring the statistics of image noise and inter-frame motion information of events to obtain artificial supervision signals for high-quality imaging in real-world scenes. To illustrate the effectiveness of the proposed NSC, we implement it in hardware by building a hybrid-camera imaging prototype system, with which we collect a real-world dataset containing well-synchronized frames and events in diverse scenarios with different target scenes and motion patterns. Experiments on the synthetic and real-world datasets demonstrate the superiority of our method over state-of-the-art approaches.
Collapse
|
41
|
Oh K, Heo DW, Mulyadi AW, Jung W, Kang E, Lee KH, Suk HI. A quantitatively interpretable model for Alzheimer's disease prediction using deep counterfactuals. Neuroimage 2025; 309:121077. [PMID: 39954872 DOI: 10.1016/j.neuroimage.2025.121077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/02/2024] [Revised: 01/19/2025] [Accepted: 02/05/2025] [Indexed: 02/17/2025] Open
Abstract
Deep learning (DL) for predicting Alzheimer's disease (AD) has enabled timely intervention in disease progression, yet it still demands careful interpretability to explain how DL models reach their definitive decisions. Counterfactual reasoning has recently gained increasing attention in medical research because of its ability to provide a refined visual explanatory map. However, such visual explanatory maps based on visual inspection alone are insufficient unless their medical or neuroscientific validity is demonstrated via quantitative features. In this study, we synthesize counterfactual-labeled structural MRIs using our proposed framework and transform them into gray matter density maps to measure volumetric changes over parcellated regions of interest (ROIs). We also devise a lightweight linear classifier to boost the effectiveness of the constructed ROIs, promote quantitative interpretation, and achieve predictive performance comparable to DL methods. Throughout this process, our framework produces an "AD-relatedness index" for each ROI. It offers an intuitive understanding of brain status for an individual patient and across patient groups concerning AD progression.
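The ROI-wise volumetric measurement can be illustrated as follows. This sketch uses hypothetical inputs and an assumed voxel volume, not the authors' pipeline; it simply sums gray-matter density differences between a real scan and its counterfactual within each parcellated ROI:

```python
# Sketch with hypothetical inputs (random arrays, assumed voxel volume),
# not the authors' pipeline: per-ROI gray-matter volume change between a
# real scan and its counterfactual, given a parcellation label map.
import numpy as np

def roi_volume_changes(gm_real, gm_cf, labels, voxel_ml: float = 0.001):
    out = {}
    for roi in np.unique(labels):
        if roi == 0:                       # skip background
            continue
        mask = labels == roi
        out[int(roi)] = float((gm_cf[mask] - gm_real[mask]).sum() * voxel_ml)
    return out                             # per-ROI volume change (ml)

shape = (4, 4, 4)                          # tiny stand-in volume
real, cf = np.random.rand(*shape), np.random.rand(*shape)
labels = np.random.randint(0, 3, size=shape)
print(roi_volume_changes(real, cf, labels))
```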
Collapse
Affiliation(s)
- Kwanseok Oh
- Department of Artificial Intelligence, Korea University, Seoul 02841, Republic of Korea
| | - Da-Woon Heo
- Department of Artificial Intelligence, Korea University, Seoul 02841, Republic of Korea
| | - Ahmad Wisnu Mulyadi
- Department of Brain and Cognitive Engineering, Korea University, Seoul 02841, Republic of Korea
| | - Wonsik Jung
- Department of Brain and Cognitive Engineering, Korea University, Seoul 02841, Republic of Korea
| | - Eunsong Kang
- Department of Brain and Cognitive Engineering, Korea University, Seoul 02841, Republic of Korea
| | - Kun Ho Lee
- Gwangju Alzheimer's & Related Dementia Cohort Research Center, Chosun University, Gwangju 61452, Republic of Korea; Department of Biomedical Science and Gwangju Alzheimer's & Related Dementia Cohort Research Center, Chosun University, Gwangju 61452, Republic of Korea; Korea Brain Research Institute, Daegu 41062, Republic of Korea.
| | - Heung-Il Suk
- Department of Artificial Intelligence, Korea University, Seoul 02841, Republic of Korea.
| |
Collapse
|
42
|
Wang Y, Wang Y, Khan ZA, Huang A, Sang J. Multi-level feature fusion networks for smoke recognition in remote sensing imagery. Neural Netw 2025; 184:107112. [PMID: 39793493 DOI: 10.1016/j.neunet.2024.107112] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2024] [Revised: 12/10/2024] [Accepted: 12/28/2024] [Indexed: 01/13/2025]
Abstract
Smoke is a critical indicator of forest fires, often detectable before flames ignite. Accurate smoke identification in remote sensing images is vital for effective forest fire monitoring within Internet of Things (IoT) systems. However, existing detection methods frequently falter in complex real-world scenarios, where variable smoke shapes and sizes, intricate backgrounds, and smoke-like phenomena (e.g., clouds and haze) lead to missed detections and false alarms. To address these challenges, we propose the Multi-level Feature Fusion Network (MFFNet), a novel framework grounded in contrastive learning. MFFNet begins by extracting multi-scale features from remote sensing images using a pre-trained ConvNeXt model, capturing information across different levels of granularity to accommodate variations in smoke appearance. The Attention Feature Enhancement Module further refines these multi-scale features, enhancing fine-grained, discriminative attributes relevant to smoke detection. Subsequently, the Bilinear Feature Fusion Module combines these enriched features, effectively reducing background interference and improving the model's ability to distinguish smoke from visually similar phenomena. Finally, contrastive feature learning is employed to improve robustness against intra-class variations by focusing on unique regions within the smoke patterns. Evaluated on the benchmark dataset USTC_SmokeRS, MFFNet achieves an accuracy of 98.87%. Additionally, our model demonstrates a detection rate of 94.54% on the extended E_SmokeRS dataset, with a low false alarm rate of 3.30%. These results highlight the effectiveness of MFFNet in recognizing smoke in remote sensing images, surpassing existing methodologies. The code is accessible at https://github.com/WangYuPeng1/MFFNet.
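The bilinear fusion step can be pictured with a low-rank sketch. The module below uses a common low-rank bilinear pooling pattern as a stand-in for the paper's Bilinear Feature Fusion Module; the dimensions and the elementwise-product factorization are assumptions:

```python
# Low-rank bilinear pooling pattern, used as a stand-in for the paper's
# Bilinear Feature Fusion Module.
import torch
import torch.nn as nn

class BilinearFusion(nn.Module):
    def __init__(self, dim_a: int, dim_b: int, rank: int = 64, out: int = 256):
        super().__init__()
        self.pa = nn.Linear(dim_a, rank)   # project stream A to the rank space
        self.pb = nn.Linear(dim_b, rank)   # project stream B to the rank space
        self.out = nn.Linear(rank, out)

    def forward(self, a, b):
        # elementwise product realizes a rank-constrained bilinear interaction
        return self.out(self.pa(a) * self.pb(b))

a, b = torch.randn(4, 512), torch.randn(4, 1024)
print(BilinearFusion(512, 1024)(a, b).shape)  # torch.Size([4, 256])
```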
Collapse
Affiliation(s)
- Yupeng Wang
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China.
| | - Yongli Wang
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China.
| | - Zaki Ahmad Khan
- Department of Computer Science, University of Worcester, Worcester, UK.
| | - Anqi Huang
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China.
| | - Jianghui Sang
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China.
| |
Collapse
|
43
|
Nwoye CI, Padoy N. SurgiTrack: Fine-grained multi-class multi-tool tracking in surgical videos. Med Image Anal 2025; 101:103438. [PMID: 39708509 DOI: 10.1016/j.media.2024.103438] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/17/2024] [Revised: 12/08/2024] [Accepted: 12/10/2024] [Indexed: 12/23/2024]
Abstract
Accurate tool tracking is essential for the success of computer-assisted intervention. Previous efforts often modeled tool trajectories rigidly, overlooking the dynamic nature of surgical procedures, especially in scenarios such as out-of-body and out-of-camera views. Addressing this limitation, the new CholecTrack20 dataset provides detailed labels that account for multiple tool trajectories from three perspectives: (1) intraoperative, (2) intracorporeal, and (3) visibility, representing the different types of temporal duration of tool tracks. These fine-grained labels enhance tracking flexibility but also increase the task complexity. Re-identifying tools after occlusion or re-insertion into the body remains challenging due to high visual similarity, especially among tools of the same category. This work recognizes the critical role of the tool operators in distinguishing tool track instances, especially those belonging to the same tool category. The operators' information is, however, not explicitly captured in surgical videos. We therefore propose SurgiTrack, a novel deep learning method that leverages YOLOv7 for precise tool detection and employs an attention mechanism to model the originating direction of the tools, as a proxy to their operators, for tool re-identification. To handle diverse tool trajectory perspectives, SurgiTrack employs a harmonizing bipartite matching graph, minimizing conflicts and ensuring accurate tool identity association. Experimental results on CholecTrack20 demonstrate SurgiTrack's effectiveness, outperforming baselines and state-of-the-art methods with real-time inference capability. This work sets a new standard in surgical tool tracking, providing dynamic trajectories for more adaptable and precise assistance in minimally invasive surgeries.
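The identity-association step can be illustrated with a small sketch: detections are matched to existing tracks by solving a bipartite assignment over a combined appearance-plus-direction cost. The cost weights, gating threshold, and the toy "direction" feature below are assumptions for illustration, not the paper's exact formulation:

```python
# Hedged sketch: bipartite matching of detections to tracks, combining an
# appearance embedding distance with a direction-similarity term.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_feats, det_feats, track_dirs, det_dirs,
              w_app=0.7, w_dir=0.3, max_cost=0.6):
    """Match detections to tracks by minimising a combined cosine cost."""
    app_cost = 1.0 - track_feats @ det_feats.T   # (tracks, detections)
    dir_cost = 1.0 - track_dirs @ det_dirs.T
    cost = w_app * app_cost + w_dir * dir_cost
    rows, cols = linear_sum_assignment(cost)
    # Reject pairs whose cost exceeds a gating threshold (likely a new
    # track, or a tool re-inserted after leaving the view).
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]

rng = np.random.default_rng(0)
norm = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
tf, df = norm(rng.normal(size=(3, 128))), norm(rng.normal(size=(4, 128)))
td, dd = norm(rng.normal(size=(3, 2))), norm(rng.normal(size=(4, 2)))
print(associate(tf, df, td, dd))
```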
Collapse
Affiliation(s)
- Chinedu Innocent Nwoye
- University of Strasbourg, CAMMA, ICube, CNRS, INSERM, France; IHU Strasbourg, Strasbourg, France.
| | - Nicolas Padoy
- University of Strasbourg, CAMMA, ICube, CNRS, INSERM, France; IHU Strasbourg, Strasbourg, France
| |
Collapse
|
44
|
Zhong J, Tian W, Xie Y, Liu Z, Ou J, Tian T, Zhang L. PMFSNet: Polarized multi-scale feature self-attention network for lightweight medical image segmentation. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2025; 261:108611. [PMID: 39892086 DOI: 10.1016/j.cmpb.2025.108611] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/14/2024] [Revised: 01/05/2025] [Accepted: 01/19/2025] [Indexed: 02/03/2025]
Abstract
BACKGROUND AND OBJECTIVES Current state-of-the-art medical image segmentation methods prioritize precision, but often at the expense of increased computational demands and larger model sizes. Applying these large-scale models to the relatively limited scale of medical image datasets tends to induce redundant computation, complicating the process without the necessary benefits. Such approaches increase complexity and pose challenges for integrating and deploying lightweight models on edge devices. For instance, recent transformer-based models have excelled in 2D and 3D medical image segmentation due to their extensive receptive fields and high parameter count; however, their effectiveness comes with the risk of overfitting when applied to small datasets, and they often neglect the vital inductive biases of Convolutional Neural Networks (CNNs), which are essential for local feature representation. METHODS In this work, we propose PMFSNet, a novel medical image segmentation model that effectively balances global and local feature processing while avoiding the computational redundancy typical of larger models. PMFSNet streamlines the UNet-based hierarchical structure and simplifies the self-attention mechanism's computational complexity, making it suitable for lightweight applications. It incorporates a plug-and-play PMFS block, a multi-scale feature enhancement module based on attention mechanisms, to capture long-term dependencies. RESULTS Extensive experimental results demonstrate that our method achieves superior performance in various segmentation tasks on different data scales, even with fewer than a million parameters. PMFSNet achieves IoU scores of 84.68%, 82.02%, 78.82%, and 76.48% on the public 3D CBCT Tooth, ovarian tumor ultrasound (MMOTU), skin lesion dermoscopy (ISIC 2018), and gastrointestinal polyp (Kvasir SEG) datasets, and yields DSC scores of 78.29%, 77.45%, and 78.04% on three retinal vessel segmentation datasets, DRIVE, STARE, and CHASE-DB1, respectively. CONCLUSION Our proposed model exhibits competitive performance across various datasets, accomplishing this with significantly fewer model parameters and lower inference time, demonstrating its value for model integration and deployment. It strikes an optimal compromise between efficiency and performance and can be a highly efficient solution for medical image analysis in resource-constrained clinical environments. The source code is available at https://github.com/yykzjh/PMFSNet.
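A plug-and-play multi-scale attention block of the kind the abstract describes might look like the sketch below. The branch kernel sizes, the squeeze-and-excitation-style attention, and the reduction ratio are assumptions for illustration; the authors' actual PMFS block is available at the repository linked above:

```python
# Hedged sketch: a residual block combining parallel depthwise branches at
# several receptive fields with a cheap channel-attention step.
import torch
import torch.nn as nn

class MultiScaleAttentionBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Parallel depthwise branches at different receptive fields.
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in (3, 5, 7)
        ])
        self.fuse = nn.Conv2d(3 * channels, channels, 1)
        # Squeeze-and-excitation style channel attention (few parameters).
        self.att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        ms = torch.cat([b(x) for b in self.branches], dim=1)
        y = self.fuse(ms)
        return x + y * self.att(y)  # residual form keeps it plug-and-play

blk = MultiScaleAttentionBlock(64)
print(blk(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```

Depthwise branches plus a 1×1 fuse keep the parameter count low, which is consistent with the sub-million-parameter budget reported in the results.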
Collapse
Affiliation(s)
- Jiahui Zhong
- School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, 610054, PR China.
| | - Wenhong Tian
- School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, 610054, PR China.
| | - Yuanlun Xie
- School of Electronic Information and Electrical Engineering, Chengdu University, Chengdu 610106, China.
| | - Zhijia Liu
- School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, 610054, PR China.
| | - Jie Ou
- School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, 610054, PR China.
| | - Taoran Tian
- State Key Laboratory of Oral Diseases, National Clinical Research Center for Oral Diseases, West China Hospital of Stomatology, Sichuan University, Chengdu, 610041, PR China.
| | - Lei Zhang
- School of Computer Science, University of Lincoln, LN6 7TS, UK.
| |
Collapse
|
45
|
Zhong Y, Huang Y, Hu J, Zhang Y, Ji R. Towards Accurate Post-Training Quantization of Vision Transformers via Error Reduction. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2025; 47:2676-2692. [PMID: 40031001 DOI: 10.1109/tpami.2025.3528042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/05/2025]
Abstract
Post-training quantization (PTQ) for vision transformers (ViTs) has received increasing attention from both academic and industrial communities due to its minimal data needs and high time efficiency. However, many current methods fail to account for the complex interactions between quantized weights and activations, resulting in significant quantization errors and suboptimal performance. This paper presents ERQ, an innovative two-step PTQ method specifically crafted to sequentially reduce the quantization errors arising from activation and weight quantization. The first step, Activation quantization error reduction (Aqer), applies Reparameterization Initialization to mitigate the initial quantization errors in high-variance activations, then further mitigates the errors by formulating a Ridge Regression problem, which updates the weights maintained at full precision using a closed-form solution. The second step, Weight quantization error reduction (Wqer), applies Dual Uniform Quantization to handle weights with numerous outliers, which arise from adjustments made during Reparameterization Initialization, thereby reducing the initial weight quantization errors; it then employs an iterative approach to further tackle the errors. In each iteration, it adopts Rounding Refinement, which uses an empirically derived, efficient proxy to refine the rounding directions of quantized weights, complemented by a Ridge Regression solver to reduce the errors. Comprehensive experimental results demonstrate ERQ's superior performance across various ViT variants and tasks. For example, ERQ surpasses the state-of-the-art GPTQ by a notable 36.81% in accuracy for W3A4 ViT-S.
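The closed-form weight update can be made concrete with a small worked example. The sketch below uses the canonical ridge-regression form — re-solve the full-precision weights so that the layer output on quantized activations matches the original output; ERQ's exact objective and regularizer may differ:

```python
# Hedged sketch: ridge regression to compensate activation quantization error.
# Solves W' = argmin ||X_q W' - X_fp W||^2 + lam ||W'||^2 in closed form.
import numpy as np

def ridge_weight_update(X_fp, X_q, W, lam=1e-2):
    Y = X_fp @ W                             # target: original layer outputs
    d = X_q.shape[1]
    A = X_q.T @ X_q + lam * np.eye(d)        # (d, d); lam keeps it invertible
    return np.linalg.solve(A, X_q.T @ Y)     # closed-form ridge solution

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 64))               # calibration activations
X_q = np.round(X * 8) / 8                    # toy uniform activation quantizer
W = rng.normal(size=(64, 32))                # full-precision weights
W_new = ridge_weight_update(X, X_q, W)

err_before = np.linalg.norm(X_q @ W - X @ W)
err_after = np.linalg.norm(X_q @ W_new - X @ W)
print(f"output error before: {err_before:.3f}, after: {err_after:.3f}")
```

Because the update has a closed form, no gradient-based fine-tuning is needed, which is what makes this family of PTQ methods cheap in both data and time.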
Collapse
|
46
|
Jiang J, Zhong Y, Yang R, Quan W, Yan DM. MPIC: Exploring alternative approach to standard convolution in deep neural networks. Neural Netw 2025; 184:107082. [PMID: 39754840 DOI: 10.1016/j.neunet.2024.107082] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2024] [Revised: 11/23/2024] [Accepted: 12/19/2024] [Indexed: 01/06/2025]
Abstract
In the rapidly evolving field of deep learning, Convolutional Neural Networks (CNNs) retain their unique strengths and applicability in processing grid-structured data such as images, despite the surge of Transformer architectures. This paper explores alternatives to the standard convolution, with the objective of augmenting its feature extraction prowess while maintaining a similar parameter count. We propose innovative solutions targeting depthwise separable convolution and standard convolution, culminating in our Multi-scale Progressive Inference Convolution (MPIC). MPIC incorporates the benefits of large receptive fields, multi-scale processing, and gradual inference. Our alternatives are not only compatible with existing convolutional variants such as MobileNet, ResNet, and ResNeSt, but also significantly enhance feature extraction capabilities while retaining computational efficiency. Comprehensive experiments on several renowned datasets and in-depth comparisons with the standard convolution validate the efficacy of our proposals, exhibiting significant performance enhancements. Detailed ablation studies further corroborate the effectiveness of the proposed solutions in various computer vision tasks, including object detection, class activation mapping, and salient object detection.
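To show what a multi-scale, progressive drop-in replacement for a standard convolution could look like, here is a heavily hedged sketch; the channel split, kernel sizes, and the way the large-kernel path consumes the refined small-scale output are guesses based only on the abstract's description of "large receptive fields, multi-scale processing, and gradual inference":

```python
# Hedged sketch: a multi-scale progressive convolution block, parameter-light
# via depthwise branches plus a pointwise fuse. Not the authors' MPIC.
import torch
import torch.nn as nn

class ProgressiveMultiScaleConv(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        c = channels // 2
        self.small = nn.Conv2d(c, c, 3, padding=1, groups=c)  # local detail
        self.large = nn.Conv2d(c, c, 7, padding=3, groups=c)  # wide context
        self.mix = nn.Conv2d(channels, channels, 1)           # pointwise fuse

    def forward(self, x):
        a, b = x.chunk(2, dim=1)
        a = self.small(a)
        # "Progressive" here means the wide-context path also sees the refined
        # small-scale output, so scales are processed gradually, not in parallel.
        b = self.large(b + a)
        return self.mix(torch.cat([a, b], dim=1)) + x

m = ProgressiveMultiScaleConv(64)
print(m(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```

Because the block keeps the input/output channel counts equal, it can be swapped into ResNet- or MobileNet-style backbones, which is the kind of compatibility the abstract claims.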
Collapse
Affiliation(s)
- Jie Jiang
- National University of Defense Technology, Department of Systems Engineering, the Laboratory for Big Data and Decision, Changsha, 410073, China
| | - Yi Zhong
- National University of Defense Technology, Department of Systems Engineering, the Laboratory for Big Data and Decision, Changsha, 410073, China.
| | - Ruoli Yang
- National University of Defense Technology, Department of Systems Engineering, the Laboratory for Big Data and Decision, Changsha, 410073, China
| | - Weize Quan
- Institute of Automation, Chinese Academy of Sciences, MAIS, Beijing, 100190, China; University of Chinese Academy of Sciences, Beijing, 101408, China
| | - Dong-Ming Yan
- Institute of Automation, Chinese Academy of Sciences, MAIS, Beijing, 100190, China; University of Chinese Academy of Sciences, Beijing, 101408, China
| |
Collapse
|
47
|
Xie J, Zheng J, Fang W, Cai Y, Li Q. Explicitly diverse visual question generation. Neural Netw 2025; 184:107002. [PMID: 39709645 DOI: 10.1016/j.neunet.2024.107002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2023] [Revised: 11/26/2024] [Accepted: 12/01/2024] [Indexed: 12/24/2024]
Abstract
Visual question generation involves the generation of meaningful questions about an image. Although we have made significant progress in automatically generating a single high-quality question related to an image, existing methods often ignore the diversity and interpretability of generated questions, which are important for various daily tasks that require clear question sources. In this paper, we propose an explicitly diverse visual question generation model that aims to generate diverse questions based on interpretable question sources. To explicitly perform question generation, our model first extracts the scene graph from the image using the unbiased scene graph generation method, where questions generated based on the scene graphs have interpretable question sources. To ensure the diversity of generated questions, our model selects different subgraphs from the scene graph as question sources. Specifically, we employ a subgraph selector to learn how humans select multiple subgraphs that are suitable for question generation. Finally, our model generates diverse questions based on different selected subgraphs. Extensive experiments on the VQA v2.0 and COCO-QA datasets show that the proposed model outperforms the baselines and is able to interpretably generate diverse questions.
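The subgraph-as-question-source idea can be illustrated with a toy sketch. The greedy diversity heuristic below is only a stand-in for the paper's learned subgraph selector, and the scene graph is invented for the example:

```python
# Hedged sketch: pick diverse scene-graph subgraphs as interpretable
# question sources. Real systems score candidates with a learned selector.
import itertools
import random

scene_graph = [  # (subject, relation, object) triples from an image
    ("man", "holding", "racket"), ("man", "wearing", "hat"),
    ("racket", "near", "ball"), ("dog", "beside", "man"),
]

def select_diverse_subgraphs(graph, k=3, size=2, seed=0):
    """Greedily pick up to k subgraphs of `size` triples with little overlap."""
    rng = random.Random(seed)
    candidates = list(itertools.combinations(graph, size))
    rng.shuffle(candidates)
    chosen, used_entities = [], set()
    for sub in candidates:
        entities = {e for (s, _, o) in sub for e in (s, o)}
        # Prefer subgraphs that introduce enough unseen entities.
        if len(entities - used_entities) >= size:
            chosen.append(sub)
            used_entities |= entities
        if len(chosen) == k:
            break
    return chosen

for sub in select_diverse_subgraphs(scene_graph):
    print(sub)  # each subgraph grounds one generated question
```

Each selected subgraph then conditions the question decoder, so every generated question has an explicit, inspectable source in the image's scene graph.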
Collapse
Affiliation(s)
- Jiayuan Xie
- Department of Computing, Hong Kong Polytechnic University, Hong Kong SAR, China
| | - Jiasheng Zheng
- Institute of Software Chinese Academy of Sciences, Beijing, China
| | - Wenhao Fang
- School of Software Engineering, South China University of Technology, Guangzhou, China; Key Laboratory of Big Data and Intelligent Robot (South China University of Technology), Ministry of Education, Guangzhou, China
| | - Yi Cai
- School of Software Engineering, South China University of Technology, Guangzhou, China; Key Laboratory of Big Data and Intelligent Robot (South China University of Technology), Ministry of Education, Guangzhou, China.
| | - Qing Li
- Department of Computing, Hong Kong Polytechnic University, Hong Kong SAR, China
| |
Collapse
|
48
|
Zhang H, Yang T, Wang H, Fan J, Zhang W, Ji M. FDuDoCLNet: Fully dual-domain contrastive learning network for parallel MRI reconstruction. Magn Reson Imaging 2025; 117:110336. [PMID: 39864600 DOI: 10.1016/j.mri.2025.110336] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/02/2024] [Revised: 12/28/2024] [Accepted: 01/23/2025] [Indexed: 01/28/2025]
Abstract
Magnetic resonance imaging (MRI) is a non-invasive medical imaging technique that is widely used for high-resolution imaging of soft tissues and organs. However, the slow speed of MRI acquisition, especially in high-resolution or dynamic scans, makes MRI reconstruction an important research topic. Currently, MRI reconstruction methods based on deep learning (DL) have garnered significant attention, as they improve reconstruction quality by learning complex image features. However, DL-based MR image reconstruction methods exhibit certain limitations. First, existing reconstruction networks seldom account for the diverse frequency features in the wavelet domain. Second, existing dual-domain reconstruction methods may pay too much attention to the features of a single domain (such as the global information in the image domain or the local details in the wavelet domain), resulting in the loss of either critical global structures or fine details in certain regions of the reconstructed image. In this work, inspired by the lifting scheme in wavelet theory, we propose a novel Fully Dual-Domain Contrastive Learning Network (FDuDoCLNet) based on variational networks (VarNet) for accelerating parallel imaging (PI) in both the image and wavelet domains. It is composed of several cascaded dual-domain regularization units and data consistency (DC) layers, in which a novel dual-domain contrastive loss is introduced to optimize reconstruction performance effectively. The proposed FDuDoCLNet was evaluated on the publicly available fastMRI multi-coil knee dataset under a 6× acceleration factor, achieving a PSNR of 34.439 dB and an SSIM of 0.895.
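The data-consistency layer interleaved between regularization units is standard in unrolled networks such as VarNet; a minimal single-coil sketch is shown below (the paper's setting is multi-coil, and it may use a soft, weighted DC variant rather than the hard replacement shown here):

```python
# Hedged sketch: hard data-consistency step for undersampled MRI — re-impose
# the acquired k-space samples on the network's current image estimate.
import torch

def data_consistency(img, k_acquired, mask):
    """img: (H, W) complex estimate; k_acquired: measured k-space; mask: bool."""
    k_est = torch.fft.fft2(img, norm="ortho")
    # Keep the network's prediction where no data was acquired, and the
    # measured samples where it was.
    k_dc = torch.where(mask, k_acquired, k_est)
    return torch.fft.ifft2(k_dc, norm="ortho")

H = W = 64
img = torch.randn(H, W, dtype=torch.complex64)   # current network estimate
mask = torch.rand(H, W) < 0.25                   # ~4x random undersampling
k_full = torch.fft.fft2(torch.randn(H, W, dtype=torch.complex64), norm="ortho")
k_acq = torch.where(mask, k_full, torch.zeros((), dtype=torch.complex64))
out = data_consistency(img, k_acq, mask)
print(out.shape, out.dtype)  # torch.Size([64, 64]) torch.complex64
```

Cascading regularization units with DC layers guarantees the final image never contradicts the measured k-space data, regardless of what the learned components do in between.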
Collapse
Affiliation(s)
- Huiyao Zhang
- School of Information Science and Engineering, Henan University of Technology, Zhengzhou 450001, China
| | - Tiejun Yang
- School of Artificial Intelligence and Big Data, Henan University of Technology, Zhengzhou 450001, China; Key Laboratory of Grain Information Processing and Control (HAUT), Ministry of Education, Zhengzhou, China; Henan Key Laboratory of Grain Photoelectric Detection and Control (HAUT), Zhengzhou, Henan, China.
| | - Heng Wang
- School of Information Science and Engineering, Henan University of Technology, Zhengzhou 450001, China
| | - Jiacheng Fan
- School of Information Science and Engineering, Henan University of Technology, Zhengzhou 450001, China
| | - Wenjie Zhang
- School of Information Science and Engineering, Henan University of Technology, Zhengzhou 450001, China
| | - Mingzhu Ji
- School of Information Science and Engineering, Henan University of Technology, Zhengzhou 450001, China
| |
Collapse
|
49
|
Santer RD, Allen WL. Insect visual perception and pest control: opportunities and challenges. CURRENT OPINION IN INSECT SCIENCE 2025; 68:101331. [PMID: 39827991 DOI: 10.1016/j.cois.2025.101331] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/23/2024] [Revised: 11/20/2024] [Accepted: 01/10/2025] [Indexed: 01/22/2025]
Abstract
Humans and insects inhabit very different perceptual worlds, so human experimenters need to be aware of their perceptual biases when investigating insect behaviour. In applied entomology, human perceptual biases have been a barrier to the rational design, manufacture, and improvement of pest control devices that effectively exploit insect visual behaviour. This review describes how our expanding understanding of insect visual perception, together with the use of visual modelling methods, is reducing the influence of human perceptual bias in this area of applied entomology, and highlights several important challenges that are yet to be overcome.
Collapse
Affiliation(s)
- Roger D Santer
- Department of Life Sciences, Aberystwyth University, Aberystwyth, UK.
| | | |
Collapse
|
50
|
Laumer F, Rubi L, Matter MA, Buoso S, Fringeli G, Mach F, Ruschitzka F, Buhmann JM, Matter CM. 2D echocardiography video to 3D heart shape reconstruction for clinical application. Med Image Anal 2025; 101:103434. [PMID: 39740474 DOI: 10.1016/j.media.2024.103434] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/10/2023] [Revised: 08/01/2024] [Accepted: 12/07/2024] [Indexed: 01/02/2025]
Abstract
Transthoracic Echocardiography (TTE) is a crucial tool for assessing cardiac morphology and function quickly and non-invasively without ionising radiation. However, the examination is subject to intra- and inter-user variability and recordings are often limited to 2D imaging and assessments of end-diastolic and end-systolic volumes. We have developed a novel, fully automated machine learning-based framework to generate a personalised 4D (3D plus time) model of the left ventricular (LV) blood pool with high temporal resolution. A 4D shape is reconstructed from specific 2D echocardiographic views employing deep neural networks, pretrained on a synthetic dataset, and fine-tuned in a self-supervised manner using a novel optimisation method for cross-sectional imaging data. No 3D ground truth is needed for model training. The generated digital twins enhance the interpretation of TTE data by providing a versatile tool for automated analysis of LV volume changes, localisation of infarct areas, and identification of new and clinically relevant biomarkers. Experiments are performed on a multicentre dataset that includes TTE exams of 144 patients with normal TTE and 314 patients with acute myocardial infarction (AMI). The novel biomarkers show a high predictive value for survival (area under the curve (AUC) of 0.82 for 1-year all-cause mortality), demonstrating that personalised 3D shape modelling has the potential to improve diagnostic accuracy and risk assessment.
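One class of biomarker such high-temporal-resolution 4D models enable is illustrated below: deriving ejection fraction and peak ejection rate from an LV volume-time curve. The curve here is synthetic for the example; in the paper's pipeline the volumes would come from the reconstructed 3D shapes at each frame, and the exact biomarkers used may differ:

```python
# Hedged sketch: simple functional biomarkers from an LV volume-time curve.
import numpy as np

t = np.linspace(0.0, 1.0, 50)                 # one cardiac cycle (s)
volume = 120 - 50 * np.sin(np.pi * t) ** 2    # toy LV volume curve (ml)

edv, esv = volume.max(), volume.min()         # end-diastolic / end-systolic
ef = 100.0 * (edv - esv) / edv                # ejection fraction (%)
peak_ejection_rate = np.gradient(volume, t).min()  # most negative dV/dt (ml/s)

print(f"EDV={edv:.1f} ml, ESV={esv:.1f} ml, EF={ef:.1f}%")
print(f"Peak ejection rate={peak_ejection_rate:.1f} ml/s")
```

Because the framework yields the whole volume curve rather than only end-diastolic and end-systolic snapshots, rate-based markers like peak ejection rate become available from routine 2D TTE acquisitions.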
Collapse
Affiliation(s)
- Fabian Laumer
- ETH Zürich, Institute for Machine Learning, Zürich, Switzerland.
| | - Lena Rubi
- ETH Zürich, Institute for Machine Learning, Zürich, Switzerland
| | - Michael A Matter
- University Hospital Zurich and University of Zurich, Center for Translational and Experimental Cardiology, Zürich, Switzerland
| | - Stefano Buoso
- ETH Zürich, Institute for Biomedical Engineering, Zürich, Switzerland
| | | | - François Mach
- Geneva University Hospital, Cardiology, Geneva, Switzerland
| | - Frank Ruschitzka
- University Hospital Zurich and University of Zurich, Center for Translational and Experimental Cardiology, Zürich, Switzerland
| | | | - Christian M Matter
- University Hospital Zurich and University of Zurich, Center for Translational and Experimental Cardiology, Zürich, Switzerland
| |
Collapse
|