1. Dong L, Zhang H, Ma J, Xu X, Yang Y, Wu QMJ. CLRNet: A Cross Locality Relation Network for Crowd Counting in Videos. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:6408-6422. [PMID: 36215378] [DOI: 10.1109/tnnls.2022.3209918]
Abstract
In this article, we propose a new cross locality relation network (CLRNet) to generate high-quality crowd density maps for crowd counting in videos. Specifically, a cross locality relation module (CLRM) is proposed to enhance feature representations by modeling local dependencies of pixels between adjacent frames with an adapted local self-attention mechanism. First, unlike existing methods that measure similarity between pixels by dot product, a new adaptive cosine similarity is introduced to measure the relationship between two positions. Second, traditional self-attention modules usually integrate the reconstructed features with the same weights for all positions. However, crowd movement and background changes in a video sequence are uneven in real-life applications, so it is inappropriate to treat all positions in the reconstructed features equally. To address this issue, a scene consistency attention map (SCAM) is developed to make CLRM pay more attention to positions with strong correlations across adjacent frames. Furthermore, CLRM is incorporated into the network in a coarse-to-fine manner to further enhance the representational capability of the features. Experimental results demonstrate the effectiveness of the proposed CLRNet in comparison with state-of-the-art methods on four public video datasets. The code is available at: https://github.com/Amelie01/CLRNet.
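The adaptive cosine similarity described above replaces the usual dot-product affinity with a normalized one. A minimal NumPy sketch of cosine-similarity self-attention (illustrative only; the names are not from the paper, and the paper's local windowing and SCAM weighting are omitted):

```python
import numpy as np

def cosine_attention(q, k, v, tau=1.0):
    """Self-attention whose affinities are cosine similarities
    rather than raw dot products (sketch, not the authors' code)."""
    qn = q / (np.linalg.norm(q, axis=-1, keepdims=True) + 1e-8)
    kn = k / (np.linalg.norm(k, axis=-1, keepdims=True) + 1e-8)
    sim = qn @ kn.T / tau                    # cosine affinities in [-1, 1]
    sim -= sim.max(axis=-1, keepdims=True)   # softmax numerical stability
    attn = np.exp(sim)
    attn /= attn.sum(axis=-1, keepdims=True) # each row sums to 1
    return attn @ v                          # reconstructed features

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))              # 4 positions, 8-dim features
out = cosine_attention(feats, feats, feats)
print(out.shape)  # (4, 8)
```

Because each attention row is a convex combination, every output feature stays within the per-dimension range of the value features, regardless of feature magnitude.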

2. Lei D, Dong C, Guo H, Ma P, Liu H, Bao N, Kang H, Chen X, Wu Y. A fused multi-subfrequency bands and CBAM SSVEP-BCI classification method based on convolutional neural network. Sci Rep 2024; 14:8616. [PMID: 38616204] [PMCID: PMC11016546] [DOI: 10.1038/s41598-024-59348-1]
Abstract
For brain-computer interface (BCI) systems based on steady-state visual evoked potentials (SSVEP), it is difficult to obtain satisfactory classification performance for short-time-window SSVEP signals with traditional methods. In this paper, a classification method fusing multi-subfrequency bands and a convolutional block attention module (CBAM) based on a convolutional neural network (CBAM-CNN) is proposed for SSVEP-BCI tasks. This method extracts multi-subfrequency-band SSVEP signals as the initial input of the network model and then performs feature fusion on all feature inputs. In addition, CBAM is embedded in both the initial-input and feature-fusion stages for adaptive feature refinement. To verify the effectiveness of the proposed method, this study uses datasets from Inner Mongolia University of Technology (IMUT) and Tsinghua University (THU) to evaluate its performance. The experimental results show that the highest accuracy of CBAM-CNN reaches 0.9813. Within the 0.1-2 s time window, the accuracy of CBAM-CNN is 0.0201-0.5388 higher than that of CNN, CCA-CWT-SVM, CCA-SVM, CCA-GNB, FBCCA, and CCA. The performance advantage of CBAM-CNN is especially significant in the short-time-window range of 0.1-1 s. The maximum information transfer rate (ITR) of CBAM-CNN is 503.87 bit/min, which is 227.53-503.41 bit/min higher than that of the above six EEG decoding methods. The results further show that CBAM-CNN has potential application value in SSVEP decoding.
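CBAM itself is a published module: a channel gate computed from global average- and max-pooled descriptors passed through a shared MLP, followed by a spatial gate computed from channel-wise pooled maps. A simplified NumPy sketch (illustrative; real CBAM applies a 7x7 convolution in the spatial branch, replaced here by a plain average to keep the sketch dependency-free):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """x: (C, H, W); w1: (C//r, C) and w2: (C, C//r) form the shared MLP."""
    avg = x.mean(axis=(1, 2))                     # global average pool -> (C,)
    mx = x.max(axis=(1, 2))                       # global max pool -> (C,)
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)  # shared two-layer MLP (ReLU)
    scale = sigmoid(mlp(avg) + mlp(mx))           # per-channel gate in (0, 1)
    return x * scale[:, None, None]

def spatial_attention(x):
    avg = x.mean(axis=0)                          # channel-wise average -> (H, W)
    mx = x.max(axis=0)                            # channel-wise max -> (H, W)
    # Real CBAM convolves the stacked pooled maps with a 7x7 kernel;
    # a plain average of the two maps stands in here.
    scale = sigmoid((avg + mx) / 2.0)
    return x * scale[None, :, :]

def cbam(x, w1, w2):
    # Channel attention first, then spatial attention, as in CBAM.
    return spatial_attention(channel_attention(x, w1, w2))

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 4, 4))
w1 = rng.normal(size=(2, 8)) * 0.1                # reduction ratio r = 4
w2 = rng.normal(size=(8, 2)) * 0.1
y = cbam(x, w1, w2)
print(y.shape)  # (8, 4, 4)
```

Since both gates lie in (0, 1), CBAM can only attenuate activations, never amplify them, which is why it is usually wrapped around a residual feature path.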
Affiliation(s)
- Dongyang Lei: College of Electric Power, Inner Mongolia University of Technology, Hohhot, 010080, China; Intelligent Energy Technology and Equipment Engineering Research Centre of Colleges and Universities in Inner Mongolia Autonomous Region, Hohhot, 010051, China
- Chaoyi Dong: College of Electric Power, Inner Mongolia University of Technology, Hohhot, 010080, China; Intelligent Energy Technology and Equipment Engineering Research Centre of Colleges and Universities in Inner Mongolia Autonomous Region, Hohhot, 010051, China; Engineering Research Center of Large Energy Storage Technology, Ministry of Education, Hohhot, 010080, China; Inner Mongolia Academy of Science and Technology, Hohhot, 010010, China
- Hongfei Guo: Inner Mongolia Academy of Science and Technology, Hohhot, 010010, China
- Pengfei Ma: College of Electric Power, Inner Mongolia University of Technology, Hohhot, 010080, China; Intelligent Energy Technology and Equipment Engineering Research Centre of Colleges and Universities in Inner Mongolia Autonomous Region, Hohhot, 010051, China
- Huanzi Liu: College of Electric Power, Inner Mongolia University of Technology, Hohhot, 010080, China; Intelligent Energy Technology and Equipment Engineering Research Centre of Colleges and Universities in Inner Mongolia Autonomous Region, Hohhot, 010051, China
- Naqin Bao: College of Electric Power, Inner Mongolia University of Technology, Hohhot, 010080, China; Intelligent Energy Technology and Equipment Engineering Research Centre of Colleges and Universities in Inner Mongolia Autonomous Region, Hohhot, 010051, China
- Hongzhuo Kang: College of Electric Power, Inner Mongolia University of Technology, Hohhot, 010080, China; Intelligent Energy Technology and Equipment Engineering Research Centre of Colleges and Universities in Inner Mongolia Autonomous Region, Hohhot, 010051, China
- Xiaoyan Chen: College of Electric Power, Inner Mongolia University of Technology, Hohhot, 010080, China; Intelligent Energy Technology and Equipment Engineering Research Centre of Colleges and Universities in Inner Mongolia Autonomous Region, Hohhot, 010051, China; Engineering Research Center of Large Energy Storage Technology, Ministry of Education, Hohhot, 010080, China; Inner Mongolia Academy of Science and Technology, Hohhot, 010010, China
- Yi Wu: Inner Mongolia Academy of Science and Technology, Hohhot, 010010, China

3. Chandio AA, Leghari M, Soomro MA, Nizamani SZ, Memon S. A multiscale feature fusion method for cursive text detection in natural scene images. The Imaging Science Journal 2023. [DOI: 10.1080/13682199.2022.2160861]
Affiliation(s)
- Asghar Ali Chandio: School of Engineering and Information Technology, University of New South Wales, Canberra, Australia; Department of Information Technology, Quaid-e-Awam University, Nawabshah, Pakistan
- Mehwish Leghari: Department of Information Technology, Quaid-e-Awam University, Nawabshah, Pakistan
- Muhammad Ali Soomro: Department of Computer Systems Engineering, Quaid-e-Awam University, Nawabshah, Pakistan
- Shah Zaman Nizamani: Department of Information Technology, Quaid-e-Awam University, Nawabshah, Pakistan
- Saifullah Memon: State Key Laboratory of Networking and Switching Technology, BUPT, Beijing, People's Republic of China

4. Bai H, Mao J, Gary Chan SH. A survey on deep learning-based single image crowd counting: Network design, loss function and supervisory signal. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.08.037]

5. Offset-decoupled deformable convolution for efficient crowd counting. Sci Rep 2022; 12:12229. [PMID: 35851829] [PMCID: PMC9293988] [DOI: 10.1038/s41598-022-16415-9]
Abstract
Crowd counting is considered a challenging issue in computer vision. One of its most critical challenges is handling scale variations. Compared with other methods, CNN-based methods achieve better performance. However, given the limits of fixed geometric structures, head-scale features are not fully captured. Deformable convolution with additional offsets is widely used in image classification and pattern recognition, as it can successfully exploit spatial information. However, owing to the randomly generated offset parameters at network initialization, the sampling points of the deformable convolution are stacked in a disorderly way, weakening the effectiveness of feature extraction. To handle the invalid learning of offsets and the inefficient utilization of deformable convolution, an offset-decoupled deformable convolution (ODConv) is proposed in this paper. It can fully capture information within the effective region of the sampling points, leading to better performance. In extensive experiments, average MAEs of 62.3, 8.3, 91.9, and 159.3 are achieved with our method on the ShanghaiTech A, ShanghaiTech B, UCF-QNRF, and UCF_CC_50 datasets, respectively, outperforming state-of-the-art methods and validating the effectiveness of the proposed ODConv.
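Deformable convolution, as discussed above, samples each kernel tap at a learned fractional displacement, which requires bilinear interpolation. A small NumPy sketch of offset-based sampling for one output location (illustrative; this is generic deformable sampling, not the paper's ODConv decoupling scheme):

```python
import numpy as np

def bilinear(img, y, x):
    """Bilinearly sample img (H, W) at fractional coordinates (y, x)."""
    h, w = img.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = y0 + 1, x0 + 1
    y0c, y1c = np.clip([y0, y1], 0, h - 1)   # clamp to the valid region
    x0c, x1c = np.clip([x0, x1], 0, w - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * img[y0c, x0c] + (1 - wy) * wx * img[y0c, x1c]
            + wy * (1 - wx) * img[y1c, x0c] + wy * wx * img[y1c, x1c])

def deformable_sample(img, cy, cx, offsets, kernel=3):
    """Sample a kernel x kernel deformable neighbourhood around (cy, cx).
    offsets: (kernel*kernel, 2) learned (dy, dx) displacements."""
    half = kernel // 2
    vals = []
    for i, (dy, dx) in enumerate(offsets):
        ky, kx = divmod(i, kernel)           # tap position within the kernel
        vals.append(bilinear(img, cy + ky - half + dy, cx + kx - half + dx))
    return np.array(vals)

img = np.arange(25, dtype=float).reshape(5, 5)
vals = deformable_sample(img, 2, 2, np.zeros((9, 2)))
print(vals)  # with zero offsets: the plain 3x3 neighbourhood of (2, 2)
```

With all-zero offsets the operation reduces to an ordinary 3x3 neighbourhood read, which is a handy sanity check when debugging offset learning.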

6. Liu Y, Wang Z, Shi M, Satoh S, Zhao Q, Yang H. Discovering regression-detection bi-knowledge transfer for unsupervised cross-domain crowd counting. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.04.107]

7. Fan Z, Zhang H, Zhang Z, Lu G, Zhang Y, Wang Y. A survey of crowd counting and density estimation based on convolutional neural network. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2021.02.103]

8. Wang C, Wang Z. Progressive Multi-Scale Vision Transformer for Facial Action Unit Detection. Front Neurorobot 2022; 15:824592. [PMID: 35095460] [PMCID: PMC8790567] [DOI: 10.3389/fnbot.2021.824592]
Abstract
Facial action unit (AU) detection is an important task in affective computing and has attracted extensive attention in computer vision and artificial intelligence. Previous studies on AU detection usually encode complex regional feature representations with manually defined facial landmarks and learn to model the relationships among AUs via graph neural networks. Although some progress has been achieved, it is still difficult for existing methods to capture the exclusive and concurrent relationships among different combinations of facial AUs. To circumvent this issue, we propose a new progressive multi-scale vision transformer (PMVT) to capture the complex relationships among different AUs for a wide range of expressions in a data-driven fashion. PMVT is based on a multi-scale self-attention mechanism that can flexibly attend to a sequence of image patches to encode the critical cues for AUs. Compared with previous AU detection methods, the benefits of PMVT are two-fold: (i) PMVT does not rely on manually defined facial landmarks to extract regional representations, and (ii) PMVT encodes facial regions with adaptive receptive fields, thus facilitating flexible representation of different AUs. Experimental results show that PMVT improves AU detection accuracy on the popular BP4D and DISFA datasets. Compared with other state-of-the-art AU detection methods, PMVT obtains consistent improvements. Visualization results show that PMVT automatically perceives the discriminative facial regions for robust AU detection.
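The patch sequences that multi-scale self-attention operates on can be illustrated with plain array reshaping. A NumPy sketch of extracting non-overlapping patch tokens at two scales (illustrative; PMVT's actual embedding and attention layers are omitted):

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W) image into non-overlapping p x p patch tokens."""
    h, w = img.shape
    assert h % p == 0 and w % p == 0
    # Reshape to (h//p, p, w//p, p), then gather each patch as one flat token.
    t = img.reshape(h // p, p, w // p, p).transpose(0, 2, 1, 3)
    return t.reshape(-1, p * p)          # (num_patches, p*p)

img = np.arange(64, dtype=float).reshape(8, 8)
coarse = patchify(img, 4)                # 4 tokens covering large regions
fine = patchify(img, 2)                  # 16 tokens covering small regions
print(coarse.shape, fine.shape)          # (4, 16) (16, 4)
```

Attending over the coarse and fine token sequences jointly is what gives such a model adaptive receptive fields: large patches summarize context while small patches preserve local detail.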
Affiliation(s)
- Chongwen Wang: School of Computer Science, Beijing Institute of Technology, Beijing, China

9. Gao J, Zhao Y. TFE: A Transformer Architecture for Occlusion Aware Facial Expression Recognition. Front Neurorobot 2021; 15:763100. [PMID: 34759808] [PMCID: PMC8573424] [DOI: 10.3389/fnbot.2021.763100]
Abstract
Facial expression recognition (FER) in uncontrolled environments is challenging due to various unconstrained conditions. Although existing deep learning-based FER approaches have been quite promising in recognizing frontal faces, they still struggle to accurately identify facial expressions on faces that are partly occluded in unconstrained scenarios. To mitigate this issue, we propose a transformer-based FER method (TFE) that is capable of adaptively focusing on the most important and unoccluded facial regions. TFE is based on the multi-head self-attention mechanism, which can flexibly attend to a sequence of image patches to encode the critical cues for FER. Compared with the traditional transformer, the novelty of TFE is two-fold: (i) to effectively select the discriminative facial regions, we integrate the attention weights from the various transformer layers into a single attention map to guide the network to perceive the important facial regions; (ii) given an input occluded facial image, we use a decoder to reconstruct the corresponding non-occluded face, so TFE is capable of inferring the occluded regions to better recognize facial expressions. We evaluate the proposed TFE on two prevalent in-the-wild facial expression datasets (AffectNet and RAF-DB) and their modifications with artificial occlusions. Experimental results show that TFE improves recognition accuracy on both non-occluded and occluded faces. Compared with other state-of-the-art FER methods, TFE obtains consistent improvements. Visualization results show that TFE automatically focuses on the discriminative and non-occluded facial regions for robust FER.
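One common way to integrate attention weights from all transformer layers into a single map, as described above for region selection, is attention rollout: chain the per-layer attention matrices while accounting for residual connections. A NumPy sketch (illustrative; the paper's exact fusion scheme may differ):

```python
import numpy as np

def attention_rollout(attns):
    """Fuse per-layer attention matrices into one token-to-token map by
    chaining them layer by layer, with identity added for the residual path."""
    n = attns[0].shape[0]
    rollout = np.eye(n)
    for a in attns:
        a = a + np.eye(n)                      # account for the residual connection
        a = a / a.sum(axis=-1, keepdims=True)  # re-normalise rows
        rollout = a @ rollout                  # propagate attention through the layer
    return rollout

# Three random row-stochastic attention matrices stand in for real layers.
rng = np.random.default_rng(2)
layers = []
for _ in range(3):
    logits = rng.normal(size=(5, 5))
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    layers.append(e / e.sum(axis=-1, keepdims=True))
rollout = attention_rollout(layers)
print(rollout.shape)  # (5, 5)
```

Because a product of row-stochastic matrices is itself row-stochastic, the fused map remains a valid attention distribution over the input patches.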
Affiliation(s)
- Jixun Gao: Department of Computer Science, Henan University of Engineering, Zhengzhou, China
- Yuanyuan Zhao: Department of Computer Science, Zhengzhou University of Technology, Zhengzhou, China

10. Zhang B, Wang N, Zhao Z, Abraham A, Liu H. Crowd counting based on attention-guided multi-scale fusion networks. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2021.04.045]

11. Ku T, Yang Q, Zhang H. Multilevel feature fusion dilated convolutional network for semantic segmentation. Int J Adv Robot Syst 2021. [DOI: 10.1177/17298814211007665]
Abstract
Recently, convolutional neural networks (CNNs) have led to significant improvement in computer vision, especially in the accuracy and speed of semantic segmentation, which has greatly improved robot scene perception. In this article, we propose a multilevel feature fusion dilated convolution network (Refine-DeepLab). By improving the spatial pyramid pooling structure, we propose a multi-scale hybrid dilated convolution module, which captures rich context information and effectively alleviates the contradiction between receptive field size and the dilated convolution operation. At the same time, the high-level and low-level semantic information obtained through multi-level, multi-scale feature extraction effectively improves the capture of global information and the performance of large-scale target segmentation. The encoder-decoder gradually recovers spatial information while capturing high-level semantic information, resulting in sharper object boundaries. Extensive experiments verify the effectiveness of the proposed Refine-DeepLab model: we evaluate our approach thoroughly on the PASCAL VOC 2012 dataset without MS COCO pretraining and achieve a state-of-the-art result of 81.73% mean intersection-over-union on the validation set.
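The receptive-field arithmetic behind hybrid dilation rates is simple to verify: each stride-1 dilated layer with kernel k and rate r adds r*(k-1) to the receptive field, and mixing rates avoids the regular sampling gaps ("gridding") that a uniform rate leaves. A small Python sketch (illustrative; the rate schedules are examples, not the Refine-DeepLab configuration):

```python
def receptive_field(kernel, rates):
    """Receptive field of stacked stride-1 convolutions with the given
    dilation rates (effective kernel = r*(kernel-1) + 1 per layer)."""
    rf = 1
    for r in rates:
        rf += r * (kernel - 1)
    return rf

# A hybrid schedule such as [1, 2, 5] widens the view AND covers every
# pixel offset, whereas a uniform [2, 2, 2] only ever samples even
# offsets from the centre (the gridding artifact).
print(receptive_field(3, [1, 2, 5]))   # 17
print(receptive_field(3, [2, 2, 2]))   # 13
```

This is why hybrid schedules are preferred in dilated segmentation backbones: a larger receptive field at the same depth, without leaving unsampled pixels.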
Affiliation(s)
- Tao Ku: Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang, China; Institutes for Robotics and Intelligent Manufacturing, Chinese Academy of Sciences, Shenyang, China
- Qirui Yang: Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang, China; Institutes for Robotics and Intelligent Manufacturing, Chinese Academy of Sciences, Shenyang, China; University of Chinese Academy of Sciences, Beijing, China
- Hao Zhang: Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang, China; Institutes for Robotics and Intelligent Manufacturing, Chinese Academy of Sciences, Shenyang, China