1
Lee J, Duong PN, Lee H. Configurable Encryption and Decryption Architectures for CKKS-Based Homomorphic Encryption. Sensors (Basel) 2023; 23:7389. [PMID: 37687844; PMCID: PMC10490559; DOI: 10.3390/s23177389] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2023] [Revised: 08/19/2023] [Accepted: 08/21/2023] [Indexed: 09/10/2023]
Abstract
With the increasing number of edge devices connecting to the cloud for storage and analysis, concerns about security and data privacy have become more prominent. Homomorphic encryption (HE) provides a promising solution by not only preserving data privacy but also enabling meaningful computations on encrypted data. While considerable effort has been devoted to accelerating expensive homomorphic evaluation in the cloud, little attention has been paid to optimizing encryption and decryption (ENC-DEC) operations on the edge. In this paper, we propose efficient hardware architectures for CKKS-based ENC-DEC accelerators to facilitate computations on the client side. The proposed architectures are configurable to support a wide range of polynomial sizes with multiplicative depths (up to 30 levels) at a 128-bit security guarantee. We evaluate the hardware designs on the Xilinx XCU250 FPGA platform and achieve an average encryption time 23.7× faster than that of the well-known SEAL HE library. By reducing time complexity and improving the hardware utilization of cryptographic algorithms, our configurable CKKS-supported ENC-DEC hardware designs have the potential to greatly accelerate cryptographic processes on the client side in the post-quantum era.
Affiliation(s)
- Jaehyeok Lee
  - Department of Electrical and Computer Engineering, Inha University, Incheon 22212, Republic of Korea
- Phap Ngoc Duong
  - Department of Electrical and Computer Engineering, Inha University, Incheon 22212, Republic of Korea
  - Faculty of Computer Engineering and Electronics, The University of Danang–Vietnam-Korea University of Information and Communication Technology, Danang 50000, Vietnam
- Hanho Lee
  - Department of Electrical and Computer Engineering, Inha University, Incheon 22212, Republic of Korea
2
Lam DK, Du CV, Pham HL. QuantLaneNet: A 640-FPS and 34-GOPS/W FPGA-Based CNN Accelerator for Lane Detection. Sensors (Basel) 2023; 23:6661. [PMID: 37571445; PMCID: PMC10422460; DOI: 10.3390/s23156661] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2023] [Revised: 07/14/2023] [Accepted: 07/18/2023] [Indexed: 08/13/2023]
Abstract
Lane detection is one of the most fundamental problems in the rapidly developing field of autonomous vehicles. With the dramatic growth of deep learning in recent years, many models have achieved high accuracy on this task. However, most existing deep-learning methods for lane detection face two main problems. First, most early studies follow a segmentation approach, which requires extensive post-processing to extract the necessary geometric information about the lane lines. Second, many models fail to reach real-time speed because of the high complexity of their architectures. To address these problems, this paper proposes a lightweight convolutional neural network for lane detection that outputs only two small arrays, which need minimal post-processing, instead of segmentation maps. The network uses a simple lane representation format for its output and achieves 93.53% accuracy on the TuSimple dataset. A hardware accelerator is proposed and implemented on the Virtex-7 VC707 FPGA platform to optimize processing time and power consumption. The accelerator architecture applies several optimizations: data quantization to reduce the data width to 8 bits, loop-unrolling strategies tailored to the different convolution layers, and pipelined computation across layers. This implementation can process at 640 FPS while consuming only 10.309 W, equating to a computation throughput of 345.6 GOPS and energy efficiency of 33.52 GOPS/W.
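The efficiency figure quoted in this abstract follows directly from the reported throughput and power. The short Python sketch below reproduces that arithmetic. The function names are hypothetical, and the per-frame workload is only an implied value, assuming the 345.6 GOPS figure corresponds to running at the full 640 FPS; it is not a number stated in the abstract.

```python
def energy_efficiency(throughput_gops: float, power_w: float) -> float:
    """Energy efficiency in GOPS/W: throughput divided by power draw."""
    return throughput_gops / power_w


def ops_per_frame(throughput_gops: float, fps: float) -> float:
    """Implied workload per frame in GOP (assumes throughput is measured at this FPS)."""
    return throughput_gops / fps


# Figures taken from the abstract above.
efficiency = energy_efficiency(345.6, 10.309)
workload = ops_per_frame(345.6, 640)

print(f"{efficiency:.2f} GOPS/W")      # matches the reported 33.52 GOPS/W
print(f"{workload:.2f} GOP per frame")  # implied value, not stated in the abstract
```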
Affiliation(s)
- Duc Khai Lam
  - Computer Engineering Department, University of Information Technology, Ho Chi Minh City 700000, Vietnam
  - Vietnam National University, Ho Chi Minh City 700000, Vietnam
- Cam Vinh Du
  - Computer Engineering Department, University of Information Technology, Ho Chi Minh City 700000, Vietnam
  - Vietnam National University, Ho Chi Minh City 700000, Vietnam
- Hoai Luan Pham
  - Graduate School of Information Science, Nara Institute of Science and Technology, Nara 630-0192, Japan
3
Gookyi DAN, Ryoo K. A Lightweight System-On-Chip Based Cryptographic Core for Low-Cost Devices. Sensors (Basel) 2022; 22:3004. [PMID: 35458989; DOI: 10.3390/s22083004] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/09/2022] [Revised: 04/04/2022] [Accepted: 04/11/2022] [Indexed: 02/05/2023]
Abstract
The backbone of the Internet of Things (IoT) platform consists of tiny low-cost devices that continuously exchange data. These devices are usually limited in hardware footprint, memory capacity, and processing power. They are often insecure because implementing standard cryptographic algorithms requires a large hardware footprint, which increases device cost. This study implements a System-on-Chip (SoC) based lightweight cryptographic core that consists of two encryption protocols, four authentication protocols, and a key generation/exchange protocol for ultra-low-cost devices. The hardware architectures use resource sharing to minimize the hardware area. The lightweight cryptographic SoC is tested by designing a desktop software application to serve as an interface to the hardware. The design is implemented in Verilog HDL and synthesized with a 130 nm CMOS cell library, which results in 33 k gate equivalents at a maximum clock frequency of 50 MHz.
4
Tekleyohannes MK, Rybalkin V, Ghaffar MM, Varela JA, Wehn N, Dengel A. iDocChip: A Configurable Hardware Accelerator for an End-to-End Historical Document Image Processing. J Imaging 2021; 7:175. [PMID: 34564101; PMCID: PMC8467298; DOI: 10.3390/jimaging7090175] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2021] [Revised: 08/28/2021] [Accepted: 08/29/2021] [Indexed: 11/16/2022]
Abstract
In recent years, there has been an increasing demand to digitize and electronically access historical records. Optical character recognition (OCR) is typically applied to scanned historical archives to transcribe them from document images into machine-readable texts. Many libraries offer special stationary equipment for scanning historical documents. However, to digitize these records without removing them from where they are archived, portable devices that combine scanning and OCR capabilities are required. An existing end-to-end OCR software called anyOCR achieves high recognition accuracy for historical documents. However, it is unsuitable for portable devices, as it exhibits high computational complexity resulting in long runtime and high power consumption. Therefore, we have designed and implemented a configurable hardware-software programmable SoC called iDocChip that makes use of anyOCR techniques to achieve high accuracy. As a low-power and energy-efficient system with real-time capabilities, the iDocChip delivers the required portability. In this paper, we present the hybrid CPU-FPGA architecture of iDocChip along with the optimized software implementations of the anyOCR. We demonstrate our results on multiple platforms with respect to runtime and power consumption. The iDocChip system outperforms the existing anyOCR by 44× while achieving 2201× higher energy efficiency and a 3.8% increase in recognition accuracy.
Affiliation(s)
- Menbere Kina Tekleyohannes
  - Microelectronic Systems Design Research Group, University of Kaiserslautern, 67663 Kaiserslautern, Germany
- Vladimir Rybalkin
  - Microelectronic Systems Design Research Group, University of Kaiserslautern, 67663 Kaiserslautern, Germany
- Muhammad Mohsin Ghaffar
  - Microelectronic Systems Design Research Group, University of Kaiserslautern, 67663 Kaiserslautern, Germany
- Javier Alejandro Varela
  - Microelectronic Systems Design Research Group, University of Kaiserslautern, 67663 Kaiserslautern, Germany
- Norbert Wehn
  - Microelectronic Systems Design Research Group, University of Kaiserslautern, 67663 Kaiserslautern, Germany
- Andreas Dengel
  - German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany
5
Zhao Y, Lu J, Chen X. An Accelerator Design Using a MTCA Decomposition Algorithm for CNNs. Sensors (Basel) 2020; 20:5558. [PMID: 32998366; PMCID: PMC7583864; DOI: 10.3390/s20195558] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 08/24/2020] [Revised: 09/17/2020] [Accepted: 09/26/2020] [Indexed: 11/24/2022]
Abstract
Due to the high throughput and high computing capability of convolutional neural networks (CNNs), researchers are paying increasing attention to the design of CNN hardware accelerator architectures. Accordingly, in this paper, we propose a block parallel computing algorithm based on the matrix transformation computing algorithm (MTCA) to realize the convolution expansion and resolve the block problem of the intermediate matrix. It enables highly parallel implementation in hardware. Moreover, we provide a specific calculation method for the optimal partition of matrix multiplication to optimize performance. In our evaluation, the proposed method saves more than 60% of hardware storage space compared with the im2col (image to column) approach. More specifically, in the case of large-scale convolutions, it saves nearly 82% of storage space. Under the accelerator architecture framework designed in this paper, we achieve a performance of 26.7–33.4 GFLOPS (depending on convolution type) on an FPGA (Field Programmable Gate Array) by reducing bandwidth and improving data reusability. It is 1.2×–4.0× faster than memory-efficient convolution (MEC) and im2col, respectively, and represents an effective solution for a large-scale convolution accelerator.
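For context on the storage comparison, the baseline im2col approach can be sketched in a few lines of Python. This is a minimal illustration of standard im2col only (stride 1, no padding, single channel); it is not the MTCA block algorithm proposed in the paper, and the function names are hypothetical. It shows why im2col costs extra storage: overlapping patches are copied into the intermediate matrix, so the same input value is stored several times.

```python
def im2col(image, k):
    """Unroll every k x k patch of a 2-D image into one row (stride 1, no padding)."""
    height, width = len(image), len(image[0])
    rows = []
    for i in range(height - k + 1):
        for j in range(width - k + 1):
            rows.append([image[i + di][j + dj] for di in range(k) for dj in range(k)])
    return rows


def conv2d_im2col(image, kernel):
    """Convolution (CNN-style cross-correlation) as a matrix-vector product over im2col rows."""
    k = len(kernel)
    flat_kernel = [v for row in kernel for v in row]
    out_width = len(image[0]) - k + 1
    flat = [sum(a * b for a, b in zip(patch, flat_kernel)) for patch in im2col(image, k)]
    return [flat[r * out_width:(r + 1) * out_width] for r in range(len(flat) // out_width)]


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
kernel = [[1, 0],
          [0, -1]]

result = conv2d_im2col(image, kernel)
stored = sum(len(row) for row in im2col(image, 2))

print(result)                                   # 3 x 3 output, every entry is -5
print(stored, "values stored for 16 input pixels")  # 36 values: patches overlap
```

In this toy case the intermediate matrix holds 36 values for a 16-pixel input, and the expansion grows with kernel size, which is the overhead that storage-saving schemes such as the one described above aim to reduce.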
6
Al Koutayni MR, Rybalkin V, Malik J, Elhayek A, Weis C, Reis G, Wehn N, Stricker D. Real-Time Energy Efficient Hand Pose Estimation: A Case Study. Sensors (Basel) 2020; 20:2828. [PMID: 32429341; DOI: 10.3390/s20102828] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 02/14/2020] [Revised: 05/04/2020] [Accepted: 05/12/2020] [Indexed: 01/09/2023]
Abstract
The estimation of human hand pose has become the basis for many vital applications in which the user depends mainly on the hand pose as a system input. Virtual reality (VR) headsets, the shadow dexterous hand, and in-air signature verification are a few examples of applications that require tracking hand movements in real time. State-of-the-art 3D hand pose estimation methods are based on the Convolutional Neural Network (CNN). These methods are implemented on Graphics Processing Units (GPUs), mainly because of their extensive computational requirements. However, GPUs are not suitable for practical application scenarios where low power consumption is crucial. Furthermore, the difficulty of embedding a bulky GPU into a small device hinders porting such applications to mobile devices. The goal of this work is to provide an energy-efficient solution for an existing depth-camera-based hand pose estimation algorithm. First, we compress the deep neural network model by applying dynamic quantization techniques to different layers to achieve maximum compression without compromising accuracy. Afterwards, we design a custom hardware architecture. We selected an FPGA as the target platform because FPGAs provide high energy efficiency and can be integrated into portable devices. Our solution, implemented on a Xilinx UltraScale+ MPSoC FPGA, is 4.2× faster and 577.3× more energy efficient than the original implementation of the hand pose estimation algorithm on an NVIDIA GeForce GTX 1070.
7
Abstract
Single-pass connected components analysis (CCA) algorithms suffer from a time overhead to resolve labels at the end of each image row. This work demonstrates how this overhead can be eliminated by replacing the conventional raster scan with a zig-zag scan. This enables chains of labels to be correctly resolved while processing the next image row. The effect is faster processing in the worst case, with no end-of-row overheads. CCA hardware architectures using the novel algorithm proposed in this paper are, therefore, able to process images at higher throughput than other state-of-the-art methods while reducing the hardware requirements. The latency introduced by the conversion from raster scan to zig-zag scan is compensated for by a new method of detecting object completion, which enables the feature vector for completed connected components to be output at the earliest possible opportunity.
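The difference between the two scan orders can be illustrated with a small Python sketch. It assumes that "zig-zag scan" means alternating the direction of travel on each row (left to right on even rows, right to left on odd rows); the abstract does not spell this out, and the function names are hypothetical. The sketch only generates pixel visiting orders and does not reproduce the label resolution logic of the paper.

```python
def raster_scan(height, width):
    """Conventional raster order: every row is visited left to right."""
    for r in range(height):
        for c in range(width):
            yield (r, c)


def zigzag_scan(height, width):
    """Zig-zag order (assumed): even rows left to right, odd rows right to left."""
    for r in range(height):
        cols = range(width) if r % 2 == 0 else range(width - 1, -1, -1)
        for c in cols:
            yield (r, c)


def max_step(scan):
    """Largest Manhattan distance between consecutively visited pixels."""
    coords = list(scan)
    return max(abs(r2 - r1) + abs(c2 - c1)
               for (r1, c1), (r2, c2) in zip(coords, coords[1:]))


print(list(zigzag_scan(2, 3)))
print("raster max step:", max_step(raster_scan(4, 5)))   # jump back to column 0 at each row end
print("zig-zag max step:", max_step(zigzag_scan(4, 5)))  # consecutive pixels are always adjacent
```

In the zig-zag order the scan starts each new row directly below the pixel where the previous row ended, so there is no jump across the image at the row boundary.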
Affiliation(s)
- Donald G. Bailey
  - Department of Mechanical and Electrical Engineering, School of Food and Advanced Technology, Massey University, Palmerston North 4442, New Zealand
8
Zhou H, Machupalli R, Mandal M. Efficient FPGA Implementation of Automatic Nuclei Detection in Histopathology Images. J Imaging 2019; 5:21. [PMID: 34465711; PMCID: PMC8320863; DOI: 10.3390/jimaging5010021] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 11/30/2018] [Revised: 12/27/2018] [Accepted: 01/11/2019] [Indexed: 11/17/2022]
Abstract
Accurate and efficient detection of cell nuclei is an important step towards the development of pathology-based Computer Aided Diagnosis. High-resolution histopathology images are generally very large, on the order of a billion pixels; nuclei detection is therefore a highly compute-intensive task, and a software implementation requires a significant amount of processing time. To assist doctors in real time, special hardware accelerators that reduce the processing time are required. In this paper, we propose a Field Programmable Gate Array (FPGA) implementation of an automated nuclei detection algorithm using generalized Laplacian of Gaussian filters. The experimental results show that the implemented architecture has the potential to provide a significant improvement in processing time without losing detection accuracy.
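As background for the filter named in this abstract, the standard isotropic Laplacian of Gaussian (LoG) kernel can be sampled as in the Python sketch below. This is only the textbook isotropic form, LoG(x, y) = -(1 / (πσ⁴)) · (1 − r² / (2σ²)) · exp(−r² / (2σ²)) with r² = x² + y²; the generalized LoG filters used in the paper also vary shape and orientation, which is not reproduced here. The function name and parameters are hypothetical.

```python
import math


def log_kernel(sigma, radius):
    """Sample the isotropic Laplacian of Gaussian on a (2*radius+1)^2 integer grid."""
    kernel = []
    for y in range(-radius, radius + 1):
        row = []
        for x in range(-radius, radius + 1):
            s = (x * x + y * y) / (2.0 * sigma ** 2)
            row.append(-(1.0 - s) * math.exp(-s) / (math.pi * sigma ** 4))
        kernel.append(row)
    return kernel


kernel = log_kernel(sigma=1.0, radius=3)
center = kernel[3][3]

# The response is negative inside r = sigma * sqrt(2) and positive outside,
# which is what makes the filter respond strongly to blob-like structures.
print(f"center: {center:.4f}")
print(f"one pixel from center: {kernel[3][4]:.4f}")
print(f"two pixels from center: {kernel[3][5]:.4f}")
```

Convolving an image with such kernels at several values of sigma and looking for extrema of the response is a common way to detect blobs of different sizes; how the paper maps this onto FPGA resources is not described in the abstract.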
9
Jiang G, Liu L, Zhu W, Yin S, Wei S. A 181 GOPS AKAZE Accelerator Employing Discrete-Time Cellular Neural Networks for Real-Time Feature Extraction. Sensors (Basel) 2015; 15:22509-22529. [PMID: 26404305; PMCID: PMC4610552; DOI: 10.3390/s150922509] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 07/17/2015] [Revised: 08/20/2015] [Accepted: 08/25/2015] [Indexed: 06/05/2023]
Abstract
This paper proposes a real-time feature extraction VLSI architecture for high-resolution images based on the accelerated KAZE algorithm. Firstly, a new system architecture is proposed. It increases the system throughput, provides flexibility in image resolution, and offers trade-offs between speed and scaling robustness. The architecture consists of a two-dimensional pipeline array that fully utilizes computational similarities in octaves. Secondly, a substructure (block-serial discrete-time cellular neural network) that can realize a nonlinear filter is proposed. This structure decreases the memory demand through the removal of data dependency. Thirdly, a hardware-friendly descriptor is introduced in order to overcome the hardware design bottleneck through the polar sample pattern; a simplified method to realize rotation invariance is also presented. Finally, the proposed architecture is designed in TSMC 65 nm CMOS technology. The experimental results show a performance of 127 fps in full HD resolution at 200 MHz frequency. The peak performance reaches 181 GOPS and the throughput is double the speed of other state-of-the-art architectures.
Affiliation(s)
- Guangli Jiang
  - Institute of Microelectronics, Tsinghua University, Beijing 100084, China
- Leibo Liu
  - Institute of Microelectronics, Tsinghua University, Beijing 100084, China
- Wenping Zhu
  - Institute of Microelectronics, Tsinghua University, Beijing 100084, China
- Shouyi Yin
  - Institute of Microelectronics, Tsinghua University, Beijing 100084, China
- Shaojun Wei
  - Institute of Microelectronics, Tsinghua University, Beijing 100084, China