A 34-FPS 698-GOP/s/W Binarized Deep Neural Network-based Natural Scene Text Interpretation Accelerator for Mobile Edge Computing

Title: A 34-FPS 698-GOP/s/W Binarized Deep Neural Network-based Natural Scene Text Interpretation Accelerator for Mobile Edge Computing
Publication Type: Journal Article
Year of Publication: 2019
Authors: Li, YI; Liu, Z; Liu, W; Jiang, Y; Wang, Y; Goh, WLing; Yu, H; Ren, F
Journal: IEEE Transactions on Industrial Electronics (TIE)
Volume: 66
Issue: 9
Pagination: 7407-7416
Date Published: 10/2018
Keywords (or New Research Field): psclab
Abstract

Scene text interpretation is a critical part of natural scene interpretation. Currently, most existing work is based on high-end GPU implementations, which are commonly used on the server side. However, in IoT application scenarios, the communication overhead from the edge device to the server is quite large and sometimes even dominates the total processing time. Hence, an edge-computing-oriented design is needed to solve this problem. In this paper, we present an architectural design and implementation of a natural scene text interpretation (NSTI) accelerator, which can classify and localize the text region at the pixel level efficiently in real time on mobile devices. To target real-time and low-latency processing, the Binary Convolutional Encoder-decoder Network (B-CEDNet) is adopted as the core architecture to enable massive parallelism due to its binary feature. Massively parallelized computations and a highly pipelined data flow control enhance its latency and throughput performance. In addition, all the binarized intermediate results and parameters are stored on chip to eliminate the power consumption and latency overhead of off-chip communication. The NSTI accelerator is implemented in a 40 nm CMOS technology and can process scene text images (size of 128 × 32) at 34 fps with a latency of 40 ms for pixelwise interpretation, with pixelwise classification accuracy over 90% on the ICDAR-03 and ICDAR-13 datasets. The real energy efficiency is 698 GOP/s/W, and the peak energy efficiency reaches up to 7825 GOP/s/W. The proposed accelerator is 7× more energy efficient than its optimized GPU-based implementation counterpart, while maintaining real-time throughput with a latency of 40 ms.
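
The abstract attributes the network's parallelism to its binary feature. As an illustration only, the sketch below shows the XNOR-popcount trick commonly used in binarized networks: when weights and activations are restricted to +1/-1 and packed one bit each, a dot product reduces to a bitwise XNOR followed by a bit count. This is a generic technique, not the accelerator's documented datapath; the function names `pack_bits`, `binary_dot`, and `reference_dot` are hypothetical and chosen for this example.

```python
def pack_bits(values):
    """Pack a sequence of +1/-1 values into an int (bit i is 1 for +1, 0 for -1)."""
    word = 0
    for i, v in enumerate(values):
        if v not in (1, -1):
            raise ValueError("binarized values must be +1 or -1")
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n via XNOR and popcount."""
    mask = (1 << n) - 1
    # XNOR sets a bit wherever the two operands agree (product is +1).
    xnor = ~(a_bits ^ b_bits) & mask
    matches = bin(xnor).count("1")
    # Each match contributes +1 and each mismatch -1: matches - (n - matches).
    return 2 * matches - n


def reference_dot(a, b):
    """Plain integer dot product, used only to check the bitwise version."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical 8-element activation and weight vectors.
activations = [1, -1, 1, 1, -1, -1, 1, -1]
weights = [1, 1, -1, 1, -1, 1, 1, -1]

packed_result = binary_dot(pack_bits(activations), pack_bits(weights), len(weights))
plain_result = reference_dot(activations, weights)
print(packed_result, plain_result)
```

Because every multiply-accumulate collapses to single-bit logic, many such operations can be evaluated side by side in hardware, and a one-bit representation of parameters and intermediate results is what makes keeping them entirely on chip plausible.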

URL: https://ieeexplore.ieee.org/document/8513982
DOI: 10.1109/TIE.2018.2875643