A Binary Convolutional Encoder-decoder Network for Real-time Natural Scene Text Processing

Project Overview

The success of convolutional neural networks (CNNs) has produced a potential general-purpose machine learning engine for various computer vision applications (LeCun et al. 1998; Krizhevsky, Sutskever, and Hinton 2012), such as text detection, recognition and interpretation from images. However, applications such as Advanced Driver Assistance Systems (ADAS) that read road signs with text require a real-time processing capability that is beyond existing approaches (Jaderberg et al. 2014; Jaderberg, Vedaldi, and Zisserman 2014) in terms of processing functionality, efficiency and latency.

A real-time scene text recognition application needs a method that is memory-efficient and fast. In this work, we show that binary features (Courbariaux and Bengio 2016) can effectively and efficiently represent scene text images. Combining them with a deconvolution technique, we introduce a binary convolutional encoder-decoder network (B-CEDNet) for real-time one-shot character detection and recognition. Scene text recognition is further enhanced with a back-end character-level sequential correction and classification stage based on a bidirectional recurrent neural network (Bi-RNN). Instead of detecting characters sequentially (Bissacco et al. 2013; Wang et al. 2012; Shi, Bai, and Yao 2015), our proposed method, called SqueezedText, detects multiple characters simultaneously and extracts a variable-length character sequence with the corresponding spatial information. This sequence is then fed into the Bi-RNN, which learns the detection error characteristics of the previous stage to provide character-level correction and classification based on spatial and contextual cues.
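To illustrate why binary features save both memory and compute, the following minimal sketch packs sign-binarized vectors into integer bit masks and computes their dot product with XOR and a bit count. It is an assumption-laden illustration of the general binarized-network idea, not the B-CEDNet implementation; the function names `binarize`, `pack_bits` and `binary_dot` and the sample values are hypothetical.

```python
def binarize(values):
    """Map real-valued features to {-1, +1} with the sign function (0 maps to +1)."""
    return [1 if v >= 0 else -1 for v in values]


def pack_bits(signs):
    """Pack a {-1, +1} vector into an int, one bit per element (+1 -> 1, -1 -> 0)."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed {-1, +1} vectors of length n.

    Positions where the bits differ contribute -1 and the rest +1,
    so the result is n - 2 * popcount(a XOR b).
    """
    differing = bin(a_bits ^ b_bits).count("1")
    return n - 2 * differing


# Hypothetical sample values, chosen only for illustration.
features = [0.7, -1.2, 0.05, -0.3, 2.1, -0.9, 0.4, 1.5]
weights = [-0.2, -0.8, 0.6, 0.1, 1.1, -0.4, -0.7, 0.9]

fa, wa = binarize(features), binarize(weights)
reference = sum(x * y for x, y in zip(fa, wa))            # plain +/-1 arithmetic
fast = binary_dot(pack_bits(fa), pack_bits(wa), len(fa))  # XOR + bit count
print(reference, fast)
```

Each element occupies one bit instead of a 32-bit float, and the multiply-accumulate loop collapses into a single XOR followed by a bit count, which is the kind of saving a binarized network can exploit.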

Trained with over 1,000,000 synthetic scene text images, the proposed SqueezedText achieves a recall of 0.86, a precision of 0.88 and an F-score of 0.87 on the ICDAR-03 (Lucas et al. 2003) dataset. More importantly, it achieves state-of-the-art accuracy of 93.8%, 92.7%, 94.3%, 96.1% and 83.6% on the ICDAR-03, ICDAR-13, IIIT5K, SVT and Synth90K datasets, respectively. SqueezedText is realized on a GPU with a small network size of 1.01 MB for B-CEDNet and 3.23 MB for Bi-RNN, and it takes less than 1 ms of inference runtime on average. It is up to 4× faster and 6× smaller than state-of-the-art work.
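As a quick consistency check on the reported detection figures, the sketch below computes the F-score from the stated precision and recall. It assumes the reported F-score is the balanced F1 measure (the harmonic mean of precision and recall), which the text does not state explicitly; the helper name `f_score` is hypothetical.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the balanced F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for ICDAR-03 in the text above.
f1 = f_score(precision=0.88, recall=0.86)
print(round(f1, 2))  # 0.87
```

Under that assumption, the three reported numbers agree with one another.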

The contributions of this work are summarized as follows:

We propose a novel binary convolutional encoder-decoder neural network model, which acts as a visual front-end module to provide unconstrained scene text detection and recognition. It detects individual characters with a high recall rate while achieving an extremely fast run-time speed and small memory consumption.
We show that text features can be learned and encoded in binary format without loss of discriminative information. This information can be further decoded and recovered to perform multi-character detection and recognition in parallel.
We further design a back-end bidirectional RNN (Bi-RNN) to provide fast and robust scene text recognition with correction and classification.

Figure 1: SqueezedText overview: The B-CEDNet produces salience maps for each character, which reveal the character's category and spatial information. Thresholding and morphological filtering find the position and size of each character region; these regions are then organized into a vector sequence for the contextual correction and text classification provided by the Bi-RNN.
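The post-processing step described in the Figure 1 caption can be sketched as follows. This is a simplified illustration, not the authors' implementation: it thresholds a toy salience map, groups above-threshold pixels into 4-connected regions, and discards very small regions as a stand-in for morphological filtering. The function name `character_boxes`, the `threshold` and `min_area` values, and the sample map are all assumptions made for the example.

```python
from collections import deque


def character_boxes(salience, threshold=0.5, min_area=2):
    """Return (x, y, width, height) boxes of connected above-threshold regions.

    salience: 2D list of confidences in [0, 1] (rows of equal length).
    threshold and min_area are illustrative values, not taken from the paper.
    """
    rows, cols = len(salience), len(salience[0])
    mask = [[v >= threshold for v in row] for row in salience]
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill over 4-connected neighbours.
            queue = deque([(r, c)])
            seen[r][c] = True
            r0 = r1 = r
            c0 = c1 = c
            area = 0
            while queue:
                y, x = queue.popleft()
                area += 1
                r0, r1 = min(r0, y), max(r1, y)
                c0, c1 = min(c0, x), max(c1, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Dropping tiny regions stands in for morphological noise removal.
            if area >= min_area:
                boxes.append((c0, r0, c1 - c0 + 1, r1 - r0 + 1))
    # Order boxes left to right so they form a character sequence.
    return sorted(boxes)


# Hypothetical 3x7 salience map with two character-like blobs and one noisy pixel.
salience_map = [
    [0.1, 0.9, 0.8, 0.0, 0.0, 0.7, 0.9],
    [0.0, 0.9, 0.9, 0.0, 0.6, 0.2, 0.8],
    [0.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.9],
]
boxes = character_boxes(salience_map)
print(boxes)
```

The left-to-right ordering of the boxes gives the position and size sequence that a back-end sequence model could consume; in the real pipeline each region would also carry a character category taken from the per-character salience maps.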

Figure 2: The architecture of Binary Convolutional Encoder-decoder Network (B-CEDNet).

Figure 3: Bi-RNN architecture for contextual text correction and classification: The “update” gate decides which elements in the sequence are accepted to update the state Ct, based on the category and spatial information in Ut; the “reset” gate determines where a word ends.

Figure 4: Examples in the synthetic data set for training of B-CEDNet. There are 1 million training images with pixel-wise labels.

Figure 5: Visualization of binary activation of each convolutional block as well as the generated salience maps and bounding boxes.

Figure 6: The trade-off between confidence threshold and character retrieval performance.

Figure 7: Test images and corresponding salience maps and predictions. In the salience maps, high-confidence text regions are rendered in red and white. The pixel-wise predictions are labeled with different colors.

Figure 8: Run-time comparison between B-CEDNet and its full-precision version (CEDNet).

Figure 9: Accuracy comparison of existing scene text recognition approaches.

Figure 10: Storage and speed comparison between B-CEDNet and existing methods.

Publications

Conference Proceedings

Z. Liu, Li, Y. I., Ren, F., Yu, H., and Goh, W., “SqueezedText: A Real-time Scene Text Recognition by Binary Convolutional Encoder-decoder Network”, The AAAI Conference on Artificial Intelligence (AAAI). New Orleans, Louisiana, pp. 7194-7201, 2018.

Z. Liu, Li, Y., Ren, F., and Yu, H., “A Binary Convolutional Encoder-decoder Network for Real-time Natural Scene Text Processing”, The 1st International Workshop on Efficient Methods for Deep Neural Networks - Conference on Neural Information Processing Systems (NIPS). 2016.

Acknowledgements

The work by Arizona State University is supported by a Cisco Research grant (CG#594589).