Learning in the Frequency Domain

Project Overview

Convolutional neural networks (CNNs) have revolutionized computer vision because of their exceptional performance on tasks such as image classification, object detection, and semantic segmentation. Constrained by computing resources and memory limitations, most CNN models only accept RGB images at low resolutions (e.g., 224×224). However, images produced by modern cameras are usually much larger. For example, high-definition (HD) images (1920×1080) are considered relatively small by modern standards. Even the average image resolution in the ImageNet dataset is 482×415, which is roughly four times the size accepted by most CNN models. Therefore, a large portion of real-world images are aggressively downsized to 224×224 to meet the input requirement of classification networks. However, image downsizing inevitably incurs information loss and accuracy degradation. Prior works aim to reduce information loss by learning task-aware downsizing networks. However, those networks are task-specific and require additional computation, which is not favorable in practical applications.

In this research, we propose to reshape the high-resolution images in the frequency domain, i.e., the discrete cosine transform (DCT) domain, rather than resizing them in the spatial domain, and then feed the reshaped DCT coefficients to CNN models for inference. Our method requires little modification to existing CNN models that take RGB images as input. Thus, it is a universal replacement for the routine data pre-processing pipelines. We demonstrate that our method achieves higher accuracy in image classification, object detection, and instance segmentation tasks than the conventional RGB-based methods with an equal or smaller input data size. The proposed method also directly reduces the required inter-chip communication bandwidth, which is often a bottleneck in modern deep learning inference systems: the computational throughput of rapidly evolving AI accelerators/GPUs is increasingly higher than the data loading throughput of CPUs, as shown in Figure 1.

Inspired by the observation that the human visual system (HVS) has unequal sensitivity to different frequency components, we analyze the image classification, detection, and segmentation tasks in the frequency domain and find that CNN models are more sensitive to low-frequency channels than to high-frequency channels, which coincides with the HVS. This observation is validated by a learning-based channel selection method that consists of multiple “on-off switches”. The DCT coefficients with the same frequency are packed into one channel, and each switch is attached to a specific frequency channel to either allow the entire channel to flow into the network or block it.
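The channel packing and on-off switching described above can be sketched in plain Python. This is a minimal illustration, not the project's implementation: the 8×8 block size, the orthonormal DCT-II scaling, and the function names `dct_1d`, `dct_2d`, `pack_frequency_channels`, and `apply_switches` are all assumptions made for this example, and the switches are passed in as a fixed list rather than learned.

```python
import math

BLOCK = 8  # assumed JPEG-style 8x8 block size; not stated in the text above


def dct_1d(vec):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(vec)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                    for i, v in enumerate(vec))
        out.append(scale * total)
    return out


def dct_2d(block):
    """Separable 2-D DCT: transform every row, then every column."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]  # back to [vertical][horizontal]


def pack_frequency_channels(plane):
    """Split one image plane into 8x8 blocks and group coefficients that
    share a frequency, giving channels[freq][block_row][block_col]."""
    bh, bw = len(plane) // BLOCK, len(plane[0]) // BLOCK
    channels = [[[0.0] * bw for _ in range(bh)] for _ in range(BLOCK * BLOCK)]
    for by in range(bh):
        for bx in range(bw):
            block = [plane[by * BLOCK + y][bx * BLOCK:(bx + 1) * BLOCK]
                     for y in range(BLOCK)]
            coeffs = dct_2d(block)
            for u in range(BLOCK):
                for v in range(BLOCK):
                    # Row-major frequency index (an assumed ordering).
                    channels[u * BLOCK + v][by][bx] = coeffs[u][v]
    return channels


def apply_switches(channels, switches):
    """Pass a whole channel through only where its switch is on (1)."""
    return [ch for ch, on in zip(channels, switches) if on]


# Usage: a flat 16x16 plane yields 64 channels of size 2x2, with all of
# its energy in the lowest-frequency (DC) channel.
plane = [[10.0] * 16 for _ in range(16)]
channels = pack_frequency_channels(plane)
kept = apply_switches(channels, [1] * 8 + [0] * 56)
```

Under the 8×8 block assumption, each image plane produces 64 frequency channels, so three color components would give 3 × 64 = 192 channels, which is consistent with the 192 channels mentioned in the Figure 3 caption.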

Using decoded high-fidelity images for model training and inference poses significant challenges from both data transfer and computation perspectives. Because of the spectral bias of CNN models, one can keep only the important frequency channels during inference without losing accuracy. In this research, we also develop a static channel selection approach that preserves the salient channels rather than using the entire frequency spectrum for inference. Experimental results show that the CNN models retain the same accuracy when the input data size is reduced by 87.5%.
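To make the 87.5% figure concrete, the arithmetic can be worked through with the tensor sizes quoted elsewhere on this page (192 frequency channels in total, a 56×56 spatial size for the DCT input, and a 224×224×3 RGB baseline). The sketch below is only illustrative: the helper names are hypothetical, and it assumes the reduction is measured against the full 192-channel input.

```python
def input_elements(height, width, channels):
    """Number of values in an input tensor of the given shape."""
    return height * width * channels


# Conventional RGB input of most CNN models, as stated in the overview.
RGB_BASELINE = input_elements(224, 224, 3)


def channels_after_pruning(total_channels, pruned_fraction):
    """Channels that remain after pruning a given fraction of them."""
    return round(total_channels * (1.0 - pruned_fraction))


def normalized_input_size(num_freq_channels, spatial=56):
    """DCT input size relative to the RGB baseline (assumed 56x56 maps)."""
    return input_elements(spatial, spatial, num_freq_channels) / RGB_BASELINE


kept = channels_after_pruning(192, 0.875)   # 24 channels remain
full_ratio = normalized_input_size(192)     # 4.0 times the RGB baseline
pruned_ratio = normalized_input_size(kept)  # 0.5 times the RGB baseline
```

Under these assumptions, pruning 87.5% of the 192 channels leaves 24 channels, and a 56×56×24 input holds half as many values as a 224×224×3 RGB image.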

The contributions of this research are as follows:

We propose a method of learning in the frequency domain (using DCT coefficients as input), which requires little modification to the existing CNN models that take RGB input. We validate our method on ResNet-50 and MobileNetV2 for the image classification task and Mask R-CNN for the instance segmentation task.
We show that learning in the frequency domain better preserves image information in the pre-processing stage than the conventional spatial downsampling approach (spatially resizing the images to 224×224, the default input size of most CNN models) and consequently achieves improved accuracy, i.e., +1.60% on ResNet-50 and +0.63% on MobileNetV2 for the ImageNet classification task, +0.8% on Mask R-CNN for both object detection and instance segmentation tasks.
We analyze the spectral bias from the frequency perspective and show that the CNN models are more sensitive to low-frequency channels than high-frequency channels, similar to the human visual system (HVS).
We propose a learning-based dynamic channel selection method to identify the trivial frequency components for static removal during inference. Experimental results on ResNet-50 show that one can prune up to 87.5% of the frequency channels using the proposed channel selection method with little or no accuracy degradation in the ImageNet classification task.
To the best of our knowledge, this is the first work that explores learning in the frequency domain for object detection and instance segmentation. Experiment results on Mask R-CNN show that learning in the frequency domain can achieve a 0.8% average precision improvement for the instance segmentation task on the COCO dataset.

Figure 1: (a) The workflow of the conventional CNN-based methods using RGB images as input. (b) The workflow of the proposed method using DCT coefficients as input. CB represents the required communication bandwidth between CPU and GPU/accelerator.

Figure 2: The data pre-processing pipeline for learning in the frequency domain.

Figure 3: Connecting the pre-processed input features in the frequency domain to ResNet-50. The three input layers (the dashed gray blocks) in a vanilla ResNet-50 are removed to admit the 56×56×64 DCT inputs. We take 64 channels as an example. This value can vary based on the channel selection. In learning-based channel selection, all 192 channels are analyzed for their importance to accuracy, based on which only a subset (<<192 channels) is used in the static selection approach.

Figure 4: The gate module that generates the binary decisions based on the features extracted by the SE-Block. The white color channels of Tensor 5 indicate the unselected channels.

Figure 5: A heat map visualization of input frequency channels on the ImageNet validation dataset for image classification and the COCO validation dataset for instance segmentation. The number in each square is the corresponding channel index. Colors from bright to dark indicate a low to high probability of a channel being selected.

Table 1: ResNet-50 classification results on ImageNet (validation). The input size of each method is normalized over the baseline ResNet-50. The input frequency channels are selected with the square or triangle channel selection pattern when the postfix S or T is specified, respectively.
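The caption names square and triangle selection patterns without defining them on this page. One plausible reading, sketched below purely as an assumption, is that both patterns pick low-frequency positions from the top-left corner of an 8×8 frequency grid: a square of a given side, or a triangle bounded by an anti-diagonal. The function names, the `side` and `size` parameters, and the row-major channel indexing are hypothetical.

```python
GRID = 8  # assumed 8x8 frequency grid per color component


def square_pattern(side, grid=GRID):
    """Channel indices in the top-left side x side square (assumed meaning
    of the S postfix). Index = vertical_freq * grid + horizontal_freq."""
    return [u * grid + v for u in range(side) for v in range(side)]


def triangle_pattern(size, grid=GRID):
    """Channel indices in the top-left triangle where the vertical and
    horizontal frequencies sum to less than `size` (assumed meaning of
    the T postfix)."""
    return [u * grid + v
            for u in range(grid) for v in range(grid)
            if u + v < size]


square_idx = square_pattern(2)      # [0, 1, 8, 9]
triangle_idx = triangle_pattern(3)  # [0, 1, 2, 8, 9, 16]
```

Both patterns favor low frequencies, in line with the spectral-bias observation above; how many channels each color component actually receives in the reported experiments is not specified here.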

Table 2: MobileNetV2 classification results on ImageNet (validation).

Table 3: Bbox AP results of Mask R-CNN using different backbones on COCO 2017 validation set. The baseline Mask R-CNN uses a ResNet-50-FPN as the backbone. The DCT method uses the frequency-domain ResNet-50-FPN as the backbone.

Table 4: Mask AP results of Mask R-CNN using different backbones on COCO 2017 validation set.

Figure 6: Examples of instance segmentation results on the COCO dataset.

Figure 7: Examples of instance segmentation results on the COCO dataset.

Table 5: Bbox AP results of Faster R-CNN using different backbones on COCO 2017 validation set. The baseline Mask R-CNN uses a ResNet-50-FPN as the backbone. The DCT method uses the frequency-domain ResNet-50-FPN as the backbone.

Publications

Conference Proceedings

K. Xu, M. Qin, F. Sun, Y. Wang, Y.-K. Chen, and F. Ren, “Learning in the Frequency Domain”, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, pp. 1740-1749, 2020.

Acknowledgements

The work by Arizona State University is supported by an NSF grant (IIS/CPS-1652038).