Scalable VLSI Architectures for Sparse Approximation

Project Overview

The digital electronics industry today relies on the Nyquist sampling theorem, which requires a sampling rate of at least twice the signal bandwidth, in effect doubling the size of the signal representation on the Fourier basis, to avoid information loss. However, most natural signals have a much sparser representation on some other, non-Fourier, orthogonal basis. This implies a large amount of redundancy in Nyquist-sampled data, making compression a necessity prior to storage or transmission. Recent advances in compressive sensing (CS) theory suggest an alternative data acquisition framework that can directly access the signal information in its sparse domain. Compared to the conventional Nyquist framework, the CS framework has several intrinsic advantages. First, random encoding is a universal compression method that applies effectively to all compressible signals regardless of their sparse domain. This is desirable for data fusion across multiple signal sources. Second, sampling and compression can be performed at the same stage in CS, allowing for a sampling rate that is significantly lower than the Nyquist rate. Therefore, CS has the potential to greatly impact data acquisition devices that are sensitive to cost, energy consumption, and portability, such as wireless sensor nodes (WSNs) in mobile and wearable applications.
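As a rough illustration of the random encoding step described above, the sketch below compresses a length-N signal into M < N measurements with a random ±1 (Bernoulli) matrix. The choice of a Bernoulli matrix, the 1/sqrt(M) scaling, the dimensions, and the function names are assumptions made for illustration; the text above only states that a random encoding is used. The toy signal is sparse directly in the sample domain, whereas real signals are generally sparse on some other basis.

```python
import random


def bernoulli_matrix(m, n, seed=0):
    """Return an m-by-n matrix with entries +/- 1/sqrt(m) (assumed encoder)."""
    rng = random.Random(seed)
    scale = 1.0 / m ** 0.5
    return [[scale * rng.choice((-1.0, 1.0)) for _ in range(n)] for _ in range(m)]


def encode(phi, x):
    """Compute the measurement vector y = phi * x."""
    return [sum(p * v for p, v in zip(row, x)) for row in phi]


# Toy signal of length 256 with only two nonzero samples.
n, m = 256, 64
x = [0.0] * n
x[10], x[100] = 1.0, -0.5

phi = bernoulli_matrix(m, n)
y = encode(phi, x)
print(len(x), "samples compressed to", len(y), "measurements")
```

The same matrix can encode any signal regardless of the basis on which it is sparse; only the reconstruction side needs to know that basis.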

In particular, CS is a promising solution for realizing on-body WSNs in pervasive health systems aimed at 24/7 health monitoring. Electrocardiogram (ECG), electromyogram (EMG), and electroencephalogram (EEG) signals (collectively referred to as ExG) contain critical information about the status of the human body and are therefore the main targets in health monitoring applications. As shown in Fig. 1, a CS-based wireless health monitoring system includes on-body WSNs that use a unified random encoding scheme to compress different physiological signals, reducing the data size for transmission (and thereby saving transmit energy), and a mobile data aggregator that performs real-time signal reconstruction to enable on-site analysis and processing for real-time applications. Such a system has numerous benefits. First, it brings the signal intelligence closer to the user for timely prediction and decision making. This is particularly important for real-time tasks such as arrhythmia and seizure detection, EMG-driven machine actuation, and brain-computer interfaces. Second, by reconstructing only the sparse coefficients of the original signal, the data size for on-site storage or transmission to the cloud can be further reduced. For practical use, the data aggregator should provide sufficient throughput to reconstruct >50 channels of physiological signals (sampled at 1kHz) in real time. Additionally, to minimize the overhead of adding such a function to a mobile device, the power consumption of the data aggregator should stay within 1% of a mobile device’s 2W power budget. This implies a sparse approximation (SA) engine that can support >50kS/s throughput in <20mW of power (see Fig. 1). It is also desirable to have flexibility in the sparsity parameters, the orthogonal basis, and the number of channels. Such a set of specifications imposes significant challenges on the hardware implementation.
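The throughput and power targets above follow from simple arithmetic on the stated requirements. The short sketch below reproduces it; the function and parameter names are hypothetical and chosen only for this illustration.

```python
def aggregator_targets(channels, sample_rate_hz, device_budget_w, budget_fraction):
    """Return (required throughput in samples/s, allowed power in watts)."""
    throughput = channels * sample_rate_hz
    power = device_budget_w * budget_fraction
    return throughput, power


# 50 channels sampled at 1 kHz, and 1% of a 2 W mobile power budget.
throughput, power = aggregator_targets(50, 1_000, 2.0, 0.01)
print(f"{throughput / 1e3:.0f} kS/s, {power * 1e3:.0f} mW")  # prints "50 kS/s, 20 mW"
```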

The first challenge is the complexity of SA algorithms. SA is an optimization problem that involves complex operations in an iterative process with intensive memory access. Compared to the orthogonal transformations used in the Nyquist framework, SA algorithms have greater computational complexity (CC) and higher data dependency (DD). The second challenge stems from the intricacies of physiological signals. ExG signals can span at least 3 orders of magnitude in both amplitude (10µV–10mV) and frequency (0.1Hz–500Hz) (see Fig. 1). In addition, because the signal sources differ in physiological activity, these signals can have sparse representations on completely different orthogonal bases. Furthermore, their sparsity varies over time with the subject’s activity. For the best reconstruction results, the hardware design must be able to handle a high dynamic range and flexible problem settings, such as reconstruction basis, error tolerance, signal and measurement dimensions, and signal sparsity level.
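To put the quoted ranges in more familiar terms, the amplitude span can be expressed in decibels and the frequency span in decades. This is only a restatement of the numbers given above, not an additional specification:

```latex
\[
20\log_{10}\!\left(\frac{10\,\mathrm{mV}}{10\,\mu\mathrm{V}}\right)
  = 20\log_{10}\!\left(10^{3}\right) = 60\ \mathrm{dB},
\qquad
\log_{10}\!\left(\frac{500\,\mathrm{Hz}}{0.1\,\mathrm{Hz}}\right)
  = \log_{10}(5000) \approx 3.7\ \text{decades}.
\]
```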

So far, there has been very limited work on, and demonstration of, dedicated SA solver chips. Application-specific integrated circuit (ASIC) implementations of three greedy algorithms were first presented for Long-Term Evolution (LTE) channel estimation in wireless communication applications. These implementations in 180nm CMOS feature a target throughput of 2kS/s with a power consumption of 88–209mW. A 65nm generic solver chip implementing the approximate message passing (AMP) algorithm was demonstrated for an audio restoration application. This chip achieves a target throughput of 397kS/s at a power consumption of 177.5mW for processing audio signals that have relatively low sparsity. Prior designs mainly focused on achieving the target throughput, with much less emphasis on power/energy and area efficiency. In addition, prior designs were optimized for a limited dynamic range and a fixed problem setting, making them unsuitable for bio-sensing applications.

In this project, we present a configurable and energy-efficient SA engine chip in 40nm CMOS that addresses the above challenges and makes CS technology accessible to mobile users. The chip testing results show a reconstruction throughput of 66–237kS/s and a power consumption of 12.8mW when operating at VDD=0.7V. This level of performance can support the simultaneous reconstruction of over 200 channels of compressively sampled ExG signals in real time while consuming <1% of a smartphone’s power budget. The high energy efficiency of our chip results from an algorithm-architecture co-design approach that tightly couples 1) algorithm reformulations that reduce the algorithm complexity by an order of magnitude; 2) a configurable system architecture that leads to nearly 100% utilization of computing resources; and 3) an efficient memory control scheme that cuts memory usage in half. The system architecture of the SA engine chip is optimized for mapping the orthogonal matching pursuit (OMP) algorithm and its variants. OMP is chosen because the human body is expected to have low activity on average, so ExG signals are highly sparse most of the time, especially when dynamic thresholding schemes are used; in this high-sparsity regime, OMP offers a better complexity-accuracy trade-off than other SA algorithms. The SA engine chip implements domain transformation by explicit matrix multiplication, thereby supporting signal reconstruction on arbitrary bases. Additionally, the SA engine adopts the single-precision floating-point data format to achieve a large dynamic range, and it can be configured at run time to handle flexible problem settings and accurately recover a wide range of physiological signals.
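For readers unfamiliar with OMP, the following is a minimal software sketch of the textbook algorithm, organized around the three tasks named in Fig. 3 (atom searching, least squares, estimation update). It is not the chip's reformulated algorithm or its hardware mapping. The function names, the absolute residual tolerance, the assumption of unit-norm dictionary columns, and the use of a plain Cholesky solve of the normal equations are all choices made for this illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def least_squares(cols, y):
    """Solve min ||y - sum_i c_i * cols[i]|| via Cholesky on the Gram matrix."""
    k = len(cols)
    gram = [[dot(cols[i], cols[j]) for j in range(k)] for i in range(k)]
    rhs = [dot(c, y) for c in cols]
    low = [[0.0] * k for _ in range(k)]  # gram = low * low^T
    for i in range(k):
        for j in range(i + 1):
            s = gram[i][j] - sum(low[i][p] * low[j][p] for p in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    z = [0.0] * k  # forward substitution: low * z = rhs
    for i in range(k):
        z[i] = (rhs[i] - sum(low[i][p] * z[p] for p in range(i))) / low[i][i]
    c = [0.0] * k  # backward substitution: low^T * c = z
    for i in reversed(range(k)):
        c[i] = (z[i] - sum(low[p][i] * c[p] for p in range(i + 1, k))) / low[i][i]
    return c


def omp(A, y, max_atoms, tol=1e-9):
    """Textbook OMP. A is a list of rows whose columns are assumed unit-norm.

    Returns a dict mapping selected column index to its coefficient.
    """
    m, n = len(A), len(A[0])
    columns = [[A[r][c] for r in range(m)] for c in range(n)]
    support, coeffs = [], []
    residual = list(y)
    for _ in range(max_atoms):
        if math.sqrt(dot(residual, residual)) <= tol:
            break
        # Atom searching: pick the unused column most correlated with the residual.
        best = max((j for j in range(n) if j not in support),
                   key=lambda j: abs(dot(columns[j], residual)))
        support.append(best)
        # Least squares: refit all selected coefficients against y.
        selected = [columns[j] for j in support]
        coeffs = least_squares(selected, y)
        # Estimation update: recompute the residual from the new estimate.
        residual = [y[r] - sum(c * col[r] for c, col in zip(coeffs, selected))
                    for r in range(m)]
    return dict(zip(support, coeffs))


# Tiny example: y is twice the third column, so OMP should select index 2.
A = [[1.0, 0.0, 0.6],
     [0.0, 1.0, 0.8]]
y = [1.2, 1.6]
result = omp(A, y, max_atoms=2)
print(result)
```

In this example the third column has the largest correlation with y, so it is selected first with a coefficient of approximately 2, after which the residual is essentially zero and the loop stops. Refitting all selected coefficients at every iteration is what distinguishes OMP from plain matching pursuit, and it is the least-squares step that the reformulations mentioned above aim to simplify.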

Fig. 1. a) A CS-based wireless health monitoring system with desired system requirements: on-body WSNs that utilize a unified random encoding scheme to compress data for low energy and a mobile data aggregator that performs real-time signal reconstruction for timely prediction and proactive prevention. To further reduce the data size for storage or processing, only the sparse coefficients of the signal are reconstructed. Reconstruction takes <1% of the mobile’s power, allowing 2-3x energy saving of the sensors. b) Amplitude and frequency characteristics of ExG signals.

Fig. 2. Complexity characteristics of the OMP algorithm. The reformulation techniques make OMP more energy-efficient for hardware implementation by simplifying the least-squares task.

Fig. 3. System architecture of the SA engine chip. Dynamic configuration of the system architecture in (a) the atom searching (AS), (b) the least squares (LS), and (c) the estimation update (EU) tasks.

Fig. 4. Block diagram of PE in the vector core and PE interconnections provided by IB.

Fig. 5. Block diagram of scalar core.

Fig. 6. Data mapping scheme of PE caches in (a) the mirror and (b) the shuffle mode for handling Cholesky factorization.

Fig. 7. Testing environment of the SA engine chip.

Fig. 8. Die photo and summary of the SA engine chip.

Fig. 9. Averaged RSNR performance of the SA engine chip for ExG signal reconstruction. The ECG, EEG, and EMG signals are reconstructed on the Haar DWT, DCT, and DWT-DCT joint basis, respectively.

Fig. 10. Examples of ExG signals reconstructed on the SA engine chip with a >15dB RSNR performance. The ECG, EEG, and EMG signals are reconstructed on the Haar DWT, DCT, and DWT-DCT joint basis, respectively.

Fig. 11. Measured power and operating frequency at different supply voltages. Measured throughput and energy efficiency of the SA engine chip.

Publications

Journal Article

F. Ren and D. Marković, “A Configurable 12–237 kS/s 12.8 mW Sparse-Approximation Engine for Mobile Data Aggregation of Compressively Sampled Physiological Signals”, IEEE Journal of Solid-State Circuits, vol. 51, no. 1, pp. 68–78, 2016.

F. Ren, W. Xu, and D. Marković, “Scalable and Parameterised VLSI Architecture for Efficient Sparse Approximation in FPGAs And SoCs”, IET Electronics Letters, vol. 49, no. 23, pp. 1440–1441, 2013.

Patent

D. Marković and F. Ren, “Scalable and Parameterized VLSI Architecture for Compressive Sensing Sparse Approximation”, US14/446,272, 2015.

Conference Proceedings

F. Ren and D. Marković, “A Configurable 12-to-237KS/s 12.8 mW Sparse-approximation Engine for Mobile ExG Data Aggregation”, Proceedings of the 2015 IEEE International Solid-State Circuits Conference (ISSCC). IEEE, pp. 68-78, 2015.

Thesis

F. Ren, “A Scalable VLSI Architecture for Real-Time and Energy-Efficient Sparse Approximation in Compressive Sensing Systems”, University of California, Los Angeles, Los Angeles, 2015.
