Gordon Cooper, Product Marketing Manager, Synopsys
Computing your way from a low-resolution, pixelated image to a smooth, high-resolution output image was an unsolvable problem before deep learning and neural network algorithms. Thanks to breakthroughs in algorithm development, images can now be zoomed in 8x or 16x with their resolution recreated. To perform this trick in real-time, you need a processor with a dedicated neural network accelerator, like Synopsys’s DesignWare® ARC® EV processor IP, to deliver the intense computational horsepower required without breaking power or area budgets. Image super-resolution techniques are being applied to applications including video surveillance, face recognition, medical diagnosis, digital TVs, gaming, multi-function printers, and remote sensing imaging.
We’ve all seen the old television or movie trope. The gritty detective has caught a break – they’ve got an image from a surveillance camera of the criminal driving away from the scene of the crime. But wait, when zooming in, the image is very blurry and pixelated. Turning toward the technician, the detective will say in his best gravelly voice, “Can you enhance the image?” After furiously pounding away on the keyboard for a few seconds (or in some iterations, letting it process for a few hours), the technician spins the monitor around to show a crisp higher resolution image that clearly shows the license plate of the getaway car. Case closed!
The problem is that image resolution just doesn’t work that way. It was an impossible request no matter how much time was available for image processing. State of the art techniques before deep learning allowed a level of image sharpening – but couldn’t correct a severely blurred image (or a very pixelated zoomed-in image).
Examples of image sharpening or resizing techniques can be found in programs like Adobe Photoshop, with three types in wide use: nearest-neighbor, bilinear, and bicubic interpolation.
Figure 1: The original 200x200 image (left) was scaled to a 50x50 image and then nearest-neighbor, bilinear and bicubic interpolation techniques were applied. While there is ‘sharpening’ of the 50x50 image, these computational techniques can’t recover the data lost from the original higher resolution image.
The limitation of all three of these techniques, however, is a theoretical concept called the Data Processing Inequality. In simple terms, the Data Processing Inequality states that you can’t add information to an image that is not already there. You can’t recover missing data as you zoom into an image. While these techniques can provide a modest amount of sharpening of the original pixelated image, they can’t help you scale up 8 or 16 times.
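The effect described above can be reproduced with a minimal NumPy sketch. It is illustrative only: the random array stands in for a detailed 200x200 image, the block-averaging downscaler is an assumption (the article does not say how the 50x50 image was produced), and the function names are hypothetical. Bicubic interpolation is omitted for brevity.

```python
import numpy as np

def downscale_box(img, factor):
    """Downscale by averaging non-overlapping factor x factor blocks."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def upscale_nearest(img, factor):
    """Nearest-neighbor upscaling: repeat each pixel `factor` times per axis."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

def upscale_bilinear(img, factor):
    """Bilinear upscaling: blend the four nearest source pixels."""
    h, w = img.shape
    # Map output pixel centers back to source coordinates, clamped at the edges.
    ys = np.clip((np.arange(h * factor) + 0.5) / factor - 0.5, 0, h - 1)
    xs = np.clip((np.arange(w * factor) + 0.5) / factor - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]  # vertical blend weights, shape (H, 1)
    wx = (xs - x0)[None, :]  # horizontal blend weights, shape (1, W)
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

rng = np.random.default_rng(0)
original = rng.random((200, 200))    # stand-in for a detailed 200x200 image
small = downscale_box(original, 4)   # 50x50 version
nearest = upscale_nearest(small, 4)  # back to 200x200
bilinear = upscale_bilinear(small, 4)

# Both results have the original size, but neither matches the original pixels.
err_nearest = float(np.mean((nearest - original) ** 2))
err_bilinear = float(np.mean((bilinear - original) ** 2))
print(err_nearest, err_bilinear)
```

Both upscaled images are the right size, yet the error against the original stays well above zero: the interpolators can only redistribute the 50x50 values they were given.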
Or at least that was true until the advent of deep learning and super-resolution neural networks.
If bicubic interpolation techniques are thwarted by the harsh realities of Data Processing Inequality, why would deep learning algorithms fare any better? Well, they cheat. The reason learning-based algorithms are ideal for the task of scaling images and recovering resolution is that they are ‘trained’ with large datasets (i.e., fed large amounts of annotated images that are used to calibrate the networks’ coefficients). The additional data needed to resurrect images is provided by a neural network’s training set – it doesn’t have to come from the original image. For example, if a neural network is trained to learn faces, it can then do a reasonable job of inserting data from its training set into the original image as it is scaled to high resolution.
The first deep learning method for single image super-resolution was Super-Resolution Convolutional Neural Network (SRCNN) in 2014. During training with SRCNN, the low-resolution input image is upscaled to the desired resolution using bicubic interpolation, and the result is then fed into a fairly shallow CNN (skipping pooling layers to keep the image resolution the same size). A Mean Squared Error (MSE) loss function is applied between the output and the original image. Over multiple training iterations, the loss is minimized as much as possible to produce the best output image. The results (Figure 2) show an impressive improvement in Peak Signal-to-Noise Ratio (PSNR) for SRCNN over the bicubic interpolation.
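For readers unfamiliar with the two metrics, here is a small sketch of how MSE and PSNR are commonly computed. The 8-bit peak value of 255 and the function names are assumptions made for illustration; this is not code from the SRCNN paper.

```python
import numpy as np

def mse(reference, estimate):
    """Mean squared error between two images of the same shape."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    return float(np.mean((reference - estimate) ** 2))

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    err = mse(reference, estimate)
    if err == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / err))

reference = np.full((4, 4), 100.0)
off_by_one = reference + 1.0  # every pixel differs by one gray level
print(mse(reference, off_by_one), psnr(reference, off_by_one))
```

Because PSNR is a logarithmic function of MSE, a network trained to minimize MSE is also being trained to maximize PSNR.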
Figure 2: SRCNN shows significant improvement for super-resolution over bicubic interpolation after just a few training iterations. (Source: Image Super-Resolution Using Deep Convolutional Networks. https://arxiv.org/abs/1501.00092)
After SRCNN blazed the trail for using CNNs for super-resolution, numerous other methods improved on the concept. Fast Super-Resolution Convolutional Neural Network (FSRCNN), in 2016, replaced the bicubic interpolation of SRCNN with more CNN layers and, combined with other techniques, created a faster solution with higher image quality. Very Deep network for Super-Resolution (VDSR), also in 2016, expanded SRCNN’s three layers to twenty to improve accuracy. Super-Resolution Residual Networks (SRResNets) built on residual network (ResNet) layers, an architecture published in late 2015 that scales up to 152 layers, to improve accuracy. All of these approaches provided high PSNRs (higher accuracy), but often missed high-frequency details and were considered not as pleasing to the eye.
In 2017, Super-Resolution Generative Adversarial Network (SRGAN) applied the concept of GANs, published in 2014, to super resolution with impressive results. GANs consist of two networks, a Generator and a Discriminator (Figure 3), which compete against each other to achieve ‘adversarial goals’. During network training, the Generator takes low resolution images as input and tries to create high resolution versions. The Discriminator network tries to determine if its own inputs are real high-resolution images or images upscaled by the Generator. Put another way, the Generator tries to pass off a counterfeit image while the Discriminator tries to catch it. This iterative process during training forces the Generator to improve its outputs.
After the training portion is completed, only the Generator network of SRGAN is needed for converting low resolution images to upscaled higher resolution images.
Figure 3. The Generator and Discriminator networks of SRGAN compete against each other during training.
A critical part of SRGAN is the use of a perceptual loss function that combines the pixel-wise MSE loss with the adversarial loss (based on the probability that the discriminator sees a natural image, not a generated image). Because of this, SRGAN’s output images don’t score as high a PSNR value but are considered more pleasing to the eye.
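A rough NumPy sketch of such a combined loss follows. The function names are hypothetical, and the discriminator outputs are supplied as plain probabilities rather than computed by a network. The 1e-3 weighting on the adversarial term follows the SRGAN paper; that paper also describes a content loss computed on VGG feature maps, which is left out here to match the pixel-wise description above.

```python
import numpy as np

def content_loss_mse(high_res, generated):
    """Pixel-wise MSE between the real image and the Generator output."""
    high_res = np.asarray(high_res, dtype=np.float64)
    generated = np.asarray(generated, dtype=np.float64)
    return float(np.mean((high_res - generated) ** 2))

def adversarial_loss(disc_prob_natural, eps=1e-12):
    """-log D(G(x)), averaged over the batch.

    disc_prob_natural: the Discriminator's probability that each generated
    image is a natural image. The loss is small when the Generator fools
    the Discriminator.
    """
    p = np.clip(np.asarray(disc_prob_natural, dtype=np.float64), eps, 1.0)
    return float(np.mean(-np.log(p)))

def perceptual_loss(high_res, generated, disc_prob_natural, adv_weight=1e-3):
    """Content loss plus a weighted adversarial loss (Generator objective)."""
    return (content_loss_mse(high_res, generated)
            + adv_weight * adversarial_loss(disc_prob_natural))

high_res = np.zeros((8, 8))
generated = np.full((8, 8), 0.1)
fooled = perceptual_loss(high_res, generated, [0.9])  # Discriminator fooled
caught = perceptual_loss(high_res, generated, [0.1])  # Discriminator not fooled
print(fooled, caught)
```

With identical pixel error, the Generator is penalized more when the Discriminator catches the counterfeit, which is what pushes it toward natural-looking detail rather than the lowest possible MSE.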
Figure 4. The high-resolution output of SRGAN has a lower PSNR than SRResNet and other SRCNN variants, but by applying a perceptual loss function, SRGAN produces results more pleasing to the eye. (Source: https://arxiv.org/pdf/1609.04802.pdf)
Research is now providing variants of SRGANs. Enhanced Super-Resolution Generative Adversarial Network (ESRGAN) has become popular in the gaming community and has been used to upscale vintage video games. Although trained on natural images, ESRGAN applied to a pixelated vintage video game can improve the quality of the graphics.
Another interesting use of super resolution is restoring and colorizing old movies. Super-resolution can scale up the frame rate and resolution, fill in missing data, reduce blurriness, and generate realistic colorizations of black and white movies.
Processing a single image, video games, or old movies for super-resolution offline does not require real-time performance. If time is not a concern, you can have CPUs or GPUs crunch away at the solution in the background. If an application calls for rendering and displaying an image in real-time – perhaps for gaming or augmented reality – then rendering in a lower resolution will save power and improve frames-per-second (fps) if super resolution can be used to upscale the image before displaying. Solving the problem of upscaling images or upscaling from a lower to a higher video resolution on the fly requires dedicated neural network solutions.
Synopsys’s DesignWare ARC EV family of processor IP provides scalable neural network solutions for a range of real-time super resolution needs. Synopsys’s EV architecture combines programmability and hardware optimizations to provide the fastest performance possible with the smallest area and power. The EV7x (Figure 5) combines a vision engine (512b vector DSP for single instruction, multiple data (SIMD) parallel processing) with a deep neural network accelerator that can scale from 880 to 3,520 multiplier–accumulator (MAC) units – the key building block for neural network implementations.
Figure 5. DesignWare EV7x Vision processor IP implements neural network algorithms in its deep neural network accelerator by moving portions of the input images, trained coefficients and intermediate feature maps from LPDDR5 into configurable internal memory.
The DesignWare EV7x Processor IP family is supported by the MetaWare EV (MWEV) software development toolkit, an integrated toolchain for compilation, debugging, optimization, simulation and mapping of the trained neural networks. The MWEV Neural Network Software Development Kit (NN SDK) takes super-resolution graphs like SRCNN, FSRCNN, and VDSR and maps them automatically onto the EV7x hardware for real-time operation.
Real-time implementations vary in performance requirements. A multi-function printer might need to sharpen images at five fps. However, upscaling a 30fps video to a 60fps video requires a lot more processing in a shorter amount of time. Requirements impacting processor performance include:
With these parameters, designers can determine which configuration of the deep neural network accelerator they need. Synopsys can collaborate with designers to provide the fps for different MAC configurations for a chosen super-resolution network based on the input resolution and bandwidth limitations. For example, a multi-function printer implementation might require only the smallest EV71 processor with an 880 MAC accelerator. A high-end gaming application to upscale video on-the-fly might require a larger processor like EV72 or EV74 and up to 3,520 MACs.
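A back-of-the-envelope version of this sizing exercise can be sketched as follows. It counts the multiply-accumulates of the base SRCNN configuration from the SRCNN paper (9x9, 1x1 and 5x5 kernels with 64 and 32 feature maps on a single channel) and divides the accelerator's peak MAC rate by that count. The 1 GHz clock, the 1920x1080 output size, and the 100% utilization default are assumptions chosen for illustration, not EV7x specifications, and the estimate ignores memory bandwidth entirely.

```python
def conv_macs_per_pixel(kernel, in_channels, out_channels):
    """MACs needed per output pixel for one convolution layer."""
    return kernel * kernel * in_channels * out_channels

# (kernel size, input channels, output channels) for base SRCNN.
SRCNN_LAYERS = [(9, 1, 64), (1, 64, 32), (5, 32, 1)]

def macs_per_frame(layers, width, height):
    """Total MACs per frame; SRCNN runs at the output resolution."""
    per_pixel = sum(conv_macs_per_pixel(*layer) for layer in layers)
    return per_pixel * width * height

def peak_fps(layers, width, height, mac_units, clock_hz, utilization=1.0):
    """Upper bound on frames per second, assuming one MAC per unit per cycle."""
    macs_per_second = mac_units * clock_hz * utilization
    return macs_per_second / macs_per_frame(layers, width, height)

ASSUMED_CLOCK_HZ = 1.0e9  # illustrative assumption, not a product figure

for mac_units in (880, 3520):
    fps = peak_fps(SRCNN_LAYERS, 1920, 1080, mac_units, ASSUMED_CLOCK_HZ)
    print(f"{mac_units} MACs: at most {fps:.1f} fps at 1920x1080")
```

Real throughput will be lower than this bound, since utilization is rarely 100% and data movement competes with compute, which is why measured fps figures for a specific network and configuration are more useful than this kind of estimate.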
Power Requirements
In addition to performance, power is often a critical requirement in embedded systems. Power can be calculated once the requirements are known. However, simulating power accurately for neural networks is very difficult given their computational complexity and the time it takes to simulate all those calculations (multiple weeks of time might be needed). Synopsys uses an emulation-based model which is both quick and highly accurate to determine power for the EV7x hardware to implement a neural network.
Bit Resolution Requirements
Neural networks are often trained using 32-bit precision; however, this is overkill for neural network implementation. The EV7x’s DNN accelerator provides 8b resolution and up to 12b resolution when needed for better accuracy. The MWEV NN SDK will quantize the neural network to the chosen bit resolution in hardware. It is also possible to optimize the quantization/precision per layer. Wherever possible, 8b is used, with 12b reserved for layers where the extra accuracy is needed.
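The trade-off between 8b and 12b can be illustrated with a generic symmetric uniform quantizer. This is a textbook scheme used here as an assumption; it is not a description of how the MWEV NN SDK quantizes networks, and the function names and synthetic weights are hypothetical.

```python
import numpy as np

def quantize_symmetric(values, bits):
    """Map floats to signed integer codes using one scale for the whole tensor."""
    values = np.asarray(values, dtype=np.float64)
    qmax = 2 ** (bits - 1) - 1          # 127 for 8b, 2047 for 12b
    max_abs = float(np.max(np.abs(values)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = np.clip(np.round(values / scale), -qmax, qmax).astype(np.int32)
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float values from integer codes."""
    return codes.astype(np.float64) * scale

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.1, size=10_000)  # synthetic 32-bit style weights

errors = {}
for bits in (8, 12):
    codes, scale = quantize_symmetric(weights, bits)
    errors[bits] = float(np.max(np.abs(dequantize(codes, scale) - weights)))
    print(f"{bits}b: worst-case rounding error {errors[bits]:.2e}")
```

Each extra bit halves the step size, so the 12b version tracks the original weights roughly 16 times more closely than the 8b version, at the cost of wider multipliers and more data movement. That is the reason to spend the extra bits only on layers that need them.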
Super resolution neural networks will continue to evolve as researchers improve on today’s state of the art. As seen in the evolution of CNN classification networks, the initial focus will be on improving accuracy and then shift to improving algorithm efficiency. The overall goal will be to get the most accuracy with the minimum amount of computation and data movement for resizing / upscaling a lower resolution image to a higher resolution image that is pleasing to the human eye. Because the DesignWare ARC EV7x processor IP family is programmable, it can evolve with the constant pace of research into super resolution networks.