Enhancing Computer Vision with Deep-Learning Models

Gordon Cooper

Feb 28, 2023 / 4 min read

Is that a dog in the middle of the street? Or an empty box? If you’re riding in a self-driving car, you’ll want the object detection and collision avoidance systems to correctly identify what might be on the road ahead and direct the vehicle accordingly. Inside modern vehicles, deep learning models play an integral role in the cars’ computer vision processing applications.

With cameras becoming pervasive in so many systems, cars aren’t the only ones taking advantage of AI-driven computer vision technology. Mobile phones, security systems, and camera-based digital personal assistants are just a few examples of camera-based devices that are already using neural networks to enhance image quality and accuracy.

While the computer vision application landscape has traditionally been dominated by convolutional neural networks (CNNs), a new algorithm type—initially developed for natural language processing tasks such as translation and question answering—is starting to make inroads: transformers. Transformers, deep-learning models that process all input data simultaneously, likely won’t completely replace CNNs but will be used alongside them to enhance the accuracy of vision processing applications.

Transformers have been in the news lately thanks to ChatGPT, a transformer-based chatbot launched in November 2022 by OpenAI. While ChatGPT is a server-based transformer with 175 billion parameters, you’ll learn in this blog post why transformers are also well suited to embedded computer vision. Read on for insights into how transformers are changing the direction of deep-learning architectures and for techniques to optimize the implementation of these models for the best results.


Attention-Based Networks Deliver Benefit of Contextual Awareness

For more than 10 years now, CNNs have been the go-to deep-learning model for vision processing. As they’ve evolved, CNNs have been applied with high accuracy to image classification, object detection, semantic segmentation (grouping or labeling every pixel in an image), and panoptic segmentation (identifying object locations as well as grouping or labeling every pixel in every object). However, with no modification to the transformer architecture other than swapping word tokens for image patches, transformers have shown they can beat CNNs in accuracy.

It was 2017 when the Google Research team shared an article introducing the transformer, defining it as “a novel neural network architecture based on a self-attention mechanism that we believe to be particularly well suited for language understanding.” Fast-forward to 2020, when Google Research scientists published an article on the vision transformer (ViT), a model based on the original transformer architecture. According to the article, the ViT “demonstrates excellent performance when trained on sufficient data, outperforming a comparable state-of-the-art CNN with four times fewer computational resources.” Indeed, these transformers, which need to be trained with very large data sets, showed how adept they are at vision tasks such as image classification and object detection.
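The core idea behind the ViT is simple: treat an image the way a language transformer treats a sentence. The sketch below (a hypothetical illustration, not the actual ViT code) shows how an image is cut into fixed-size patches, each flattened into a vector that plays the role a word token plays in NLP.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened (patch*patch*C) token vectors."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    rows = image.reshape(h // patch, patch, w // patch, patch, c)
    # Reorder so each patch is contiguous in memory, then flatten each one.
    return rows.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

# A 224x224 RGB image with 16x16 patches yields 196 tokens of length 768,
# the sequence shape used by the original ViT configuration.
tokens = patchify(np.zeros((224, 224, 3)), patch=16)
print(tokens.shape)  # (196, 768)
```

In the real model, each flattened patch is then linearly projected to the embedding dimension and given a position embedding before entering the transformer.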

Since transformers can understand context, they’re good at learning complex patterns for accurate object detection.

A key aspect of transformers that aids their proficiency in vision applications is their attention mechanism, which enables the models to understand context. Like a CNN, a transformer can detect that the object on the road ahead is an injured dog, not a cardboard box. It does so by devoting more focus to the small but important parts of the data, the item in the road, and less to the pixels that represent the rest of the road. In other words, not all pixels are treated equally, which makes transformers better than CNNs at learning complex patterns (CNNs typically process a frame of data without knowing what came before or after). As research and development continues, transformer model sizes are now similar to CNN model sizes.
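The "not all pixels are treated equally" behavior comes from scaled dot-product attention. A minimal numpy sketch, assuming the query, key, and value matrices are given (in a real model they are learned projections of the patch embeddings):

```python
import numpy as np

def attention(q: np.ndarray, k: np.ndarray, v: np.ndarray):
    """Scaled dot-product attention over a sequence of token vectors."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                   # pairwise token similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ v, weights                     # weighted mix of value vectors

# Three toy tokens: each row of `weights` shows how much that token attends
# to every other token, so salient regions can dominate the mix.
rng = np.random.default_rng(0)
q = k = v = rng.normal(size=(3, 4))
out, weights = attention(q, k, v)
print(np.allclose(weights.sum(axis=1), 1.0))  # True
```

Because every token attends to every other token, the model sees the whole frame's context at once rather than a local receptive field.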

While the frames-per-second performance depends on the hardware upon which the models are run, CNNs tend to perform at faster rates than transformers, which require many more computations. However, transformers are poised to catch up. GPUs can support both, but for real-world applications that need the highest performance in the smallest area with the least power, dedicated AI accelerators (like NPUs or neural processing units) are a better option.

For greater inference efficiency, a vision processing application can utilize both CNNs and transformers. Full visual perception calls for knowledge that may not be easily acquired by a vision-only model. Multi-modal learning provides a deeper understanding of visual information. Also, attention-based networks like transformers are well suited to applications that integrate multiple sensors, such as automotive.

Optimizing Performance of Transformers and CNNs with NPU IP

Transformers consist of a handful of operations:

  • Matrix multiplication
  • Element-wise addition
  • Softmax mathematical function
  • L2 normalization
  • Activation functions
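A single transformer encoder layer composes essentially these operations and nothing else. The sketch below is a hedged illustration with random placeholder weights and layer normalization standing in for the normalization step, not any vendor's kernel layout:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layernorm(h):
    return (h - h.mean(-1, keepdims=True)) / (h.std(-1, keepdims=True) + 1e-5)

rng = np.random.default_rng(1)
T, D = 4, 8                                   # 4 tokens, 8-dim embeddings
x = rng.normal(size=(T, D))
Wq, Wk, Wv, Wo = (rng.normal(size=(D, D)) * 0.1 for _ in range(4))

# Matrix multiplication + softmax: the attention sublayer.
attn = softmax((x @ Wq) @ (x @ Wk).T / np.sqrt(D)) @ (x @ Wv) @ Wo
x = layernorm(x + attn)                       # element-wise add + normalization

# Feed-forward sublayer: more matrix multiplies plus an activation
# function (ReLU here for brevity; GELU is common in practice).
W1 = rng.normal(size=(D, 4 * D)) * 0.1
W2 = rng.normal(size=(4 * D, D)) * 0.1
x = layernorm(x + np.maximum(x @ W1, 0) @ W2)
print(x.shape)  # (4, 8)
```

This small operation set is why an accelerator with fast matrix-matrix units plus a tensor unit for softmax, normalization, and activations can serve both CNNs and transformers.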

While most current AI accelerators are optimized for CNNs, not all of them are ideal for transformers. Transformers require substantial computational power for their voluminous matrix calculations and for their attention mechanism. The Synopsys ARC® NPX6 NPU IP is an example of an AI accelerator that can handle both CNNs and transformers. The ARC NPX6 NPU IP’s computation units include a convolution accelerator for the matrix-matrix multiplications essential to both deep-learning models, as well as a tensor accelerator for transformer operations and activation functions. The IP delivers up to 3,500 TOPS performance and industry-leading power efficiency of up to 30 TOPS/Watt. Design teams can also accelerate their application software development with the Synopsys MetaWare MX Development Toolkit, which provides a comprehensive software programming environment including a neural network software development kit and support for virtual models.


Natural language processing applications have enjoyed the computational prowess of transformers for several years. Now, real-time vision processing applications are getting in on the action, taking advantage of the attention-based network’s capacity for providing contextual awareness for greater accuracy. From smartphones to security systems and cars, camera-based products are growing increasingly adept at delivering high-quality images. Adding transformers to the deep-learning infrastructure of embedded vision camera systems will only give rise to even sharper images and more accurate object detection.
