Deep learning transformer models have surged in popularity over the past few years, and many researchers now consider them among the most effective tools available. Transformers work by attending to the parts of the input that are most relevant to producing the desired output. They have already proven themselves in services such as language translation and speech recognition, but one area where they had yet to take hold is computer vision. That is the gap researchers are now working to close: increasing the efficiency of computer vision through new transformer models.

In a post on ai.facebook.com, Facebook AI explains how it is bringing transformers to computer vision. The post begins: “To help bridge this gap, we are releasing Detection Transformers (DETR), an important new approach to object detection and panoptic segmentation. DETR completely changes the architecture compared with previous object detection systems. It is the first object detection framework to successfully integrate Transformers as a central building block in the detection pipeline.

DETR matches the performance of state-of-the-art methods, such as the well-established and highly optimized Faster R-CNN baseline on the challenging COCO object detection dataset, while also greatly simplifying and streamlining the architecture.

DETR offers a simpler, more flexible pipeline architecture that requires fewer heuristics. Inference can be boiled down to 50 lines of simple Python code using elementary architectural blocks. Moreover, because Transformers have proven to be a powerful tool for dramatically improving the performance of models in other domains, we believe additional performance gains and improved training efficiency will be possible with additional tuning.

We are providing the source code as well as pre-trained models in PyTorch here.”
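Those pretrained models can be loaded straight from PyTorch Hub. As a quick illustration, here is roughly what inference looks like with the published detr_resnet50 entry point (the image path and the 0.9 confidence threshold below are our own choices for the example):

```python
import torch
import torchvision.transforms as T
from PIL import Image

# Load the pretrained DETR model (ResNet-50 backbone) from PyTorch Hub.
model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
model.eval()

# Standard ImageNet normalization, as used in the DETR demo notebooks.
transform = T.Compose([
    T.Resize(800),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

img = Image.open('beach.jpg')          # hypothetical input image
inputs = transform(img).unsqueeze(0)   # add a batch dimension

with torch.no_grad():
    outputs = model(inputs)

# outputs['pred_logits']: class scores for each of the 100 object queries
# outputs['pred_boxes']:  predicted boxes as normalized (cx, cy, w, h)
probs = outputs['pred_logits'].softmax(-1)[0, :, :-1]  # drop the "no object" class
keep = probs.max(-1).values > 0.9                      # illustrative threshold
print(outputs['pred_boxes'][0, keep])
```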

Reframing the task of object detection

DETR casts the object detection task as an image-to-set problem. Given an image, the model must predict an unordered set (or list) of all the objects present, each represented by its class, along with a tight bounding box surrounding each one.
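Concretely, the “set” the model must produce is nothing more than class labels paired with boxes. A hypothetical target for a single image might look like this (the class ids and coordinates are illustrative, not taken from a real dataset):

```python
# Hypothetical target for one image: an unordered set of labeled boxes.
# Boxes use normalized (center_x, center_y, width, height) coordinates,
# the convention used in the DETR paper.
target = {
    "labels": [1, 1, 42],   # e.g. two people and a surfboard (illustrative ids)
    "boxes": [
        [0.52, 0.61, 0.10, 0.35],
        [0.71, 0.58, 0.08, 0.30],
        [0.54, 0.80, 0.12, 0.06],
    ],
}
```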

This formulation is particularly suitable for Transformers. We chain a convolutional neural network (CNN), which extracts the local information from the image, with a Transformer encoder-decoder architecture, which reasons about the image as a whole and then generates the predictions.
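For a sense of how those two pieces fit together, here is a condensed sketch in the spirit of the simplified demo from the DETR paper: a ResNet-50 backbone feeding a standard PyTorch transformer, with linear heads for classes and boxes. It omits many details of the full model and is meant only to show the shape of the pipeline:

```python
import torch
from torch import nn
from torchvision.models import resnet50

class MiniDETR(nn.Module):
    """Condensed DETR-style detector: CNN backbone + transformer + two heads.
    A sketch following the paper's simplified demo, not the full model."""

    def __init__(self, num_classes, hidden_dim=256, nheads=8, num_layers=6):
        super().__init__()
        backbone = resnet50(pretrained=True)  # older torchvision API
        # Keep everything up to the final conv feature map (drop pooling + fc).
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.conv = nn.Conv2d(2048, hidden_dim, 1)   # project to transformer width
        self.transformer = nn.Transformer(hidden_dim, nheads, num_layers, num_layers)
        self.linear_class = nn.Linear(hidden_dim, num_classes + 1)  # +1 "no object"
        self.linear_bbox = nn.Linear(hidden_dim, 4)
        self.query_pos = nn.Parameter(torch.rand(100, hidden_dim))  # object queries
        # Learned 2D positional encodings, split between rows and columns.
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))

    def forward(self, inputs):
        x = self.conv(self.backbone(inputs))          # (B, hidden_dim, H, W)
        H, W = x.shape[-2:]
        pos = torch.cat([
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)         # (H*W, 1, hidden_dim)
        src = pos + x.flatten(2).permute(2, 0, 1)     # image features as a sequence
        tgt = self.query_pos.unsqueeze(1).repeat(1, inputs.shape[0], 1)
        h = self.transformer(src, tgt).transpose(0, 1)  # (B, 100, hidden_dim)
        return {'pred_logits': self.linear_class(h),
                'pred_boxes': self.linear_bbox(h).sigmoid()}
```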

Traditional computer vision models typically use a complex, partly handcrafted pipeline that relies on custom layers in order to localize objects in an image and then extract features. DETR replaces this with a simpler neural network that offers a true end-to-end deep learning solution to the problem.

The DETR framework consists of a set-based global loss, which forces unique predictions via bipartite matching, and a Transformer encoder-decoder architecture. Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel. Previous attempts to use architectures such as recurrent neural networks for object detection were much slower and less effective, because they made predictions sequentially rather than in parallel.
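The bipartite matching at the heart of that loss can be sketched with SciPy's Hungarian-algorithm solver. The cost below combines only classification probability and L1 box distance; the actual DETR cost also includes a generalized-IoU term:

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """One-to-one matching between object queries and ground-truth objects.
    Simplified sketch of DETR's bipartite matching step."""
    probs = pred_logits.softmax(-1)                     # (num_queries, num_classes+1)
    # Cost is low when a query assigns high probability to the true class
    # and its box is close (in L1 distance) to the ground-truth box.
    cost_class = -probs[:, gt_labels]                   # (num_queries, num_gt)
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)  # (num_queries, num_gt)
    cost = cost_class + cost_bbox
    query_idx, gt_idx = linear_sum_assignment(cost.detach().numpy())
    return query_idx, gt_idx  # matched pairs; unmatched queries mean "no object"
```

Once queries and ground-truth objects are paired one-to-one, the classification and box losses are computed on the matched pairs, and every unmatched query is trained to predict “no object.”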

Transformers’ self-attention mechanisms allow DETR to perform global reasoning on the image as well as on the specific objects that are predicted. For example, the model may look at other regions of the image to help make a decision about the object in a bounding box. It can also make predictions based on relationships or correlations between objects in an image. If DETR predicts that an image contains a person standing on the beach, for example, it knows that a partially occluded object is more likely to be a surfboard. In contrast, other detection models predict each object in isolation.
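The mechanism behind this global reasoning can be illustrated with PyTorch's built-in multi-head attention: every position in the flattened feature map attends to every other position, which is what lets context like the person and the beach inform the representation of the occluded surfboard (the sequence length and dimensions below are arbitrary):

```python
import torch
from torch import nn

# Self-attention over a sequence of image-feature vectors: each position
# produces weights over *all* positions, not just a local neighborhood.
attn = nn.MultiheadAttention(embed_dim=256, num_heads=8)

features = torch.rand(600, 1, 256)  # e.g. a 20x30 feature map flattened to 600 tokens
out, weights = attn(features, features, features)  # query = key = value

print(weights.shape)  # (1, 600, 600): one weight for every pair of positions
```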

Increasing the efficiency of computer vision with transformers

We also demonstrate that this pipeline can be extended to related tasks such as panoptic segmentation, which aims at segmenting distinct foreground objects while simultaneously labeling all the pixels from the background. DETR treats foreground items, such as animals or people, and background items, such as sky or grass, in a truly unified manner.”

Check out how aiXplain.com can help with your computer vision needs!
