Real-time object detection lies at the heart of any system that must interpret visual data efficiently, from video analytics pipelines to autonomous robotics. Detector architectures for these tasks need to deliver both high throughput and high accuracy.
In our own pipelines, we phased out older CNN-based detectors in favor of D-Fine, a more recent model from the DEtection TRansformer (DETR) family. Transformer-based detectors have matured quickly, and D-Fine in particular provides stronger accuracy while maintaining competitive inference speed.
Our office dog Nala sitting on a chair, as detected by our own D-Fine model in the DM vision library.
YOLO has long been the leading standard for real-time detection, but the latest DETR variants are now consistently proving to be the better alternative. Beyond the accuracy gains, an equally important advantage is the far more permissive license that comes with them.
YOLO’s licensing issue
The YOLO series is developed and maintained by Ultralytics. All YOLO code and weights are released under the AGPL-3.0 license. Long story short, this license only allows commercial usage under the strict condition that any code modifications and model weights are made publicly available. In contrast, all DETR models to date have been released under the Apache 2.0 license, which allows free use and modification for commercial and proprietary purposes.

Next to licensing, there are other reasons why we like working with DETRs:
- DETRs treat object detection as a direct set-prediction problem. This eliminates hand-crafted components such as non-maximum suppression that introduce additional hyperparameters and slow down the detection pipeline.
- Modern GPU architectures are heavily optimized for efficient attention operations such as flash attention, making transformers increasingly suitable for real-time applications (see the sketch after this list).
- Transfer learning from vision foundation models such as the recent DINOv3 fundamentally augments the capabilities of DETRs.
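To make the attention point concrete, here is a minimal PyTorch snippet (shapes chosen arbitrarily) using the fused scaled-dot-product attention API, which can dispatch to FlashAttention-style kernels on supported GPUs:

```python
import torch
import torch.nn.functional as F

# Fused scaled-dot-product attention in PyTorch; on supported GPUs this call
# can dispatch to FlashAttention-style kernels. Shapes: (batch, heads, tokens, head_dim).
q = torch.randn(1, 8, 1024, 64)
k = torch.randn(1, 8, 1024, 64)
v = torch.randn(1, 8, 1024, 64)

out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```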
We have had nothing but great experiences with DETRs so far. They adapt remarkably well to new datasets, even when trained from scratch. For the right use cases, pre-training the models on datasets such as COCO and Objects365 further boosts performance. About time for a post on this exciting topic!
A short overview of what you can expect from the remainder of this blog post. We will:
- dive in detail into the original DETR paper to understand its core concepts;
- discuss the most important advancements leading to the real-time adoption of DETRs;
- compare two leading DETR models to the latest YOLO11 model to draw some important conclusions.
Let’s go!
DETR: transformer for NMS-free object detection
All Detection Transformer architectures have the same underlying structure. A (CNN) backbone is used to extract image features. These features are fed to a transformer encoder-decoder structure that predicts accurate bounding boxes for objects in the image. The resulting N decoder output embeddings are independently projected to bounding box coordinates and class labels.
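To make that structure concrete, the sketch below shows a heavily simplified PyTorch version of this skeleton. It leaves out positional encodings, auxiliary decoder outputs, and many other details from the paper, and every name and shape is illustrative rather than taken from a released codebase:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class MinimalDETR(nn.Module):
    """Bare-bones DETR-style detector: CNN backbone -> transformer -> N predictions.
    A simplified illustration, not a faithful re-implementation of any released model."""

    def __init__(self, num_classes: int, num_queries: int = 100, hidden_dim: int = 256):
        super().__init__()
        backbone = resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])   # drop avgpool + fc
        self.input_proj = nn.Conv2d(2048, hidden_dim, kernel_size=1)     # 2048 channels -> hidden_dim
        self.transformer = nn.Transformer(hidden_dim, nhead=8,
                                          num_encoder_layers=6, num_decoder_layers=6)
        self.query_embed = nn.Embedding(num_queries, hidden_dim)         # N learned object queries
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)         # +1 for the "no-object" class
        self.bbox_head = nn.Linear(hidden_dim, 4)                        # (cx, cy, w, h), normalized

    def forward(self, images: torch.Tensor):
        feats = self.input_proj(self.backbone(images))                   # (B, D, H', W')
        b = feats.shape[0]
        src = feats.flatten(2).permute(2, 0, 1)                          # (H'*W', B, D) encoder tokens
        tgt = self.query_embed.weight.unsqueeze(1).repeat(1, b, 1)       # (N, B, D) object queries
        hs = self.transformer(src, tgt)                                  # (N, B, D) decoder outputs
        return self.class_head(hs), self.bbox_head(hs).sigmoid()         # class logits and boxes

logits, boxes = MinimalDETR(num_classes=91)(torch.randn(1, 3, 640, 640))
print(logits.shape, boxes.shape)  # torch.Size([100, 1, 92]) torch.Size([100, 1, 4])
```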
High-level overview of the DETR architecture, adapted from the original paper (2020).
Why transformers?
Intuitively, the encoder in DETR transforms the dense backbone features into a semantically structured representation of the image that captures relationships between regions through global self-attention.
The transformer decoder takes a fixed set of N learned object queries, each representing a potential object slot. It then iteratively refines these to produce final bounding boxes and class predictions. It does this through two attention operations:
- Self-attention among the queries, enabling them to model interactions and avoid duplicate detections (e.g., two queries focusing on the same object).
- Cross-attention between the queries and the encoder’s output features, allowing each query to attend to the most relevant parts of the image and extract the corresponding visual evidence.
Attention layer in the DETR decoder. The output embedding of the cross-attention module serves as the content query for the next layer. The output features of the encoder are the key and value for cross-attention. The positional query is learnable and shared over self-attention and cross-attention in all layers.
Through the clever use of attention in the decoder, DETR replaces traditional components like anchor boxes and non-maximum suppression with a fully end-to-end transformer-based detection process.
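The sketch below captures this query-refinement step in simplified PyTorch. Normalization, feed-forward blocks, dropout, and the positional encodings added to the encoder features are all omitted, and the class and argument names are our own:

```python
import torch
import torch.nn as nn

class SketchDecoderLayer(nn.Module):
    """Illustrative DETR-style decoder layer: self-attention among the object queries,
    followed by cross-attention to the encoder features. Heavily simplified."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, content, pos_query, memory):
        # content:   (B, N, D) content queries, refined layer by layer
        # pos_query: (B, N, D) learned positional queries, shared across all layers
        # memory:    (B, HW, D) encoder output features (keys and values for cross-attention)
        q = k = content + pos_query
        content = content + self.self_attn(q, k, content)[0]                          # queries interact, suppressing duplicates
        content = content + self.cross_attn(content + pos_query, memory, memory)[0]   # gather visual evidence
        return content  # serves as the content query for the next decoder layer

layer = SketchDecoderLayer()
out = layer(torch.randn(2, 100, 256), torch.randn(2, 100, 256), torch.randn(2, 80 * 80, 256))
print(out.shape)  # torch.Size([2, 100, 256])
```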
Direct set prediction
DETR reframes object detection as a direct set-prediction problem. Given an image, it predicts a fixed set of N bounding boxes corresponding to the object queries. Because N typically exceeds the number of actual objects, many predictions correspond to a special “no-object” class and are discarded at inference. During training, the Hungarian algorithm performs bipartite matching between predicted and ground-truth boxes, ensuring each ground-truth box is paired with exactly one prediction in a permutation-invariant way. The loss is then computed on these matched pairs.
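A toy example of that matching step, using SciPy’s Hungarian solver; the cost values below are made up, whereas DETR’s actual matching cost combines classification scores with L1 and generalized-IoU box terms:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix: rows are 5 predictions, columns are 2 ground-truth boxes
# (lower cost = better match; the numbers are made up for illustration).
cost = np.array([
    [0.9, 0.2],   # prediction 0 matches ground truth 1 well
    [0.1, 0.8],   # prediction 1 matches ground truth 0 well
    [0.7, 0.7],
    [0.6, 0.9],
    [0.8, 0.3],
])

pred_idx, gt_idx = linear_sum_assignment(cost)        # Hungarian algorithm
for p, g in zip(pred_idx, gt_idx):
    print(f"prediction {p} <-> ground truth {g}")     # 0 <-> 1 and 1 <-> 0
# The unmatched predictions (2, 3, 4) are supervised towards the "no-object" class.
```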
Overcoming DETR’s shortcomings
Despite its elegance and powerful prediction paradigm, slow training convergence and poor performance on small objects limited DETR’s adoption in practical systems early on. Over the years, several enhancements drastically improved the performance of Detection Transformers:
- Deformable DETR introduced deformable attention, an efficient multi-scale attention mechanism tailored to the task of object detection.
- The authors of Efficient DETR were the first to use top-k query selection for better initialization of the decoder’s object queries (a minimal sketch follows after this list).
- DN-DETR drastically improved training convergence with an auxiliary task of denoising ground-truth bounding boxes.
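Here is the promised minimal sketch of top-k query selection: instead of starting the decoder from purely learned queries, the k highest-scoring encoder tokens are used to initialize them. The function name and shapes are ours, not taken from any of the papers’ code:

```python
import torch

def select_topk_queries(encoder_tokens: torch.Tensor,   # (B, HW, D) encoder output
                        scores: torch.Tensor,            # (B, HW) per-token objectness score
                        k: int = 300) -> torch.Tensor:
    """Pick the k highest-scoring encoder tokens as initial decoder queries."""
    _, topk_idx = scores.topk(k, dim=1)                                   # (B, k)
    idx = topk_idx.unsqueeze(-1).expand(-1, -1, encoder_tokens.size(-1))  # (B, k, D)
    return encoder_tokens.gather(1, idx)                                  # (B, k, D)

tokens = torch.randn(2, 80 * 80, 256)   # dummy encoder output for an 80x80 feature map
scores = torch.randn(2, 80 * 80)        # dummy objectness scores
print(select_topk_queries(tokens, scores).shape)  # torch.Size([2, 300, 256])
```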
DETR evolution over time. Real-time variants arose from 2024 onwards in two families: the RT-DETR family indicated in blue, and the LW-DETR family indicated in purple.
Real-time transformer object detection
From 2024 onwards, DETRs really started to challenge YOLO in real-time detection, eventually surpassing it in accuracy while remaining competitive in speed and efficiency. Two schools of thought currently compete for the state of the art:
- RT-DETR (real-time DETR) sticks to the original DETR architecture and focuses on optimizing the encoder and the initialization of the object queries. D-Fine currently leads this family with a heavily optimized training strategy centered on the decoder. Very recently, DEIMv2 extends it further by integrating DINOv3 features in its backbone.
- LW-DETR (light-weight DETR) adopts a simpler idea: replace the traditional CNN backbone and encoder with a pure Vision Transformer (ViT). RF-DETR (Roboflow DETR) leverages this especially well by starting from a pretrained DINOv2 encoder.
Work on Detection Transformers is very much alive: DEIMv2 was released less than two months ago, while Roboflow put their paper on RF-DETR on arXiv just last week!
Object detection performance
How do these advancements translate into performance benchmarks? The figure below summarizes the performance of YOLO11, D-Fine, and RF-DETR at the relevant model sizes on the well-known COCO dataset (right after the takeaways, we sketch how such COCO mAP numbers are typically computed).
Performance comparison between leading model architectures for their corresponding nano (N), small (S), medium (M), and large (L) variants. Indicative latency measures for each model size indicated between brackets. *Not pretrained on Objects365 dataset **RF-DETR L is not released yet
Some important take-aways from these numbers:
- Both D-Fine and RF-DETR clearly outperform YOLO11 across all sizes.
- RF-DETR’s smaller models stand out, with the nano variant outperforming the others by a wide margin. This is likely because RF-DETR-N already benefits from a strong DINOv2 backbone.
- D-Fine’s performance scales the best with model size, with the large variant scoring a whopping 57.4 mAP.
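As promised, here is a sketch of how such COCO mAP numbers are typically produced with pycocotools, assuming you have exported your model’s detections in COCO JSON format:

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# File names are placeholders: ground-truth annotations plus your model's detections,
# the latter as a COCO-format list of {image_id, category_id, bbox, score} entries.
coco_gt = COCO("annotations/instances_val2017.json")
coco_dt = coco_gt.loadRes("detections.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # prints AP@[.50:.95] (the "mAP" in the figure), AP50, AP75, ...
```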
Parameter count
So, RF-DETR for small, very fast models and D-Fine when things get more complex? There is another side to the story. To finish off this post, I’d like to highlight an important difference between D-Fine and RF-DETR. For that, let’s take a look at the following figure:
Model sizes in millions of parameters for YOLO11, D-Fine and RF-DETR for their corresponding nano (N), small (S), medium (M) and large (L) variants. YOLO11 shows the best downward trend across model sizes, with D-Fine close behind.
One of the first things to stand out is that D-Fine and YOLO11 become significantly lighter as their model sizes shrink, while RF-DETR’s parameter count declines by only around 5 million. This somewhat surprising observation results from the fact that RF-DETR was trained with a technique called Neural Architecture Search (NAS). NAS automatically finds network architectures that are Pareto-optimal for the accuracy-latency trade-off.
Interestingly, the “small” RF-DETR architectures found by NAS end up only slightly lighter than the “large” variants. RF-DETR model sizes thus reflect speed rather than parameter count. D-Fine’s model sizes, by contrast, are on par with YOLO11’s, making it the more versatile DETR architecture, one that can be adapted to a wide range of scenarios, including resource-constrained edge environments.
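If you want to verify such parameter counts for a PyTorch model you have loaded locally, a small generic helper is enough (loading D-Fine or RF-DETR itself follows each repository’s own instructions):

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> float:
    """Total number of parameters, in millions (the unit used in the figure above)."""
    return sum(p.numel() for p in model.parameters()) / 1e6

# Example with a stand-in module; replace with your instantiated detector.
print(f"{count_parameters(nn.Linear(2048, 256)):.2f}M parameters")  # 0.52M
```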
Conclusion
Real-time Detection Transformers represent one of the most significant recent shifts in computer vision. Their rapid evolution shows how transformers have become not only viable but actually preferred in scenarios that demand both high speed and high accuracy, even on resource-constrained hardware. Just as important, their Apache 2.0 license makes them easy to use, enabling practical adoption beyond academic benchmarks.
D-Fine and RF-DETR have set the new standard for real-time object detection moving forward. D-Fine scales best across speed, accuracy, and model size. The small RF-DETR variants are remarkably accurate and fast for their size, but the bigger models fall short of D-Fine when evaluated on the well-known COCO dataset. The field keeps changing rapidly, however, so we’ll keep tracking progress on both to make the best possible choice for every problem.
If you’re working on demanding detection problems where accuracy, robustness, and efficiency matter, we can help. We tailor DETR-based models to your specific application, integrate them in video processing pipelines, and set up continuous improvement loops to ensure performance keeps rising as new data comes in. Reach out; we’d be excited to turn cutting-edge Detection Transformer research into real, production-grade impact for your system.