Breaking Down YOLO: How Real Time Object Detection Works Step by Step
Last Updated on October 28, 2025 by Editorial Team
Author(s): Abinaya Subramaniam
Originally published on Towards AI.
Object detection is one of the most interesting areas of computer vision. It is the process of identifying and locating objects in an image. Popular examples include detecting cars on a road, identifying products in a store, or recognizing people in a crowd. Among the many techniques available, one model stands out for its exceptional speed and accuracy. This model is called YOLO, which stands for You Only Look Once.

YOLO became famous because it can detect objects in real time. To understand how it works, we first need to explore the problem and how older methods solved it.
Why YOLO Was Needed
Before YOLO was introduced, object detection models relied on a multi-stage pipeline. Popular examples include R-CNN and its later improvements. These models first used separate algorithms to search for many possible regions where an object might exist. Each of these candidate regions was then cropped from the original image and passed into a classifier to estimate what object it might contain. Finally, another step adjusted the bounding boxes to better fit the objects.
While this approach worked reasonably well, it had three major drawbacks. First, it was slow because the model had to analyze hundreds or even thousands of proposed regions individually. Second, the process was quite complicated because it required several different models and hand-designed components working together. Third, it demanded heavy processing power, making it unsuitable for real time tasks such as video analysis, autonomous driving, or live surveillance.
YOLO changed this approach completely. Rather than proposing regions and examining them one by one, YOLO treats detection as a single prediction problem. It examines the entire image in a single forward pass through the network and directly predicts object locations and classes. This design greatly reduces computation time, simplifies the architecture, and allows the model to operate fast enough for real time applications.
YOLO Treats Detection as a Single Problem
Instead of separating object localization and classification into different tasks, YOLO treats them as one combined regression problem. The model learns to take an image and directly output bounding boxes, object confidence scores, and class predictions.
This simple design makes the model faster and easier to train.
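Because everything comes out of one network, the entire prediction fits in a single output tensor. A quick sketch of its size, using the classic formulation S x S x (B * 5 + C) from the original YOLO paper (S = 7 grid cells per side, B = 2 boxes per cell, C = 20 classes — one configuration, not the only one):

```python
# Sketch: size of YOLO's single output tensor for one image, assuming the
# original paper's configuration (S=7, B=2, C=20). Other versions differ.

S = 7    # grid is S x S cells
B = 2    # bounding boxes predicted per cell
C = 20   # object classes in the training set

values_per_cell = B * 5 + C          # 5 numbers per box, plus class scores
output_size = S * S * values_per_cell

print(values_per_cell)  # -> 30
print(output_size)      # -> 1470
```

One forward pass produces all 1470 numbers at once, which is exactly what makes the "single prediction problem" framing possible.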
Dividing the Image into a Grid
One of the most important ideas in YOLO is how it divides the image into a grid. When an image enters the model, YOLO splits it into an evenly spaced layout of rows and columns. For example, a common configuration is a 13 by 13 grid, meaning the image is divided into 169 equal sections. Each section is a grid cell.

This grid system helps YOLO assign responsibility for detection. Instead of allowing every part of the model to react to every object, each grid cell only needs to focus on the objects whose center point falls inside its boundaries. If the center of a dog is located in the upper left cell, that cell becomes responsible for predicting the box around the dog and identifying its category.
This approach allows YOLO to examine the entire image at once while still keeping track of where objects are located. It also helps the model detect multiple objects at different positions because each cell looks at its own local region. By doing this, YOLO can process many parts of the image at the same time, rather than scanning small pieces one after another like older methods.
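The center-to-cell assignment described above can be sketched in a few lines. The 416-pixel input and 13 by 13 grid are one common configuration, and the function name is illustrative:

```python
# Sketch: mapping an object's center point to its responsible grid cell.
# A 416-pixel square input and a 13x13 grid are assumed for illustration.

def responsible_cell(center_x, center_y, image_size=416, grid_size=13):
    """Return (row, col) of the grid cell containing the object's center."""
    cell_size = image_size / grid_size   # 32 pixels per cell for 416 / 13
    col = int(center_x // cell_size)
    row = int(center_y // cell_size)
    return row, col

# A dog centered near the upper left corner falls in cell (0, 0),
# so that cell is responsible for predicting its box and class.
print(responsible_cell(20, 25))    # -> (0, 0)
print(responsible_cell(208, 208))  # -> (6, 6)
```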
Predicting Bounding Boxes
Once the image is divided into a grid, each cell predicts a set of values that describe possible objects within its area. These values form the bounding box prediction. A bounding box is a rectangle drawn tightly around the edges of an object in an image.
For each bounding box it proposes (a cell can propose more than one), a grid cell predicts the following five values:
- The x coordinate of the bounding box center, measured relative to the boundaries of the cell
- The y coordinate of the bounding box center, also relative to the cell
- The width of the bounding box, measured relative to the total width of the image
- The height of the bounding box, measured relative to the total height of the image
- A confidence score
The first four values tell the model where the box should be drawn and how large it should appear. Because these values are relative, they remain consistent even if the image size changes.
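Turning these relative values back into pixel coordinates is simple arithmetic. The sketch below assumes (x, y) are offsets within the cell and (w, h) are fractions of the image, as described above, with an illustrative 416-pixel input and 13 by 13 grid:

```python
# Sketch: decoding a cell's relative box prediction into pixel coordinates.
# (x, y) are offsets inside the cell in [0, 1]; (w, h) are fractions of
# the full image. The image and grid sizes are illustrative.

def decode_box(row, col, x, y, w, h, image_size=416, grid_size=13):
    cell_size = image_size / grid_size
    center_x = (col + x) * cell_size   # cell offset -> absolute pixels
    center_y = (row + y) * cell_size
    box_w = w * image_size             # image-relative -> pixels
    box_h = h * image_size
    # Return corner coordinates (x_min, y_min, x_max, y_max).
    return (center_x - box_w / 2, center_y - box_h / 2,
            center_x + box_w / 2, center_y + box_h / 2)

# A box centered in the middle of cell (6, 6), a quarter of the image
# wide and tall:
print(decode_box(6, 6, 0.5, 0.5, 0.25, 0.25))  # -> (156.0, 156.0, 260.0, 260.0)
```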

The confidence score ties these values together. It represents how certain the model is that there is an object inside that cell, and it also measures how accurately the bounding box matches the shape and position of the actual object. In the original formulation, this score is trained to equal the probability that an object is present multiplied by the overlap between the predicted box and the true box. A high confidence score means the model believes strongly that it has found an object and that the predicted box fits well. A low score means the model is unsure or believes nothing important is present in that region.
By predicting bounding boxes across all grid cells at once, YOLO can identify many objects of different sizes and shapes that appear throughout the image. This design is what makes it capable of real time detection without slowing down.
Class Prediction
Besides predicting the position and size of a bounding box, each grid cell in YOLO also predicts the category, or class, of the object that might be located within its boundaries. These classes come from the dataset used to train the model. Depending on the training set, they can include everyday objects such as person, bicycle, dog, cat, car, or even more specialized items found in industrial or medical environments.
The model does not simply choose one class immediately. Instead, it assigns a probability to each possible category. For example, a grid cell might predict that the object in its area has a 70 percent chance of being a dog, a 20 percent chance of being a cat, and a 10 percent chance of being a rabbit. These probabilities reflect how confident the model is about its classification based on what it has learned during training.
To finalize the prediction, YOLO multiplies the class probability by the confidence score of the bounding box. The confidence score represents the model’s certainty that an object truly exists in that cell and that the bounding box fits it correctly. By combining these two values, the model calculates a final score for each category. This ensures that a class is only considered strong if both the bounding box is accurate and the category prediction is reliable.
Higher scores indicate that the model is more confident both about the object’s presence and about the type of object it is. This combined idea allows YOLO to filter out uncertain detections and focus on the most meaningful ones.
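The class-score combination above can be sketched directly. The numbers mirror the dog/cat/rabbit example; the class names and confidence value are illustrative:

```python
# Sketch: combining per-class probabilities with the box confidence score,
# as described above. Values are illustrative, not from a real model.

class_probs = {"dog": 0.7, "cat": 0.2, "rabbit": 0.1}
box_confidence = 0.9  # model's certainty that the box contains an object

# Final score for each class = class probability * box confidence.
final_scores = {name: p * box_confidence for name, p in class_probs.items()}
best_class = max(final_scores, key=final_scores.get)

for name, score in final_scores.items():
    print(name, round(score, 2))  # dog 0.63, cat 0.18, rabbit 0.09
print(best_class)                 # -> dog
```

A class only ends up with a strong final score when both factors are high, which is exactly the filtering behavior the combined score is designed to give.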
Anchor Boxes
Objects can have different shapes. For example, a person is usually tall, while a car is wider. To handle this variation, YOLO uses predefined bounding box shapes known as anchor boxes. Instead of predicting any random shape, the model adjusts these anchors to better match the object. This improves accuracy and makes learning easier.
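One common way anchors are adjusted, in the style of YOLOv2/v3, is for the network to predict log-space scale offsets that stretch or shrink a predefined anchor. A minimal sketch, with illustrative anchor sizes and offsets:

```python
import math

# Sketch of anchor-based box sizing in the YOLOv2/v3 style: the network
# predicts offsets (t_w, t_h) that scale a predefined anchor via exp().
# Anchor dimensions and offset values below are illustrative.

def adjust_anchor(anchor_w, anchor_h, t_w, t_h):
    """Scale an anchor box by the network's predicted log-space offsets."""
    return anchor_w * math.exp(t_w), anchor_h * math.exp(t_h)

# A tall "person-like" anchor, nudged slightly by small offsets:
print(adjust_anchor(40, 110, 0.05, -0.02))
# A wide "car-like" anchor, stretched noticeably wider:
print(adjust_anchor(120, 60, 0.4, 0.0))
```

Because the model only learns small corrections to sensible starting shapes, training is easier than regressing arbitrary box dimensions from scratch.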
Removing Duplicate Predictions with Non-Maximum Suppression
Sometimes the model predicts multiple bounding boxes around the same object. To fix this, YOLO uses a method called Non-Maximum Suppression, often shortened to NMS.
NMS keeps the box with the highest confidence score and removes the others that overlap too much. This ensures that each object in the image is detected only once.
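A minimal sketch of the NMS procedure, using corner-coordinate boxes and a common 0.5 overlap threshold (the example boxes and scores are illustrative):

```python
# Sketch of Non-Maximum Suppression. Boxes are (x_min, y_min, x_max, y_max)
# tuples with a confidence score each; 0.5 is a common overlap threshold.

def iou(a, b):
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)           # highest-scoring remaining box wins
        keep.append(best)
        order = [i for i in order     # discard boxes overlapping it too much
                 if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = [(10, 10, 60, 60), (12, 12, 62, 62), (100, 100, 150, 150)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # -> [0, 2]: the near-duplicate second box is dropped
```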
The Role of the Neural Network
At the core of the YOLO model lies a deep convolutional neural network, often referred to as the backbone. This network is responsible for learning and extracting important visual features from the input image. When an image enters the network, it first passes through several convolutional layers. These layers detect simple features such as edges, corners, and color patterns. As the image moves deeper into the network, the layers begin recognizing more complex structures, including textures, object parts, and eventually entire shapes. This gradual progression allows the network to build a detailed understanding of what might be present in different regions of the image.
Pooling layers or other downsampling operations are commonly used to reduce the resolution of the image representation as it travels through the network. This step helps the model focus on meaningful patterns while reducing unnecessary detail. It also lowers computational cost, allowing YOLO to run more efficiently. By the time the data reaches deeper layers, the model has converted the original image into a compact representation of high-level features that describe the objects within it.
After feature extraction, the processed information flows into a set of prediction layers. These layers are specifically designed to produce the outputs needed for object detection. They generate bounding box coordinates, confidence scores, and class probabilities for each grid cell. The bounding box coordinates determine where the object is located. The confidence score determines whether an object truly exists in that region. The class probabilities estimate which type of object it might be.
This unified design, where a single network handles both feature extraction and prediction, is one of the main reasons YOLO achieves real time speed. Instead of using separate models for detection and classification, YOLO combines everything into one integrated architecture. This efficiency allows the model to process images rapidly while still maintaining reliable accuracy, making it highly suitable for real world applications where speed is essential.
Evaluating YOLO Model Predictions
Evaluating an object detection model such as YOLO requires more than simply checking whether it identifies the presence of objects. It must also be judged on how accurately it places bounding boxes around those objects and how correctly it classifies them. To measure these factors, researchers use well-established evaluation metrics that compare the model’s predictions against ground truth data. Ground truth refers to the manually annotated boxes and labels created by humans, which act as the correct answers for the dataset.
One of the most important metrics used is Intersection over Union, commonly written as IoU. IoU measures how well the predicted bounding box overlaps with the actual bounding box. It does this by dividing the area where the two boxes overlap by the total area covered by both boxes combined. If the overlap is large, the IoU score will be high. Many evaluation standards consider a prediction correct only if its IoU is above a certain threshold, such as 0.5. This prevents the model from receiving credit for bounding boxes that are placed loosely or inaccurately.
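The IoU computation described above, for two axis-aligned boxes in corner coordinates (the example boxes are illustrative):

```python
# Sketch: Intersection over Union for two boxes given as
# (x_min, y_min, x_max, y_max). Example coordinates are illustrative.

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)      # overlap / union

predicted = (50, 50, 150, 150)
ground_truth = (60, 60, 160, 160)
score = iou(predicted, ground_truth)

print(round(score, 3))  # -> 0.681
print(score > 0.5)      # -> True: counted as correct at a 0.5 threshold
```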
Another major evaluation metric is mean Average Precision, often abbreviated as mAP. Unlike simple accuracy measurements, mAP takes both precision and recall into account. Precision measures how many of the model’s detections are correct. If the model predicts a large number of bounding boxes that do not match any real object, precision will decrease. Recall reflects how many of the actual objects in the image were successfully detected. A model that misses many real objects will have a lower recall score. mAP summarizes the relationship between precision and recall at various confidence levels and averages the results across multiple object classes, providing a single score that represents overall performance.
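Precision and recall follow directly from the detection counts just described. A worked example with illustrative numbers (the counts are made up for the sketch):

```python
# Sketch: precision and recall from detection counts, as defined above.
# Illustrative scenario: 10 real objects, 12 detections, 8 of which match.

true_positives = 8    # detections that match a real object
false_positives = 4   # detections that match nothing
false_negatives = 2   # real objects the model missed

precision = true_positives / (true_positives + false_positives)  # 8 / 12
recall = true_positives / (true_positives + false_negatives)     # 8 / 10

print(round(precision, 3))  # -> 0.667
print(recall)               # -> 0.8
```

mAP then sweeps this trade-off across confidence thresholds and averages over classes, which is why it is reported as a single headline number.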
Together, IoU and mAP offer a detailed view of how well YOLO operates. IoU focuses on spatial accuracy, ensuring the bounding boxes are drawn tightly and correctly. mAP captures how reliably the model identifies and classifies objects. By studying these metrics, researchers can compare versions of YOLO, evaluate improvements, and determine how well the model is suited for real world tasks, including autonomous driving, surveillance, or robotics.
Why YOLO Is Fast
There are several reasons for YOLO’s impressive speed:
- It processes the full image in a single pass
- It avoids region proposal steps
- Later versions use a fully convolutional design
- Predictions are parallelized
All of these design choices allow YOLO to run at real time speeds, even on live video.
Strengths of YOLO
YOLO offers several advantages:
- Extremely fast performance suitable for live applications
- Simple and unified architecture
- Good accuracy on common objects
- Efficient computation
These strengths make YOLO popular in fields such as autonomous driving, traffic monitoring, robotics, and security systems.
Limitations of YOLO
Although YOLO works very well in many situations, it also has some limitations:
- It can struggle with detecting very small objects
- It may perform poorly when objects are crowded
- Its accuracy sometimes depends heavily on good anchor design
However, newer versions of YOLO have gradually reduced many of these weaknesses.
Real World Applications
YOLO is used in many practical scenarios including:
- Identifying vehicles in traffic cameras
- Counting people in crowded places
- Detecting wild animals for conservation
- Organizing products in warehouses
- Monitoring movements in sports
The ability to process video frames quickly makes YOLO an excellent choice for dynamic environments.
Summary
YOLO revolutionized object detection by simplifying the entire process into a single prediction step. By dividing images into grids, predicting bounding boxes with predefined anchors, and removing duplicates with Non-Maximum Suppression, it achieves both speed and reliable accuracy.
Rather than processing small patches separately, YOLO analyzes the entire image at once. This unique approach allows it to detect objects efficiently in real time, which is why it remains one of the most influential models in computer vision today.
Note: Content contains the views of the contributing authors and not Towards AI.