The purpose of the model is to count vehicles on Indian roads with good accuracy and speed. The input to the model comprises videos captured by specialized IoT devices affixed at suitable viewing heights and angles on the roads to be analysed. A video consists of a stream of frames, each of which is a still image. A typical video contains 24-30 frames per second (at this pace, individual frames are indistinguishable to the human eye and appear as a continuum). To analyse the objects in a video, an object detection algorithm is typically applied to all the image frames, or to a subset of them. The relevant information is then stored, processed, and used to obtain a final answer. In our case, a pre-trained convolutional neural network named YOLO v3 is employed to detect different classes of vehicles such as cars, buses, and motorcycles.
Deep learning is a subset of machine learning that uses neural networks and their variants with a large number of layers to capture and learn complex features of the training data set. A convolutional neural network (CNN) is a deep learning model that is particularly adept at extracting and learning features from images. It uses a long sequence of mathematical operations called convolutions, in combination with other suitable functions, arranged together in the form of layers. Given a large amount of training data, a CNN with sufficiently many layers can learn to detect and classify objects belonging to thousands of different categories, and can often learn robust, context-independent features characteristic of an object or class of objects. CNNs come in many flavours, each suited to a different task. For object detection, some popular CNN-based models are the region-proposal family (R-CNN, Fast R-CNN, and Faster R-CNN), the Single Shot MultiBox Detector (SSD), and You Only Look Once (YOLO).
Object detection on an image is, by itself, a computationally intensive procedure, since it involves applying a deep convolutional neural network to several regions of the image, localizing each object (separating it from the background and other objects), and predicting its class. (This is considerably more complex than object classification, in which each image contains a single object in its foreground, occupying most of the image area, and the task is only to predict the category of that object.) Thus, when object detection algorithms are applied to videos, which typically contain thousands of image frames, it is extremely hard to achieve desirable processing speeds. This is a major reason why we use YOLO v3, which has been built for real-time object detection and is the state of the art in terms of speed. Even so, the task necessitates the use of a parallel computing device called a GPU (Graphics Processing Unit).
In order to count objects in a video, it is insufficient merely to detect and classify them correctly. Since the same objects are likely to appear in multiple image frames of the video, and thus to be detected multiple times, it is necessary to tag each detected object with an identity to ensure that it is counted exactly once. This task cannot be accomplished with an object detection model, since such a model works independently on individual image frames. This is where object trackers become necessary. An object tracking algorithm follows a detected object as it moves through a video and re-identifies it after it has moved to a new location. Typically, object tracking is not as computationally intensive as a deep convolutional network, and can be applied repeatedly at fixed, short time intervals.
The model employs a convolutional neural network called YOLO v3 for vehicle detection and classification. YOLO v3, originally built on the DarkNet platform, is imported for use with a TensorFlow backend via the ImageAI library, built by Moses Olafenwa and John Olafenwa.
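As a concrete illustration, detections returned by ImageAI's `detectObjectsFromImage` arrive as dictionaries containing a class name, a confidence percentage, and box coordinates. The helper below is our own illustrative sketch, not part of ImageAI or of the model's actual code: it filters such a list down to vehicle classes above a confidence threshold. The class names follow COCO labels, and the 50% cutoff is an assumption.

```python
from typing import Dict, List

# COCO class names for the vehicle categories of interest.
VEHICLE_CLASSES = {"car", "bus", "truck", "motorcycle"}

def keep_vehicles(detections: List[Dict], min_confidence: float = 50.0) -> List[Dict]:
    """Filter ImageAI-style detection dictionaries down to vehicles.

    Each detection is expected to look like:
    {"name": "car", "percentage_probability": 91.3, "box_points": [x1, y1, x2, y2]}
    """
    return [
        d for d in detections
        if d["name"] in VEHICLE_CLASSES and d["percentage_probability"] >= min_confidence
    ]
```

Only the detections surviving this filter would be handed on to the tracking stage described below.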
The centroid of each detected object is fed into a centroid tracker, and the bounding box into the correlation tracker. In the frames where detection does not take place, the correlation tracker tracks the previously detected vehicles. Before each detection pass, the centroid tracker is updated with the coordinates returned by the correlation tracker in the most recent frame. Finally, tracked objects, each of which carries a unique identifying number, are counted as they cross a threshold. The algorithm can also distinguish between the two directions of vehicle movement using the recorded history of object centroids.
YOLO (You Only Look Once) is a popular object detection model introduced by Redmon et al. (2016). We use YOLO v3, an improvement on the original by Redmon & Farhadi, pre-trained on the COCO (Common Objects in Context) dataset. Traditionally, object detection is modelled as a multi-region classification problem: a classifier for different objects is applied repeatedly to different regions of the image in order to locate objects. The R-CNN was a significant improvement over this computationally intensive procedure: it first generates region proposals and then runs the classifier only on regions likely to contain objects. YOLO, however, does not break object detection into separate steps; it models detection as a single regression problem, simultaneously predicting bounding box coordinates and class probabilities.
YOLO works by dividing the image into a grid of cells and predicting bounding boxes and confidence scores for each cell. The entire algorithm is a single forward pass of a convolutional neural network, along with some pre-processing and post-processing of the bounding box predictions. Unlike detectors based on sliding windows or region proposals, YOLO models view the image in its entirety, and may thus learn important contextual features from the background, other objects in the image, and so on.
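The grid idea can be sketched in a few lines: the cell whose bounds contain an object's centre is the one responsible for predicting its box. The grid size of 13 below corresponds to YOLO v3's coarsest output scale at a 416x416 input; the function itself is purely illustrative and not taken from any YOLO implementation.

```python
def responsible_cell(cx: float, cy: float, img_w: int, img_h: int, s: int = 13):
    """Return the (row, col) of the s x s grid cell containing a box centre (cx, cy)."""
    col = min(int(cx / img_w * s), s - 1)  # clamp in case the centre lies exactly on the right edge
    row = min(int(cy / img_h * s), s - 1)  # likewise for the bottom edge
    return row, col
```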
One important question that arises for object detectors is how to measure their performance. Classification models are easily evaluated, since their errors are discrete and take only two values (depending on whether the prediction is right or wrong), and simple regression models may be evaluated by comparing predicted and actual values. Since YOLO models treat object detection as a regression problem, it is important to choose a suitable evaluation metric. The output of an object detection algorithm is a set of bounding boxes and the probabilities assigned to each class for each object. The error measure must reflect both the correctness of the predicted class (i.e. the one with maximum probability) and the “deviation” between the predicted and real boxes. More precisely, the accuracy of the prediction is measured in terms of the degree of overlap of the boxes. This overlap is suitably measured by the Intersection over Union (IoU) metric, which is simply the area of the intersection between the predicted and real boxes divided by the area of their union. During training, this number is provided as feedback to the CNN and is optimized through backpropagation. During testing, it is used only to calculate the confidence scores.
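The IoU of two axis-aligned boxes given as (x1, y1, x2, y2) corners can be computed directly:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes, each given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle (may be empty).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes give an IoU of 1, disjoint boxes give 0, and partial overlap lies in between.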
Apart from detecting and classifying vehicles correctly, a counting model requires each vehicle to be tagged with a unique ID, so that it is counted only once. An object detection model cannot achieve this, as it treats every frame as independent of the others. For instance, if it sees a car in the first frame of the video and again in the tenth frame, after the car has moved some distance, its function is to correctly identify the object as a car and return its location in the respective frames. However, it cannot tell whether the car in the tenth frame is the same car it identified in the first frame. In order to give the model a “memory”, trackers must be used. This model uses two types of trackers, which perform different functions and work in conjunction with each other.
As the object detection algorithm runs on the video, most objects will be detected more than once. For instance, if the object detection algorithm runs once every second, and a car is in the camera’s view for 10 seconds, the car would ideally be detected ten times in ten different frames. (Even in a non-ideal situation where it is sometimes missed by the detector, it would still be detected many more times than once.) The function of the centroid tracker is to ensure that multiple detections of the same object are identified with each other, so that the same object is not counted multiple times.
To do this, the centroid tracker uses distances. At every detection pass, newly detected objects are matched to objects that have already been detected and are still in view: all pairwise distances between new and existing centroids are computed, and each new object is identified with the nearest previously detected object, provided this minimum distance falls below a set threshold. Newly detected objects left unmatched by this procedure are assigned new identities. The algorithm can also mark previously detected objects as temporarily disappeared and, after a fixed interval, deregister them once they have permanently left the device’s view.
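A minimal sketch of this matching step, assuming a maximum-distance threshold of 50 pixels (an illustrative value) and omitting the disappearance and deregistration bookkeeping:

```python
import math

def match_detections(old, new, max_dist=50.0):
    """Greedily match new centroids to existing tracked centroids.

    old: {object_id: (x, y)} for objects already being tracked.
    new: list of (x, y) centroids from the current detection pass.
    Pairs are considered in order of increasing distance; a pair is
    matched only if its distance is below max_dist. Returns the matches
    (index in `new` -> existing object id) and the indices of unmatched
    new detections, which would be assigned fresh identities.
    """
    pairs = sorted(
        (math.dist(c, n), oid, j)
        for oid, c in old.items()
        for j, n in enumerate(new)
    )
    matches, used_old, used_new = {}, set(), set()
    for d, oid, j in pairs:
        if d > max_dist:
            break  # all remaining pairs are even farther apart
        if oid in used_old or j in used_new:
            continue  # each object and each detection is matched at most once
        matches[j] = oid
        used_old.add(oid)
        used_new.add(j)
    unmatched = [j for j in range(len(new)) if j not in used_new]
    return matches, unmatched
```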
The algorithm thus relies on the fact that the movement of objects in a video is continuous: after a few frames, the object nearest to the old position of object X is most likely to be object X itself, and not some new object. This assumption may fail in the case of haphazard movement or densely packed objects, but for vehicular traffic it works well enough in most situations.
The correlation tracker performs a function almost complementary to that of the centroid tracker. While the centroid tracker plays its role whenever detections take place, the correlation tracker comes into action in the intervals during which no detection takes place. As noted earlier, a video typically has 24-30 frames per second, so it is unnecessary and wasteful to perform object detection on every frame. Typically, a fixed number of frames is skipped between consecutive detection passes, chosen according to the average duration for which an object remains in the camera’s view.
A correlation tracker takes object bounding boxes (rectangles suitably enclosing the object) as its input and tracks their movement through subsequent frames of the video. Using the initial object features, it creates a unique filter for each object to be tracked. The correlation tracker used in this model creates filters that are robust to changes in scale and orientation. In the following frames, the tracker uses this filter to selectively search for the object and obtain its new bounding box. It does so by applying an operation called correlation (similar to convolution) between the filter and different regions of the frame image. The filter is also updated at every stage, using the new appearance of the object once it has moved, in an “online” learning mode.
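The core correlation idea can be illustrated with a toy spatial-domain version: slide the object's appearance template over a small search window around its last known position and keep the best-scoring offset. Real correlation trackers (e.g. dlib's `correlation_tracker`) work in the Fourier domain, handle scale changes, and update the filter online; everything below is a simplified sketch with illustrative parameters.

```python
import numpy as np

def correlate_track(frame: np.ndarray, template: np.ndarray, top: int, left: int, search: int = 4):
    """Re-locate a template in a grayscale frame by maximising correlation.

    (top, left) is the object's last known position; the search window
    extends `search` pixels in each direction. Returns the best-matching
    (top, left) position of the template in this frame.
    """
    h, w = template.shape
    t = template - template.mean()  # mean-centred filter, a crude appearance model
    best, best_pos = -np.inf, (top, left)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + h > frame.shape[0] or x + w > frame.shape[1]:
                continue  # candidate window falls outside the frame
            score = float((frame[y:y + h, x:x + w] * t).sum())
            if score > best:
                best, best_pos = score, (y, x)
    return best_pos
```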
A correlation tracker, however, cannot by itself handle repeated detection passes. It picks up the initially detected objects and tracks them while they remain in view, but it cannot assign new identities to newly appearing objects. To handle this, the centroid tracker is applied whenever detections are performed: newly detected objects are assigned new identities, and their bounding boxes are fed into the correlation tracker. In this way, the two trackers work in conjunction and keep a clean record of all objects detected in the video.
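The interplay of the two stages can be summarised in a skeleton loop. All names here (`detect_fn`, `tracker_factory`, the skip interval) are illustrative placeholders rather than the model's actual code; the centroid-tracker update that assigns identities would sit inside the detection branch.

```python
def process_video(frames, detect_fn, tracker_factory, skip=24):
    """Skeleton of the detect/track loop combining both trackers.

    frames: iterable of video frames.
    detect_fn(frame) -> list of bounding boxes for that frame.
    tracker_factory(frame, box) -> an object with an update(frame) method
    returning the object's current box (e.g. a wrapper around a
    correlation tracker). Every `skip` frames we re-detect and re-seed
    the correlation trackers; in between we only track.
    """
    trackers = []
    all_boxes = []  # per-frame boxes, as they would be handed to the centroid tracker
    for i, frame in enumerate(frames):
        if i % skip == 0:
            boxes = detect_fn(frame)  # detection frame: run the CNN
            trackers = [tracker_factory(frame, b) for b in boxes]
        else:
            boxes = [t.update(frame) for t in trackers]  # tracking-only frame
        all_boxes.append(boxes)
    return all_boxes
```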
The correlation tracker is more computationally expensive than the centroid tracker, since the former involves computing image correlations (which, in turn, amounts approximately to many matrix multiplications), while the latter involves only a few distance calculations. The correlation tracker is one of the major hindrances to the model’s speed: applying YOLO v3 detections once per second allows near real-time speeds on a sufficiently powerful GPU, but the correlation operations are difficult to offload efficiently to a GPU, so overall speed suffers. One way around this problem is to do away with the correlation tracker and use only centroid tracking. However, detections must then be performed more often, and tracking is not nearly as reliable, so accuracy is compromised considerably. The best compromise is therefore to use both trackers in conjunction, run at an optimal interval determined empirically.
Once objects are correctly detected and assigned identities using trackers, they can easily be counted uniquely. In this model, a threshold line is set so that objects are counted only once they cross it, for the following reason. We are mainly interested in counting vehicles moving in a single direction: towards the camera. The object detector often struggles with far-away objects, which naturally appear smaller; for instance, far-off cars and motorcycles are more often confused with each other. To reduce these errors, objects are counted only once they are detected in nearer view. The tracking model ensures that each detected object adds to the count of its class (car, motorcycle, bus, etc.) exactly once.
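A sketch of the line-crossing count, assuming vehicles approaching the camera move downwards in the image (so a centroid crossing a horizontal line from above triggers the count); the per-object record layout is hypothetical:

```python
def update_count(counts, obj, centroid_y, line_y):
    """Count an object once when its centroid crosses a horizontal line.

    counts: maps class name -> running total.
    obj: per-object record, e.g. {"cls": "car", "counted": False, "prev_y": 90}.
    The crossing test compares the previous and current centroid heights,
    so a stationary vehicle sitting near the line is not recounted.
    """
    if not obj["counted"] and obj["prev_y"] < line_y <= centroid_y:
        counts[obj["cls"]] = counts.get(obj["cls"], 0) + 1
        obj["counted"] = True  # ensures this ID is never counted again
    obj["prev_y"] = centroid_y
```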
The average counting accuracy achieved still needs to be calculated exactly, but appears to be very good for all practical purposes. This accuracy will be calculated as the mean squared error between real counts (performed manually) and the model’s counts across different videos. The speed achieved is about 4 fps on an NVIDIA GeForce MX150, and 8 fps on an NVIDIA GeForce GTX 1080 Ti.
Advantages over the former model
An alternate vehicle counting model, which employed the Inception v3 model for object detection, had drawbacks in terms of both processing speed and counting accuracy. As already discussed, YOLO v3 has been built for real-time object detection and is the state of the art in terms of speed. In our case, it also gives more than satisfactory results in vehicle detection and labelling. The introduction of object tracking also addresses the previous model’s flaws in counting detected vehicles.
The Inception v3 model ran at about 0.25 frames per second on an NVIDIA GeForce GTX 1080 Ti, with object detection conducted on every 26th frame. This is explained by its large network architecture and the expensive regional convolution computations. YOLO v3 by itself runs at real-time speeds, given a powerful enough GPU. However, the new model is unable to achieve real-time speeds because of the tracking algorithms running in addition to detection. Its processing speed (including detection, tracking, and counting) is about 4 fps on an NVIDIA GeForce MX150, and 8 fps on an NVIDIA GeForce GTX 1080 Ti.
In our case, YOLO v3 performs nearly as well as Inception v3 at object detection, likely because the problem domain involves only a small number of classes. The old model also suffered from inaccurate counting, since it employed a method that depended only on object detection: a pre-determined threshold line was set, and any vehicle detected beyond it added 1 to the count. The threshold had to be placed near the end of the vehicles’ path through the frame to reduce multiple counting; on the other hand, the detector might then not have enough time to detect vehicles after they crossed the threshold. So vehicles moving too slowly could be counted multiple times, while vehicles moving too fast might not be counted at all. This counting method produced especially erroneous results in the case of a traffic jam.
The new model employs object trackers to assign a unique ID to every detected entity, ensuring that every vehicle is counted only once. The reliance on the threshold method alone is thus removed, and stationary traffic no longer poses a serious problem. Objects are almost certainly detected multiple times, but are counted only once thanks to the tracking algorithms.