YOLOv2 Deep Learning Model and GIS-Based Algorithms for Vehicle Tracking
1. Introduction
Vehicle tracking is an important subject with interesting applications. It has been extensively studied from different angles, using both classical object detection methods and GIS methods based on GPS and real-time communication tools.
As one of the most important tasks in computer vision, object detection is growing rapidly, thanks to the latest advances in deep learning based methods and to the computational power of clusters of graphics processing units (GPUs). This offers new opportunities for vehicle tracking, through the use of high-resolution satellite imagery and deep learning methods based on Convolutional Neural Networks (CNNs) [1]. In this paper, for vehicle tracking purposes, the YOLOv2 model [2], a fast growing open source CNN, is trained on VEDAI images, an open dataset of vehicle imagery. GIS functionalities and the LinkTheDots algorithm are used for spatio-temporal track creation, control and visualization.
The plan of the paper is as follows. This section presents a literature review of some studies on vehicle tracking and object detection, with the basic concepts of Deep Learning, CNNs and YOLO. Section 2 presents the general approach, the preparation of input data, YOLOv2 training and the LinkTheDots algorithm, as well as the GIS features used. Section 3 examines the results obtained and Section 4 provides some conclusions.
1.1. Vehicle Tracking
Vehicle tracking has become an important task with applications in many fields such as urban traffic monitoring [3], intelligent transportation systems [4], ground surveillance [5], driving safety and security [6], advanced driving assistance systems [7], etc.
While classical methods of vehicle tracking are based on the combination of GPS, GSM, GPRS and internet technologies [8] [9] [10] [11], new methods based on imagery and AI are rapidly evolving [12]-[17]. The advantage of these new methods is their ability to process data at large scales, without the need to first install special equipment in tracked vehicles. They take advantage of accelerated advances in artificial intelligence, especially deep learning, and thus significantly reduce the cost of access to these analysis data for the largest number of interested researchers and businesses.
1.2. Object Detection and GIS
Object detection consists of detecting instances of a certain class (such as vehicles, humans, or trees) in digital images. It is a computer vision subject that finds numerous applications in several fields, such as facial recognition [18], autonomous driving [19], and lately face mask detection amid the COVID-19 pandemic [20]. The main objective of object detection is to develop computational systems that deliver to computer vision applications a key piece of information: "What objects are where?" [21], which is also the basis of multiple GIS (Geographic Information Systems) applications. The two areas benefit from and complement each other [22] [23] [24].
1) Object detection and image classification
The objective of image classification is to extract existing classes of visual objects, without necessarily specifying their location in the image. It answers the question “what object is in the image?”.
On the other hand, object detection locates instances of classes in the image, with bounding boxes or bounding polygons [25], as shown in Figure 1.
2) Object identification
Object identification happens when the detected objects in the image are assigned unique identification codes. It is used in real-time object tracking applications, for instance [26].
Figure 1. Bounding boxes (left) vs. bounding polygons (right).
1.3. Image Processing with Deep Learning
1) Deep Learning (DL)
Recent research works show that Deep Learning methods have emerged as powerful Machine Learning methods for object recognition and detection [27] [28] [29] [30] [31]. Deep learning achieves its complex nonlinearity by composing many nonlinear functions [32]. While traditional approaches of Artificial Intelligence and Machine Learning make it possible to learn hierarchical representations corresponding specifically to the analyzed data [33], Deep Learning Neural Networks tend to evolve the representation of raw data incrementally into categories of abstractions as the system is fed with data [34] [35]. Thus, with its boosted capacity to adjust billions of parameters thanks to massive parallelism computing capabilities, the success of Deep Learning algorithms in AI applications such as image and video processing is phenomenal [36].
2) Convolutional Neural Networks (CNN)
When dealing with images, unlike the traditional approaches, Deep Learning models learn the features directly from the raw pixels, developing local receptive fields from lower layers to upper layers. For instance, lower layers recognize simple features like lines and corners, while higher layers extract complex features representing real life objects such as vehicles. The successes of DL in image processing are testified by the challenging ImageNet classification task across thousands of classes [30] [37], using a kind of deep neural network called a Convolutional Neural Network (CNN) [38].
The structure of CNNs was initially based on the animal visual cortex organization [39]. After a slow start in the early 1990s due to computing capacity limits [40] [41], CNNs experienced a huge boom with the rapid development of these capabilities with, among others, cloud computing.
CNNs are made up of several layers similar to feed-forward neural networks. The outputs and inputs of the layers are given as a set of image matrices. CNNs can be constructed by different combinations of convolutional layers (where convolution operation is done on specified filters), pooling layers, and fully connected layers (generally, before the output) with nonlinear activation functions. A typical CNN architecture is shown in Figure 2 [38].
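For illustration only (this is not the architecture used in this study), a minimal PyTorch sketch of such a combination of convolutional, pooling and fully connected layers could look as follows; layer sizes and the number of classes are arbitrary.

import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    # Illustrative CNN: convolution + pooling blocks followed by a fully connected classifier.
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolutional layer
            nn.ReLU(),                                     # nonlinear activation
            nn.MaxPool2d(2),                               # pooling layer
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)  # fully connected layer before the output

    def forward(self, x):
        x = self.features(x)        # features learned directly from raw pixels
        x = torch.flatten(x, 1)     # flatten feature maps for the dense layer
        return self.classifier(x)   # class scores

model = TinyCNN()
scores = model(torch.randn(1, 3, 224, 224))  # one 224 x 224 RGB image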
3) Single Shot CNN: YOLO
You Only Look Once (YOLO) is a Convolutional Neural Network object detection system that handles object detection as a single regression problem, from image pixels to bounding boxes with their class probabilities. Its performance is much better than that of traditional object detection methods, since it trains directly on full images.
YOLO is formed of 27 CNN layers, with 24 convolutional layers, two fully connected layers, and a final detection layer [2] (Figure 3).
YOLO divides the input image into an N × N grid of cells, then, during processing, predicts several bounding boxes for each cell for the object to be detected. Thus, a loss function has to be calculated. YOLO first calculates, for each bounding box, the Intersection over Union (IoU); it then uses sum-squared error to calculate the loss between the predicted results and the real objects. The final loss is the sum of three loss functions: 1) classification loss, related to class probability; 2) localization loss, related to the bounding box position and size; and 3) confidence loss, measuring the probability that an object is in the box [42].
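As a concrete illustration of the IoU term in this loss (a generic sketch, not code from the YOLO implementation), the ratio can be computed from two axis-aligned boxes given as (x_min, y_min, x_max, y_max) tuples:

def iou(box_a, box_b):
    # Boxes are (x_min, y_min, x_max, y_max); returns Intersection over Union in [0, 1].
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    intersection = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # area of overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection                     # area of union
    return intersection / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175, about 0.14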
Figure 2. Convolutional neural networks architecture [38].
2. Methodology
In order to generate vehicle temporal paths in GIS format from aerial video, a three-step process is adopted:
• To solve the problem of handling continuous aerial video stream, which represents a big technical challenge [43], the video stream is converted into a series of images, with a suitable resolution for the trained YOLOv2 algorithm.
• Each individual image is then processed with YOLOv2 algorithm trained beforehand.
• With LinkTheDots algorithm, the detected vehicles are then tracked throughout the output series of images, generating a specific GIS dated path for each vehicle.
Figure 4 shows the general process, and Figure 5 presents the process of YOLOv2 algorithm training (LinkTheDots algorithm process is detailed later in this section).
2.1. Input Data: From Aerial Video to a Series of Images
From an aerial video of a busy parking lot [44], the series of frames was extracted. Figure 6 presents one of the extracted images.
Figure 6. A frame from the series of images extracted from the aerial video [44].
The metadata of each frame contains the detailed date of the image, which is inherited by all detected vehicles on the frame.
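A minimal sketch of this frame extraction step, assuming OpenCV (cv2); the video file name and the recording start time are hypothetical, and the timestamp of each frame is derived here from the frame rate rather than read from embedded metadata:

import os
from datetime import datetime, timedelta

import cv2

video = cv2.VideoCapture("parking_lot.mp4")       # hypothetical aerial video file
fps = video.get(cv2.CAP_PROP_FPS)                 # frames per second of the source video
start_time = datetime(2018, 5, 8, 13, 35, 41)     # assumed recording start time
os.makedirs("frames", exist_ok=True)

frame_index = 0
while True:
    ok, frame = video.read()
    if not ok:
        break
    timestamp = start_time + timedelta(seconds=frame_index / fps)
    # The file name carries the date in the ddmmyyyyHHMMSS format used later in the tracking table.
    name = f"frames/frame_{frame_index:05d}_{timestamp.strftime('%d%m%Y%H%M%S')}.png"
    cv2.imwrite(name, frame)
    frame_index += 1

video.release()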
At this stage of the study, the set of images is ready to be processed one by one with the trained YOLOv2 algorithm for vehicle detection.
2.2. YOLOv2 Algorithm Training
1) Training data
YOLO and CNN algorithms in general, when applied to imagery data, can be trained with data from anywhere and applied with the same degree of certainty elsewhere [45]. For this reason, in the absence of local sources of aerial imagery, the VEDAI (Vehicle Detection in Aerial Imagery) data source [45] is used. In addition to its open access and the large number of offered images (more than 10,000), the VEDAI database offers labels for each vehicle, ready to use for training recognition algorithms (Figure 7).
The YOLOv2 model was trained and tested with a set of images of 1024 × 1024 resolution. Overall, a dataset of 1200 images was used: 70% of them as training data and 30% for testing.
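A simple sketch of such a split (assuming the 1024 × 1024 VEDAI images sit in a hypothetical local folder; the 70/30 ratio matches the one above, and the resulting train.txt/test.txt files are the kind of image lists Darknet reads):

import glob
import random

images = sorted(glob.glob("vedai/images/*.png"))  # hypothetical folder of 1024 x 1024 VEDAI images
random.seed(42)                                   # reproducible shuffle
random.shuffle(images)

split = int(0.7 * len(images))                    # 70% training / 30% test
with open("train.txt", "w") as f:
    f.write("\n".join(images[:split]))
with open("test.txt", "w") as f:
    f.write("\n".join(images[split:]))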
2) Training platform
Training the YOLO algorithm, like that of all deep learning models, requires considerable computing capacity [32]. Therefore, a cloud platform with the configuration specified in Table 1 was used. One of the most important aspects of this configuration is the high performance GPU (Graphics Processing Unit), as its parallel architecture is efficient for model learning. Combined with clusters or cloud computing, it considerably reduces network training time.
Darknet [46] was used as a training framework; it is an open source Neural Network framework written in C and CUDA that supports CPU and GPU computation.
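A typical Darknet training invocation, wrapped here in Python for consistency with the other sketches; the .data, .cfg and pretrained weight file names are placeholders, not those of this study. The .data file points to the train/test image lists and class names, and the .cfg file describes the YOLOv2 network.

import subprocess

# Hypothetical Darknet call: obj.data lists train.txt/test.txt and the class names,
# yolov2-vedai.cfg is the network definition, darknet19_448.conv.23 the pretrained backbone weights.
subprocess.run([
    "./darknet", "detector", "train",
    "data/obj.data", "cfg/yolov2-vedai.cfg", "darknet19_448.conv.23",
], check=True)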
Environment | Amazon AWS
Instance type | p2.xlarge
System | Windows Server 2016 x64
Processor | 4 CPU Intel Xeon E5-2686 v4 @ 2.30 GHz
RAM | 61 GB
GPU | 1 NVIDIA Tesla K80 GPU with 12 GB of memory
HDD | 100 GB
Table 1. YOLOv2 training environment specifications.
2.3. LinkTheDots Algorithm
In order to track the same vehicle throughout successive frames, the LinkTheDots algorithm was developed. Its main task is to link the centroid of a vehicle's bounding box on a given frame to the centroid of the same vehicle's bounding box on the next frame. This indicates that, between the instants of the two frames, this particular vehicle has moved from the first point to the second.
After all the frames are processed with the trained YOLOv2 algorithm and all bounding boxes are generated, all vehicles' centroids are created with GIS tools. The LinkTheDots algorithm then processes all of the resulting frames, starting with the first, where every point is assigned a vehicle ID. From the second frame onward, the algorithm checks whether each detected vehicle was already identified in the previous frame in order to reuse its ID; otherwise, a new vehicle ID is assigned. Figure 8 shows the detailed process of the LinkTheDots algorithm.
LinkTheDots identifies the vehicle's position in the previous frame by performing a geographic search within a distance Δmax, beyond which no vehicle is assumed to be able to move between two consecutive frames, given parameters such as the maximum vehicle speed. Δmax is therefore an adjustment parameter of the algorithm.
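A minimal sketch of this linking step, as one possible reading of the algorithm described above (not the authors' implementation); each frame is given as a list of detected centroids and delta_max is the Δmax search radius:

import math

def link_the_dots(frames, delta_max):
    # frames: list of lists of (x, y) bounding box centroids, one inner list per video frame.
    # Returns (point, frame_index, vehicle_id) records suitable for a GIS point table.
    tracks = []
    previous = []      # (x, y, vehicle_id) records from the previous frame
    next_id = 1
    for frame_index, centroids in enumerate(frames, start=1):
        current = []
        for (x, y) in centroids:
            # Geographic search: nearest already-identified vehicle within delta_max.
            best_id, best_dist = None, delta_max
            for (px, py, vid) in previous:
                d = math.hypot(x - px, y - py)
                if d <= best_dist:
                    best_id, best_dist = vid, d
            if best_id is None:        # no match: a new vehicle enters the scene
                best_id = next_id
                next_id += 1
            current.append((x, y, best_id))
            tracks.append(((x, y), frame_index, best_id))
        previous = current
    return tracks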
3. Results and Discussion
3.1. YOLOv2 Algorithm Training Results
Table 2 below presents the main output parameters of YOLOv2 training:
Figure 8. LinkTheDots algorithm process.
Parameter | Indication | Target
Avg IOU | Average "Intersection over Union": IOU = Area of Overlap / Area of Union, where the two areas are the predicted bounding box and the ground truth of the target object | 100%
Avg recall | Average "Recall" = Recall / Count: the ratio of the number of detected objects to the total number of objects to be detected | 100%
Count | The total number of objects to be detected in the current set (number of ground truth objects) | -
Number of iterations | The number of training iterations completed | -
Average loss | The average loss (error) | As low as possible
Total time | The total time spent processing the batch | -
Table 2. YOLOv2 training output parameters.
In Table 3, the results of the YOLOv2 training are presented: image resolution, dataset size, beginning of convergence, number of iterations, average loss and training duration.
The evolution of the average loss during the iterations of the learning process is presented in Figure 9 and test results illustration is presented in Figure 10.
Training images resolution | Number of images | Beginning of convergence | Number of iterations | Average loss | Training duration
1024 × 1024 | 1200 | Iteration 802 | 22,000 | 0.05 | 7 days
Table 3. Parameters and overall results of YOLOv2 training.
Figure 9. Evolution of the average loss according to the number of iterations.
The model detected 91% of test vehicles. These results show that the trained model can identify vehicles with satisfactory accuracy that meets the intended application requirements for spatio-temporal tracking. With a larger set of training images, this accuracy can be significantly improved.
3.2. Vehicles Tracking Results
The results of the trained YOLOv2 algorithm and the processing of the output data (Figure 4) are: 1) the table of positions of moving vehicles produced by the LinkTheDots algorithm, an extract of which is presented in Table 4; and 2) the vehicles' positions throughout the duration of the input aerial video, shown in Figure 11.
Point ID | Frame | Vehicle ID | TimeStamp (ddmmyyyyHHMMSS)
1503 | 7 | 8 | 08052018133553 |
1489 | 6 | 8 | 08052018133551 |
1202 | 5 | 8 | 08052018133549 |
835 | 4 | 8 | 08052018133547 |
652 | 3 | 8 | 08052018133545 |
354 | 2 | 8 | 08052018133543 |
193 | 1 | 8 | 08052018133541 |
Table 4. Excerpt from LinkTheDots space-time table of moving vehicles.
Figure 11. Generated centroids throughout time.
Using GIS tools for converting collections of points to lines, these points were grouped by vehicle ID and converted into tracks. Thus, the spatio-temporal tracks of the moving vehicles in the aerial video were obtained (Figure 12).
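A sketch of this conversion with common open source GIS libraries (assuming GeoPandas and Shapely, and assuming the centroid layer carries vehicle_id and timestamp attributes as in Table 4; file and field names are hypothetical):

import geopandas as gpd
from shapely.geometry import LineString

points = gpd.read_file("vehicle_centroids.shp")            # point layer produced by LinkTheDots
points = points.sort_values(["vehicle_id", "timestamp"])   # time-ordered positions per vehicle

# Build one line per vehicle from its time-ordered centroids (each vehicle needs at least two points).
lines = (
    points.groupby("vehicle_id")["geometry"]
    .apply(lambda pts: LineString(pts.tolist()))
)
tracks = gpd.GeoDataFrame(lines.reset_index(), geometry="geometry", crs=points.crs)
tracks.to_file("vehicle_tracks.shp")                        # spatio-temporal tracks layer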
3.3. The LinkTheDots Algorithm Limits
The LinkTheDots algorithm is based on the assumption that the nearest bounding box centroid in the following image belongs to the same vehicle. The algorithm parameter Δmax must therefore be set to a value that avoids confusing two different vehicles on two successive frames.
Let:
Δ: The vehicle’s travelled distance between two frames
Wvehicle: The vehicle’s width
Then, to avoid confusion between vehicles, we must have:
∆ < Wvehicle (1)
Figure 12. Vehicles’ GIS spatio-temporal tracks.
This means that Δmax must be set below the minimum vehicle width.
Let:
Vvehicle: The vehicle’s velocity
Vcamera: The velocity of the camera
Fr: The number of frames per second (frame rate)
Then:
Δ = (Vvehicle − Vcamera)/Fr (2)
From (1) and (2):
(Vvehicle − Vcamera)/Fr < Wvehicle (3)
This implies that, in the case of a static camera (Vcamera = 0), for an average vehicle width of 2 meters and a camera frame rate of 15 frames per second, the maximum velocity up to which a vehicle can be tracked is 30 m/s (108 km/h).
Another implication is that if a vehicle travelling at 150 km/h is to be tracked, still with a static camera, the camera used should have a frame rate of 21 frames per second or higher.
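Both figures follow directly from inequality (3) with a static camera and a 2 m minimum vehicle width; a quick numerical check:

w_vehicle = 2.0                 # minimum vehicle width in metres

# Maximum trackable speed at 15 frames per second.
fr = 15
v_max = w_vehicle * fr          # 30 m/s
print(v_max * 3.6)              # 108 km/h

# Minimum frame rate needed to track a vehicle at 150 km/h.
v_target = 150 / 3.6            # about 41.7 m/s
print(v_target / w_vehicle)     # about 20.8, so at least 21 frames per second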
4. Conclusion
In this work, the YOLOv2 model was trained for the detection of vehicles on aerial images. The trained model was coupled with the LinkTheDots algorithm for GIS spatio-temporal tracking. The limits and validity conditions of the proposed algorithm were discussed in relation to the frame rate of the raw aerial video and the speed of the tracked vehicles. The accuracy of the trained model, found to be around 91%, could be significantly improved with a larger set of training images.