1. Introduction
Self-driving cars, or autonomous vehicles, will have a huge impact on our society once the technology is deployed at scale [1]. The camera has become an integral component of the autonomous driving system. Using this sensor, the system can perform multiple tasks critical to its autonomy, such as detecting pedestrians, lanes, and traffic signs, or tracking multiple moving obstacles at the same time [2]. Most important are the small package size and low manufacturing cost of the camera, which allow car manufacturers to deploy multiple cameras (forward, backward, or at the side corners) for environment perception.
To be safe, autonomous vehicles must be capable of perceiving their surroundings and all objects that move around them. Multi-object tracking provides this accurate perception of the driving environment [3]. Multi-target tracking based on a mono camera is therefore a key enabling technology for any self-driving vehicle system: it extracts information from the raw camera data to continuously estimate the state of each moving object.
The challenges of 3D visual perception stem mainly from three facts: (1) an image is a projection of real-world objects onto the image plane, and the projective transformation discards the distance information; (2) the size of an object in the image changes with its distance; and (3) it is therefore hard to estimate an object’s size and distance. Proposed solutions include (1) integrating another kind of sensor, for example Lidar; (2) applying geometric constraints; (3) using a deep learning method to extract 3D information from an image; and (4) using multiple cameras or a stereo algorithm. Related work on 3D mono camera perception can be grouped into four categories: (1) inverse transformation from 2D to 3D, (2) keypoints combined with a 3D model, (3) constraints between 2D and 3D, and (4) direct extraction of 3D information through deep learning.
In category 1, the typical method is BEV-IPM [4], which transforms the 2D image into a bird's-eye-view (BEV) image under the assumption that the ground plane and the vehicle coordinate system are parallel to the Cartesian coordinates, and then feeds the BEV map into a YOLO network to detect the bottom line of each object. However, the flatness of the road is hard to guarantee. Another representative method is Pseudo-Lidar [5], which transforms the image into 3D point cloud data based on the depth image and fuses the point cloud and the image to detect 3D objects. The core of this algorithm is depth estimation.
In category 2, the classical method is DeepMANTA [6], which uses a deep neural network to obtain the 2D bounding box, a set of 2D keypoints, the visibility of the keypoints, and the similarity to a 3D model. After combining the 3D model with the 2D keypoints, the algorithm can output the 3D information. The disadvantage of this method is that it needs a 3D model of the object, and the available 3D models are limited, whereas the highway scenario contains many types of cars.
In category 3, the typical method is Deep3DBox [7], which outputs the 3D information under the constraint that at least one corner of the 3D bounding box can be found on the edge of the 2D bounding box. This constraint is modeled as a network layer, so the method can be trained end to end. In a real scene, however, this geometric assumption is hard to meet.
In category 4, there are two types of methods: anchor-based and anchor-free. An anchor-based method produces dense candidate 3D bounding boxes according to prior knowledge and then projects them onto the 2D image. After scoring the 2D candidate bounding boxes, it outputs the final candidates. The typical method is Mono3D [8], which produces the 3D candidate bounding boxes based on the prior position and size of the object. After projecting the 3D bounding boxes onto the 2D image, it scores the 2D bounding boxes according to features from segmentation, size, and position and finally outputs the proposals. The biggest disadvantage of this method is that the dense anchors burden the computation. Another classical method is TLNet [9], in which the size and orientation of the anchors are predefined. It uses the detections from 2D images to form frustums, which decreases the number of anchors. In an anchor-free method, the 3D information is regressed directly from the image. The representative method is FCOS3D [10], which is similar to a 2D object detector with an additional 3D object regression head. The 3D average precision of FCOS3D for cars is about 11.8% at an IoU threshold of 0.7 on the KITTI benchmark, indicating that it is far from practical application.
The complexity of the system lies in the fact that no mono camera detector is perfect, which makes the system susceptible to two kinds of errors: missed detections and false detections. In addition, in the highway scenario, the number of moving objects in the field of view is unknown, and it is difficult to know the state of the moving objects, where they are located, and where they are going. Moreover, a multi-target tracking system is often restricted to tracking the objects inside the field of view of the mono camera [11], and moving objects may enter or leave the field of view. To overcome these challenges while keeping the advantages of the mono camera, this paper designs a multi-object tracking system based on the mono camera. It is composed of three main modules: the object detector [12], the depth estimator [13], and the multi-target tracking engine [3].
The object detector module uses a deep learning approach to detect vehicles in mono camera images and produces a set of bounding boxes around all vehicles in the scene. Before the advent of deep learning and convolutional neural networks, object detection methods were often based on histograms of oriented gradients (HOG) [14] and support vector machines (SVM) [15]. Deep learning algorithms are now the state of the art in most computer vision problems, including object detection for self-driving systems, with two-stage detectors such as the R-CNN family [16] and one-stage detectors such as the YOLO series [17] or SSD [18]. However, deep learning adds constraints to the system because it requires more computational power. In an autonomous vehicle, one of the major concerns about a deep learning object detector is its speed. Compared with other methods, the major advantage of YOLO is its speed, and it can be deployed in an autonomous system easily, so this paper adopts YOLO as the object detector of the multi-object tracking system.
In the mono camera system, every 3D object in the world is projected through the lens onto the image plane. The shortcoming of a monocular camera is that this projection loses the depth information, and the camera alone cannot resolve the resulting ambiguity; this paper resolves it through the depth estimator.
In the mono camera-based multi-target tracking system, another problem to deal with is data association, which determines which measurement comes from which track. The Probabilistic Data Association (PDA) family of methods is adopted in the literature [19]. These methods share a similar data association procedure: they first compute, for each validated measurement at the current time, the probability that it is the correct one, and then use these probabilities as weights to obtain the state estimate of the target. Another method is the Hungarian method [20], which solves the track-to-measurement assignment problem with a worst-case runtime complexity of $O(n^3)$. This paper adopts the global nearest neighbor method with the gating trick, which not only reduces the data association complexity but is also easy to implement.
For target motion state estimation in the highway driving scenario, this paper applies a linear Constant Acceleration (CA) motion model and a nonlinear observation model, since the camera measures the object in image coordinates and the measurement must be related to the ego-vehicle Cartesian coordinates. The Extended Kalman Filter (EKF) [21] is widely used for filtering problems that contain such nonlinear factors.
Track management is another crucial problem in the multi-target tracking system. It refers to track initialization, maintenance, and deletion, which are needed because moving objects may enter or leave the field of view of the mono camera sensor [3].
Based on the analysis of the modules above, the mono camera-based multi-target tracking framework proposed in this paper is shown in Figure 1.
The rest of the paper is organized as follows. Section 2 discusses the deep learning-based object detector YOLO and the depth estimator, which is based on the bounding boxes produced by the object detector. Section 3 describes the kinematic transition model of the moving object and the mono camera sensor measurement model. Section 4 analyzes how to apply the nonlinear filtering approach, the Extended Kalman Filter, in the mono camera tracking system. In Section 5, a gating method combined with the Hungarian data association method is proposed. In Section 6, a simple track management policy is adopted. Finally, the performance of the mono camera tracking system is evaluated qualitatively and quantitatively.
The contribution of this paper can be summarized as follows:
A multi-target tracking system based on a mono camera is constructed, which can be used in expressway scenes.
An object detector combined with a depth estimator is designed to resolve the depth loss problem of the mono camera.
The whole system is tested in the highway scenario, and its lateral and longitudinal tracking performance is evaluated qualitatively and quantitatively.
3. System Model
The object detector and the depth estimator provide the bounding box information, the type of each moving vehicle, and its distance to the ego vehicle. The mono camera tracking system then has to estimate time-varying parameters, that is, solve the state estimation problem, which consists of smoothing the past motion state of a target, filtering its present motion state, and predicting its future motion state [17]. A typical forward-looking mono camera multi-target tracking system can be seen in Figure 4.
In the highway scenario, this paper adopts the constant velocity motion model as the system state transition model. The system model of an object is represented by Cartesian position and velocity components. The model assumes that the target vehicle moves with constant velocity in the lateral and longitudinal directions and models the noise on the velocity components using two independent Wiener processes. The position, velocity, and acceleration can be expressed as Equations (2)–(5):
Assuming that the state space is
and the process noise vector is
The corresponding state transition matrix and process matrix are respectively:
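As a rough illustration of the constant velocity model described above, the following Python sketch builds a transition matrix and a process noise covariance. The state ordering `[x, y, vx, vy]`, the function name `cv_model`, the sampling time `dt`, and the noise intensity `q` are assumptions made for this example and are not taken from the paper. The covariance uses the standard discretization of a continuous white-noise acceleration, which is one common way to realize an independent Wiener-process velocity on each axis.

```python
import numpy as np

def cv_model(dt, q):
    """Constant-velocity transition matrix F and process noise covariance Q.

    Assumed (hypothetical) state ordering: [x, y, vx, vy].
    q is an assumed noise intensity of the white-noise acceleration.
    """
    F = np.array([[1.0, 0.0, dt, 0.0],
                  [0.0, 1.0, 0.0, dt],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
    q3 = q * dt**3 / 3.0  # position variance
    q2 = q * dt**2 / 2.0  # position-velocity covariance
    q1 = q * dt           # velocity variance
    Q = np.array([[q3, 0.0, q2, 0.0],
                  [0.0, q3, 0.0, q2],
                  [q2, 0.0, q1, 0.0],
                  [0.0, q2, 0.0, q1]])
    return F, Q

F, Q = cv_model(dt=0.1, q=1.0)
x = np.array([20.0, 1.5, -2.0, 0.0])  # 20 m ahead, 1.5 m to the left, closing at 2 m/s
x_pred = F @ x                        # position advances by velocity * dt
```

With these example values the longitudinal position moves from 20.0 m to 19.8 m in one step, while the lateral position and both velocities stay unchanged.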
The front-facing mono camera is mounted on the windshield of the vehicle, as shown in Figure 5.
In the mono camera measurement system, vehicles in the real world are projected onto the image plane. The multi-object tracking system tracks the moving objects in the ego-vehicle coordinate system. Three coordinate systems are involved: the ego-vehicle coordinate system, the mono camera coordinate system, and the image coordinate system, as shown in Figure 6.
As shown in Figure 6, in the camera coordinate system, the x-axis is the camera’s optical axis. The intersection of the optical axis and the image plane is called the image center or principal point. In the vehicle coordinate system used for tracking, the x-axis points forward, the y-axis points to the left, and the z-axis points upward. The image coordinate system usually has its origin in the upper left corner of the image. The pixel coordinates are denoted by $u$ for the horizontal dimension and $v$ for the vertical dimension. Note that a pixel is not necessarily perfectly square. Instead of a single focal length $f$, the camera may therefore have two focal lengths $(f_u, f_v)$ that differ slightly. The image center $(c_u, c_v)$ and the focal lengths $(f_u, f_v)$ are obtained through intrinsic camera calibration.
The mono camera measurement model can be defined as Equation (7):
In Equation (7), $(p_x, p_y, p_z)$ is the 3D position of the vehicle in the real world. The vehicle is projected onto the image plane. This is the mono camera measurement model $h(\mathbf{x})$; the formula summarizes how to compute the image coordinates of a 3D object given in vehicle coordinates. Projecting a 3D point onto the 2D image plane makes Equation (7) a nonlinear measurement function, so a mono camera needs a mapping that converts Cartesian coordinates to image coordinates. The mono camera measurement equation can therefore be defined as Equation (8):
$$\mathbf{z}_k = h(\mathbf{x}_k) + \mathbf{w}_k \tag{8}$$
In Equation (8), $\mathbf{z}_k$ is the measurement vector and $\mathbf{w}_k$ is a white Gaussian measurement noise sequence with zero mean and covariance $R$. As can be seen from Equation (8), the measurement function is nonlinear; the next section discusses how to handle this nonlinearity with the Extended Kalman Filter.
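To make the projection concrete, the sketch below shows one possible pinhole measurement function in Python. It is an illustration under stated assumptions rather than the paper's exact model: the camera is assumed to sit at the origin of the vehicle coordinate system with its axes aligned to the vehicle axes (x forward, y left, z up), and the names `project_to_image`, `f_u`, `f_v`, `c_u`, and `c_v` are hypothetical.

```python
import numpy as np

def project_to_image(p_veh, f_u, f_v, c_u, c_v):
    """Project a 3D point in vehicle coordinates to pixel coordinates (u, v).

    Assumption: the camera is at the vehicle origin and its axes are aligned
    with the vehicle axes, so no extrinsic rotation or translation is applied.
    """
    x, y, z = p_veh
    if x <= 0.0:
        raise ValueError("point must lie in front of the camera (x > 0)")
    u = c_u - f_u * y / x  # a point to the left (y > 0) lands left of the image center
    v = c_v - f_v * z / x  # a point above the optical axis (z > 0) lands higher (smaller v)
    return np.array([u, v])

# Assumed intrinsics: 1000 px focal lengths, image center at (640, 360).
near = project_to_image((20.0, 2.0, -1.0), 1000.0, 1000.0, 640.0, 360.0)
far = project_to_image((40.0, 4.0, -2.0), 1000.0, 1000.0, 640.0, 360.0)
```

Both points map to the same pixel, (540, 410), although one is twice as far away as the other. This is the depth ambiguity that the depth estimator has to resolve, and the division by `x` is what makes the measurement function nonlinear.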
4. The State Estimator
The most famous state estimator is the Kalman filter [18], which provides a dynamic estimate of moving targets under the linear Gaussian assumption. In many practical cases, however, the measurement function is nonlinear, as shown in Equation (8). The usual approach is to turn the nonlinear filtering problem into an approximately linear one by linearization and then apply linear filtering theory, which yields a suboptimal filtering algorithm for the original nonlinear problem. The most commonly used linearization method is the Taylor series expansion, which leads to the Extended Kalman Filter [19].
The mono camera measurement function $h(\mathbf{x})$ is composed of two equations that show how the predicted state is mapped into the measurement space, as shown in Equation (7). After calculating all the partial derivatives, the resulting Jacobian matrix $H$ is defined as Equation (9):
After the measurement equation is linearized, both the transition and the measurement equations are linear, so the standard Kalman filter can be used to predict and update the track state in the mono camera-based tracking system. The Kalman filter consists of two steps, prediction and update, shown as Equations (10)–(16):
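The prediction step is defined as Equation (10):
$$\hat{\mathbf{x}}_{k|k-1} = F\,\hat{\mathbf{x}}_{k-1|k-1} \tag{10}$$
The state prediction covariance is Equation (11):
$$P_{k|k-1} = F\,P_{k-1|k-1}\,F^{T} + Q \tag{11}$$
The updated state estimate is shown as Equation (12):
$$\hat{\mathbf{x}}_{k|k} = \hat{\mathbf{x}}_{k|k-1} + K_k\,\boldsymbol{\nu}_k \tag{12}$$
$K_k$ is the filter gain defined as Equation (13):
$$K_k = P_{k|k-1}\,H^{T} S_k^{-1} \tag{13}$$
$\boldsymbol{\nu}_k$ is called the innovation or measurement residual, defined as Equation (14):
$$\boldsymbol{\nu}_k = \mathbf{z}_k - h(\hat{\mathbf{x}}_{k|k-1}) \tag{14}$$
$S_k$ is the measurement residual covariance following Equation (15):
$$S_k = H\,P_{k|k-1}\,H^{T} + R \tag{15}$$
Finally, the updated covariance of the state at time $k$ follows Equation (16):
$$P_{k|k} = (I - K_k H)\,P_{k|k-1} \tag{16}$$
Here $F$ is the state transition matrix, $Q$ is the process noise covariance, $H$ is the Jacobian of the measurement function evaluated at the predicted state, and $R$ is the measurement noise covariance.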
In the mono camera multi-target tracking system, iterating between the prediction and update steps maintains the states of the tracked objects. The filter can be tuned to rely more on the motion model or more on the measurements by specifying noise parameters for both. Measurement noise is typically specified by the sensor manufacturer and reflects the physical accuracy of the sensor. Process noise accounts for unknown or unmodeled motion. The ratio of process noise to measurement noise determines whether the tracking system relies more on the process model or on the measurements.
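The predict and update cycle can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation. It assumes a state `[x, y, vx, vy]`, a camera at the vehicle origin aligned with the vehicle axes, and a tracked point at a known height `Z0` relative to the camera so that two image coordinates can be computed from a planar state. The intrinsics, `Z0`, and the values of `Q` and `R` are hypothetical.

```python
import numpy as np

F_U, F_V, C_U, C_V = 1000.0, 1000.0, 640.0, 360.0  # assumed intrinsics in pixels
Z0 = -1.2  # assumed height of the tracked point relative to the camera, in metres

def h(x):
    """Nonlinear measurement function: state [x, y, vx, vy] -> pixel (u, v)."""
    px, py = x[0], x[1]
    return np.array([C_U - F_U * py / px, C_V - F_V * Z0 / px])

def jacobian_h(x):
    """Jacobian of h with respect to the state, evaluated at x."""
    px, py = x[0], x[1]
    return np.array([[F_U * py / px**2, -F_U / px, 0.0, 0.0],
                     [F_V * Z0 / px**2, 0.0, 0.0, 0.0]])

def predict(x, P, F, Q):
    """Propagate the state estimate and its covariance one time step."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z, R):
    """Correct the predicted state with the measurement z."""
    H = jacobian_h(x)
    nu = z - h(x)                    # innovation
    S = H @ P @ H.T + R              # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)   # filter gain
    x_new = x + K @ nu
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new

dt = 0.1
F = np.array([[1, 0, dt, 0], [0, 1, 0, dt], [0, 0, 1, 0], [0, 0, 0, 1]], dtype=float)
Q = 0.01 * np.eye(4)  # placeholder process noise
R = 4.0 * np.eye(2)   # assumed 2 px standard deviation per image axis

x_true = np.array([30.0, 1.0, -1.0, 0.0])
x_est = np.array([25.0, 0.0, 0.0, 0.0])   # deliberately wrong initial guess
P = np.diag([25.0, 4.0, 4.0, 4.0])
for _ in range(20):
    x_true = F @ x_true
    z = h(x_true)                          # noise-free measurement keeps the run reproducible
    x_est, P = predict(x_est, P, F, Q)
    x_est, P = update(x_est, P, z, R)
```

In this toy run the position estimate moves from the wrong initial guess toward the true position. Increasing `Q` relative to `R` makes the filter follow the measurements more closely, and decreasing it makes the filter trust the motion model more.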
5. Data Association
Data association determines which measurement is associated with which track. In the highway driving scenario, it decides which track to update with which measurement. The data association module computes track and measurement pairs and indicates which measurement probably originated from which track. Two assumptions are made for the association: each track generates at most one measurement, and each measurement originates from at most one track. A simple approach is to update each track with the closest measurement. This paper uses the Mahalanobis distance as the decision metric, defined as Equation (17):
$$d(\mathbf{x}, \mathbf{z}) = \sqrt{(\mathbf{z} - h(\mathbf{x}))^{T} S^{-1} (\mathbf{z} - h(\mathbf{x}))} \tag{17}$$
where $\mathbf{z}$ is the measurement, $\mathbf{x}$ is the position of the track, and $S$ is the innovation covariance.
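A short Python sketch shows why the Mahalanobis distance is a more suitable metric than the plain pixel distance. The function name `mahalanobis` and the example covariance are assumptions for illustration; the predicted measurement `z_pred` stands for the track position mapped into the image.

```python
import numpy as np

def mahalanobis(z, z_pred, S):
    """Mahalanobis distance between a measurement z and a predicted measurement z_pred."""
    nu = np.asarray(z, dtype=float) - np.asarray(z_pred, dtype=float)
    # solve() avoids forming the explicit inverse of S
    return float(np.sqrt(nu @ np.linalg.solve(S, nu)))

S = np.diag([100.0, 4.0])  # assumed: 10 px std horizontally, 2 px std vertically
z_pred = [600.0, 400.0]
d_a = mahalanobis([610.0, 400.0], z_pred, S)  # 10 px off along the uncertain axis
d_b = mahalanobis([600.0, 410.0], z_pred, S)  # 10 px off along the precise axis
```

Both measurements are 10 pixels away from the prediction, but `d_a` is 1.0 and `d_b` is 5.0, because a deviation along the axis with the smaller uncertainty is far less likely to come from this track.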
Computing the distances for very unlikely, faraway combinations is unnecessary, so a gate is used to reduce the computational effort. A gate, or threshold, is defined on the Mahalanobis distance: for every possible association between a track and a measurement, it is first checked whether the Mahalanobis distance is smaller than the threshold; if the distance is larger, this possible association is ignored. The gating trick is shown in Figure 7.
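If a measurement lies outside a track’s gate, the corresponding distance in the data association matrix is set to infinity, as shown by Equation (18):
$$A_{ij} = \begin{cases} d(\mathbf{x}_i, \mathbf{z}_j), & d(\mathbf{x}_i, \mathbf{z}_j) \le \gamma \\ \infty, & \text{otherwise} \end{cases} \tag{18}$$
where $d(\mathbf{x}_i, \mathbf{z}_j)$ is the Mahalanobis distance between track $i$ and measurement $j$, and $\gamma$ is the gate threshold.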
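The gated association matrix can be sketched in Python as follows. The function name `association_matrix` and the example values are hypothetical. The gate value is also an assumption: the sketch applies the gate to the squared Mahalanobis distance and uses 9.21, the 99% quantile of a chi-square distribution with two degrees of freedom, which is a common choice for a two-dimensional measurement.

```python
import numpy as np

GATE = 9.21  # assumed gate: 99% chi-square quantile for 2 dof, applied to the squared distance

def association_matrix(z_preds, S_list, measurements, gate=GATE):
    """Build the N x M matrix of Mahalanobis distances, with inf outside the gate."""
    z_preds = np.asarray(z_preds, dtype=float)
    measurements = np.asarray(measurements, dtype=float)
    n, m = len(z_preds), len(measurements)
    A = np.full((n, m), np.inf)
    for i in range(n):
        for j in range(m):
            nu = measurements[j] - z_preds[i]
            d2 = nu @ np.linalg.solve(S_list[i], nu)  # squared Mahalanobis distance
            if d2 <= gate:
                A[i, j] = np.sqrt(d2)
    return A

z_preds = [[600.0, 400.0], [300.0, 380.0]]        # predicted measurements of two tracks
S_list = [25.0 * np.eye(2), 25.0 * np.eye(2)]     # assumed innovation covariances
measurements = [[603.0, 404.0], [298.0, 381.0], [900.0, 100.0]]
A = association_matrix(z_preds, S_list, measurements)
```

In this example each track keeps exactly one finite entry, and the third measurement falls outside both gates, so its whole column is infinite.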
In data association, it is assumed that each track generates at most one measurement and each measurement originates from at most one track. Suppose there are N tracks and M measurements. The association matrix A is an N × M matrix that contains the Mahalanobis distance between each track and each measurement.
A list of unassigned tracks and a list of unassigned measurements are also needed. The smallest entry in A determines which track to update with which measurement; the corresponding row and column are then deleted from A, the track ID and measurement ID are removed from the lists, and the process is repeated until A is empty.
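The selection procedure described above can be sketched in Python as follows. The function name `greedy_associate` is hypothetical. As an added assumption, the loop also stops once only infinite (gated-out) entries remain, so that gated pairs are never associated.

```python
import numpy as np

def greedy_associate(A):
    """Repeatedly pick the smallest finite entry of A and remove its row and column.

    Returns the (track, measurement) pairs plus the unassigned tracks and measurements.
    """
    A = np.array(A, dtype=float)  # work on a copy
    unassigned_tracks = list(range(A.shape[0]))
    unassigned_meas = list(range(A.shape[1]))
    pairs = []
    while A.size > 0 and np.isfinite(A).any():
        i, j = np.unravel_index(np.argmin(A), A.shape)
        # pop() maps the shrinking row/column indices back to the original IDs
        pairs.append((unassigned_tracks.pop(i), unassigned_meas.pop(j)))
        A = np.delete(np.delete(A, i, axis=0), j, axis=1)
    return pairs, unassigned_tracks, unassigned_meas

inf = np.inf
pairs_1, tracks_1, meas_1 = greedy_associate([[1.0, inf, inf],
                                              [inf, 0.447, inf]])
pairs_2, tracks_2, meas_2 = greedy_associate([[1.0, 2.0],
                                              [1.5, inf]])
```

In the first example both tracks are matched and measurement 2 stays unassigned, which would typically start a new track. In the second example the greedy choice of the entry 1.0 leaves track 1 and measurement 1 unmatched, which illustrates that picking the smallest entry first is simple but not guaranteed to find the assignment with the lowest total distance.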
When the data association module receives a new set of detections from the mono camera, the tracker attempts to assign these detections to the existing tracks it maintains. The assignment has three possible outcomes: a detection is left unassigned, a detection is assigned to a track, or a track is left unassigned, as depicted in Figure 8. If the assignment gate or threshold is too small, many detections may be left unassigned, leading to the creation of many tracks. If it is too large, incorrect detection associations may occur.