In this section, we introduce IRSTFormer, our vision transformer-based method for infrared small target detection, in detail.
3.1. Network Architecture
As shown in Figure 2, our method belongs to segmentation-based infrared small target detection. Given an input image of size $H \times W$, the network classifies each pixel as target or background and finally outputs the corresponding segmentation mask.
In order to reduce false alarms in complex infrared images more efficiently, we propose a hierarchical vision transformer, HOSPT, to extract multi-scale features $\{F_1, F_2, F_3, F_4\}$ from the input image, where $F_i$ denotes the feature map produced by the $i$-th stage of the encoder, $i \in \{1, 2, 3, 4\}$. Different from recent CNNs, the self-attention layers in the transformer can learn dependency relationships over the range of the whole image, which is essential for suppressing background interference in complex images. The shallow features contain more location information that helps to locate the target in the image, whereas the deeper features contain richer semantics that help to distinguish false alarms from targets. Therefore, for the decoder, we present the TFAM. In each TFAM, adjacent features are first aggregated in top-down order; after that, we utilize channel attention to refine the fused feature. With the predicted segmentation mask obtained, we utilize the CBS loss to optimize the network.
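A minimal sketch of this overall data flow, assuming a PyTorch-style implementation with placeholder encoder stages, TFAM modules, and a segmentation head (illustrative only, not our released code), is given below:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IRSTFormerSkeleton(nn.Module):
    """Schematic forward pass: 4-stage encoder -> top-down fusion -> 1-channel mask."""
    def __init__(self, encoder_stages, tfam_modules, head):
        super().__init__()
        self.stages = nn.ModuleList(encoder_stages)   # each stage down-samples its input
        self.tfams = nn.ModuleList(tfam_modules)      # three fusion modules (F4+F3, ..+F2, ..+F1)
        self.head = head                              # e.g., a 1x1 conv to a single logit map

    def forward(self, img):                           # img: (B, 1, H, W) infrared image
        feats, x = [], img
        for stage in self.stages:                     # collect multi-scale features F1..F4
            x = stage(x)
            feats.append(x)
        f = feats[-1]                                 # start from the deepest (most semantic) scale
        for tfam, skip in zip(self.tfams, reversed(feats[:-1])):
            f = tfam(f, skip)                         # fuse top-down with the next shallower scale
        logits = self.head(f)                         # (B, 1, h, w)
        return F.interpolate(logits, size=img.shape[-2:], mode="bilinear", align_corners=False)
```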
3.2. Hierarchical Overlapped Small Patch Transformer
Among the existing deep learning methods, increasingly deep CNNs are used to extract features from infrared small target images, but these methods are always limited by the locality of convolution, resulting in a poor ability to model long-range dependencies in the images. With the increase in the size and field of view (FOV) of infrared detectors, this deficiency is more likely to lead to detection errors. Therefore, we design a transformer-based encoder, HOSPT, for feature extraction.
At the beginning of each stage, we design the OSPE to divide the input feature map into different patches and conduct a linear projection to obtain the two-dimensional feature embedding. During this process, the OSPE also completes the down-sampling of the feature maps to realize multi-scale feature extraction. After that, the dot-product self-attention layer explicitly models the dependencies between different image patches. The extracted attention features encode how similar each patch is to the other patches in the input feature map.
Figure 3 shows the structure of each stage, which consists of four kinds of components: the OSPE, the self-attention layer, the feed-forward network (FFN), and layer normalization (LN). One self-attention layer, one FFN, and two LN layers constitute one transformer block, and each stage contains two transformer blocks.
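The composition of one stage can be sketched as follows, assuming a pre-normalization ordering of LN, attention, and FFN with residual connections (the exact ordering and module interfaces are assumptions of this sketch):

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One block: LN -> self-attention -> residual, LN -> FFN -> residual (pre-norm assumed)."""
    def __init__(self, attn, ffn, dim):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn, self.ffn = attn, ffn

    def forward(self, x, h, w):                        # x: (B, N, C) patch embeddings
        x = x + self.attn(self.norm1(x), h, w)
        x = x + self.ffn(self.norm2(x), h, w)
        return x

class EncoderStage(nn.Module):
    """OSPE followed by two transformer blocks, as described for each HOSPT stage."""
    def __init__(self, ospe, block1, block2):
        super().__init__()
        self.ospe, self.blocks = ospe, nn.ModuleList([block1, block2])

    def forward(self, x):                              # x: (B, C_in, H, W)
        x, h, w = self.ospe(x)                         # (B, N, C), N = h * w
        for blk in self.blocks:
            x = blk(x, h, w)
        b, n, c = x.shape
        return x.transpose(1, 2).reshape(b, c, h, w)   # back to a feature map for the next stage
```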
After experimenting with different parameters of the OSPE, we set the patch size to $5 \times 5$ and the stride to 2, which means there is an overlap of three pixels between adjacent patches. Compared with ViT [18], the overlap preserves the continuity between different patches. Specifically, an input three-dimensional feature map of size $C_{\mathrm{in}} \times H \times W$ is first divided into $N$ overlapped patches of size $C_{\mathrm{in}} \times 5 \times 5$, where $N = \frac{H}{2} \times \frac{W}{2}$. Then, each patch is flattened and linearly projected to the embedding dimension $C$. Finally, the output two-dimensional feature embedding has the size of $N \times C$.
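One common way to realize such an overlapped patch embedding is a strided convolution. The sketch below assumes a $5 \times 5$ kernel, stride 2, and padding 2 (the padding value is an assumption) so that the output grid is $\frac{H}{2} \times \frac{W}{2}$:

```python
import torch
import torch.nn as nn

class OSPE(nn.Module):
    """Overlapped Small Patch Embedding sketch: 5x5 patches with stride 2 (3-pixel overlap).

    Implemented here with a strided convolution, a common equivalent of overlapped
    patch splitting + linear projection; padding=2 keeps the H/2 x W/2 output grid.
    """
    def __init__(self, in_ch, embed_dim):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=5, stride=2, padding=2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                       # x: (B, C_in, H, W)
        x = self.proj(x)                        # (B, C, H/2, W/2): projection + down-sampling
        b, c, h, w = x.shape
        x = x.flatten(2).transpose(1, 2)        # (B, N, C) with N = (H/2) * (W/2)
        return self.norm(x), h, w

# Quick shape check (hypothetical sizes):
tokens, h, w = OSPE(in_ch=1, embed_dim=64)(torch.randn(2, 1, 256, 256))
print(tokens.shape, h, w)                       # torch.Size([2, 16384, 64]) 128 128
```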
The self-attention layer aims to capture the long-range dependency between every pair of patches. As shown in Figure 3, given a feature map, the network learns three sets of parameters to project the features ($F$) of size $N \times C$ to the query ($Q$), key ($K$), and value ($V$). Then, the weight is obtained by a similarity calculation between the query and the key; common similarity functions include the dot-product, concatenation, and the perceptron. The softmax function is used to normalize the weight. Finally, we multiply the weight with the corresponding value to obtain the final attention features, which encode how similar each patch is to the other patches in the feature map. The original standard multi-head self-attention makes $Q$, $K$, and $V$ have the same size of $N \times C$ and calculates the self-attention in the form of the dot-product with the following equation:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^{\mathrm{T}}}{\sqrt{d}}\right) V,$$
where $d$ is the dimension of the query and key. We can see that the computational complexity is quadratic with respect to the size of the feature map, i.e., $O(N^2)$, which is prohibitive for large images. Therefore, a spatial reduction is applied to $K$ and $V$, which can be formulated as:
$$\hat{K} = \mathrm{Linear}(C \cdot R,\ C)\big(\mathrm{Reshape}\big(\tfrac{N}{R},\ C \cdot R\big)(K)\big),$$
where $R$ is the reduction ratio. $K$ and $V$ of size $N \times C$ are first reshaped into the size of $\frac{N}{R} \times (C \cdot R)$. Then, a linear projection is utilized to restore the number of channels from $C \cdot R$ to $C$. After such operations, we obtain $\hat{K}$ and $\hat{V}$ of size $\frac{N}{R} \times C$. As a result, the computational complexity of the self-attention is reduced from $O(N^2)$ to $O(N^2 / R)$.
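Under the assumption that the reduction is implemented exactly as described (a reshape to $\frac{N}{R} \times (C \cdot R)$ followed by a linear projection), a single-head sketch of the spatially reduced attention could look as follows; the single-head simplification and the default $R = 4$ are illustrative choices:

```python
import torch
import torch.nn as nn

class SRAttention(nn.Module):
    """Single-head dot-product attention with spatial reduction of K and V (sketch).

    K, V of size (N, C) are reshaped to (N/R, C*R) and linearly projected back to C,
    so the attention map costs O(N^2 / R) instead of O(N^2). Requires N divisible by R.
    """
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.reduce = nn.Linear(dim * reduction, dim)   # restores C*R -> C after the reshape
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5
        self.r = reduction

    def forward(self, x, h, w):                 # x: (B, N, C); h, w kept for interface parity
        b, n, c = x.shape
        q = self.q(x)                                   # (B, N, C)
        xr = x.reshape(b, n // self.r, c * self.r)      # (B, N/R, C*R): spatial reduction
        xr = self.reduce(xr)                            # (B, N/R, C)
        k, v = self.kv(xr).chunk(2, dim=-1)             # (B, N/R, C) each
        attn = (q @ k.transpose(-2, -1)) * self.scale   # (B, N, N/R) attention map
        attn = attn.softmax(dim=-1)
        return self.proj(attn @ v)                      # (B, N, C)
```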
In the FFN, a $3 \times 3$ convolution is utilized to replace the explicit position encoding. Therefore, the encoder is robust to input images of different sizes, as is generally required in segmentation tasks. The FFN can be formulated as:
$$x_{\mathrm{out}} = \mathrm{MLP}\big(\mathrm{GELU}\big(\mathrm{Conv}_{3 \times 3}\big(\mathrm{MLP}(x_{\mathrm{in}})\big)\big)\big) + x_{\mathrm{in}},$$
where $x_{\mathrm{in}}$ denotes the input features of the FFN.
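Assuming a depthwise $3 \times 3$ convolution and a GELU activation (both are assumptions of this sketch rather than confirmed design choices), the FFN could be implemented as:

```python
import torch.nn as nn

class ConvFFN(nn.Module):
    """FFN sketch with a 3x3 (depthwise) convolution replacing explicit position encoding."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1, groups=hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x, h, w):                         # x: (B, N, C) with N = h * w
        x = self.fc1(x)                                 # (B, N, hidden)
        b, n, c = x.shape
        x = x.transpose(1, 2).reshape(b, c, h, w)       # restore the 2D layout for the conv
        x = self.dwconv(x)                              # 3x3 conv supplies positional cues
        x = x.flatten(2).transpose(1, 2)                # back to (B, N, hidden)
        return self.fc2(self.act(x))
```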
3.3. Top-Down Feature Aggregation Module
After obtaining the features of the four scales, we need to aggregate them in a suitable way. In U-Net, transposed convolutions and shortcuts are utilized to fuse features of adjacent scales. However, this design would double the number of parameters in the network. Considering that the number of parameters of a transformer is already larger than that of a CNN, we adopt the simple design of the feature pyramid network (FPN) [44]. In the original FPN, features at different scales are fused by linear addition. This unweighted fusion may lead to information redundancy. Therefore, highlighting important features and suppressing useless ones is a more appropriate way to aggregate.
We present the TFAM to form a progressive decoder. As shown in Figure 2, the features of adjacent stages are fused in top-down order to obtain the final pixel-wise segmentation mask. The structure of the TFAM is shown in Figure 4. During the fusion, an MLP is first used to unify the channel dimensions of the differently scaled features. Then, the upper-level features are up-sampled and concatenated with the lower-level features along the channel dimension. After that, we utilize a convolution layer and a ReLU function to reduce the dimension and obtain the fused features. Finally, channel attention is used to refine the fused features.
Taking the fused features of size $C \times H \times W$ as the input, we first utilize global average pooling to shrink the feature maps and obtain channel-wise statistics. Next, the channel attention, which explicitly models the global information among channels, is obtained after two linear functions and two activation functions. The refined features are obtained by multiplying the channel attention with the input features. In this way, useful features can be highlighted while useless features are suppressed. The overall process can be formulated as:
$$z = \mathrm{GAP}(F), \qquad A = \sigma\big(W_2\,\delta(W_1 z)\big), \qquad \tilde{F} = A \otimes F,$$
where $F$ denotes the input fused features of size $C \times H \times W$, $z$ denotes the channel-wise statistics, $\delta$ denotes the ReLU function, $\sigma$ denotes the sigmoid function, $A$ denotes the channel attention, and $\tilde{F}$ denotes the refined features.
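The TFAM can be sketched as follows, assuming $1 \times 1$ convolutions for the dimension-unifying MLP, a $3 \times 3$ convolution for fusion, and an SE-style reduction ratio; these kernel sizes and the ratio are assumptions of the sketch rather than the exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TFAM(nn.Module):
    """Top-down feature aggregation sketch: unify dims, upsample, concat, fuse, refine."""
    def __init__(self, high_ch, low_ch, out_ch, se_reduction=4):
        super().__init__()
        self.align_high = nn.Conv2d(high_ch, out_ch, 1)    # "MLP" to unify channel dims
        self.align_low = nn.Conv2d(low_ch, out_ch, 1)
        self.fuse = nn.Sequential(nn.Conv2d(2 * out_ch, out_ch, 3, padding=1),
                                  nn.ReLU(inplace=True))
        self.se = nn.Sequential(                           # SE-style channel attention
            nn.Linear(out_ch, out_ch // se_reduction), nn.ReLU(inplace=True),
            nn.Linear(out_ch // se_reduction, out_ch), nn.Sigmoid(),
        )

    def forward(self, high, low):               # high: deeper/smaller map, low: shallower/larger map
        high = F.interpolate(self.align_high(high), size=low.shape[-2:],
                             mode="bilinear", align_corners=False)
        f = self.fuse(torch.cat([high, self.align_low(low)], dim=1))   # fused features F
        z = f.mean(dim=(2, 3))                                         # GAP -> channel statistics z
        a = self.se(z).unsqueeze(-1).unsqueeze(-1)                     # channel attention A
        return f * a                                                   # refined features A (x) F
```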
3.4. Loss Function
Infrared small target detection can be seen as a binary classification of the input image, in which each pixel is distinguished as the target or the background. LSPM [29] utilizes the binary cross-entropy (BCE) loss function for training:
$$L_{\mathrm{BCE}} = -\frac{1}{n}\sum_{i=1}^{n}\big[G_i \log P_i + (1 - G_i)\log(1 - P_i)\big],$$
where $n$ is the batch size, $G$ is the ground truth, and $P$ is the predicted segmentation mask. However, the pixel area of small infrared targets is extremely small; in our test images, the small target has a pixel share of less than 0.03%. Due to the severe imbalance between positive and negative samples, a network supervised by the BCE loss tends to output all zeros during training, because even then the loss value is not very large. In other words, the target is overwhelmed by the background. Secondly, there is no prioritization between the target and the background: all pixels in the image are treated equally. Finally, the loss of each pixel is calculated independently, ignoring the global structure of the image.
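To make the imbalance concrete, the toy calculation below (a hypothetical $256 \times 256$ image with a $3 \times 3$ target, so the numbers are illustrative rather than taken from our dataset) shows that an almost-pure-background prediction already attains a very small BCE value despite missing the target completely:

```python
import torch
import torch.nn.functional as F

# Hypothetical 256x256 mask with a 3x3 target (9 / 65536 ~ 0.014% target pixels).
gt = torch.zeros(1, 1, 256, 256)
gt[..., 100:103, 100:103] = 1.0

# A network that confidently predicts "all background" (probability 0.01 everywhere).
pred_all_bg = torch.full_like(gt, 0.01)

print(F.binary_cross_entropy(pred_all_bg, gt).item())   # ~ 0.011: tiny loss, target ignored
```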
To obtain a better model, we expect the network to focus more on the target region, rather than treating all pixels equally. Intersection over union (IoU) is usually used as the metric for image segmentation, so an intuitive idea is to directly use the IoU as the loss function [45]. In ALCNet [26], AGPCNet [28], and DNANet [46], the softIoU loss function is utilized for infrared small target detection, which is defined as
$$L_{\mathrm{softIoU}} = 1 - \frac{\sum_{i=1}^{n} P_i G_i}{\sum_{i=1}^{n}\big(P_i + G_i - P_i G_i\big)},$$
where $n$ is the batch size, $G$ is the ground truth, and $P$ is the predicted segmentation mask. However, when supervised by the softIoU loss, our network cannot converge, and no target can be detected. We analyze this phenomenon from the perspective of the gradient.
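A minimal implementation of the softIoU loss as written above might look as follows; the smoothing constant and the per-sample summation are assumptions of this sketch:

```python
import torch

def soft_iou_loss(pred, gt, eps=1e-6):
    """softIoU loss sketch: 1 - soft intersection / soft union, computed per sample."""
    pred = torch.sigmoid(pred)                            # logits -> probabilities
    dims = tuple(range(1, pred.dim()))                    # sum over all but the batch dim
    inter = (pred * gt).sum(dim=dims)
    union = (pred + gt - pred * gt).sum(dim=dims)
    return (1.0 - (inter + eps) / (union + eps)).mean()   # average over the batch

# Usage with random logits and a sparse ground-truth mask:
logits = torch.randn(2, 1, 64, 64)
gt = (torch.rand(2, 1, 64, 64) > 0.999).float()
print(soft_iou_loss(logits, gt).item())
```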
For the analysis, we assume that the network produces a single-point output. Consequently, the following equation is used to calculate the loss value:
$$L_{\mathrm{softIoU}} = 1 - \frac{p\,y + \epsilon}{p + y - p\,y + \epsilon}, \qquad p = \mathrm{Sigmoid}(x),$$
where $x$ is the network output, $p$ represents the probability of the pixel being the target, $y$ represents the ground truth of the pixel, among which 0 means the background and 1 means the target, and $\epsilon$ is the smoothing factor, which is a very small value.
Using the chain rule, the gradient of the softIoU loss is as follows and is shown in Figure 5:
$$\frac{\partial L_{\mathrm{softIoU}}}{\partial x} =
\begin{cases}
-\dfrac{p\,(1-p)}{1+\epsilon}, & y = 1, \\[2mm]
\dfrac{\epsilon\, p\,(1-p)}{(p+\epsilon)^{2}}, & y = 0.
\end{cases}$$
Figure 6 shows the network output values $x$ at the first and middle epochs of the training. Because of the weight initialization, the output values at the first epoch are concentrated around 0. As shown by the gradient diagrams, the absolute values of the gradient at this time are close to 0 for negative background samples ($y = 0$) and reach their maximum for positive target samples ($y = 1$). This indicates that the contribution of the background region to the network update is much smaller than that of the target region, which means that the network is more concerned with finding the target. This tendency makes the network segment more pixels and reduces the missed detection of target pixels. Entering the middle epochs, the network outputs negative and positive values for the predicted background and target regions, respectively. As can be seen from the gradient diagrams, both gradient values now tend to be close to 0, so the network parameters are updated slowly. Since the network has entered the gradient saturation zone, if there are false alarms or missed targets at this time, it is difficult for the network to overcome these errors; that is, the network becomes less sensitive to errors.
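The two regimes can be checked numerically with the single-point loss above; the values of $\epsilon$ and of the outputs $x$ used below are illustrative:

```python
import torch

def softiou_point(x, y, eps=1e-6):
    """Single-point softIoU loss: 1 - (p*y + eps) / (p + y - p*y + eps), p = sigmoid(x)."""
    p = torch.sigmoid(x)
    return 1.0 - (p * y + eps) / (p + y - p * y + eps)

def grad_at(x_val, y):
    x = torch.tensor(x_val, requires_grad=True)
    softiou_point(x, torch.tensor(y)).backward()
    return x.grad.item()

# First epoch (outputs near 0): background gradient ~0, target gradient near its maximum.
print(grad_at(0.0, 0.0), grad_at(0.0, 1.0))     # ~1e-6 vs ~-0.25
# Middle epochs (confident outputs): both gradients collapse toward 0 (saturation zone).
print(grad_at(-6.0, 0.0), grad_at(6.0, 1.0))    # both ~0
```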
On the other hand, during training, the value of the loss function varies with changes in the segmentation mask. To ensure smooth training, large changes in the loss value need to be avoided. In the softIoU loss, a smaller target pixel area (the denominator) leads to a larger change in the loss value for the same change in the prediction (the numerator), which further leads to drastic gradient changes. Once the network enters the saturation zone in the early stage of training, it becomes difficult to converge. Therefore, compared with generic instance segmentation, a single softIoU loss leads to unstable training when the network performs infrared small target segmentation.
We also analyze the BCE loss function from the perspective of the gradient. The following equation is used to calculate the loss value:
$$L_{\mathrm{BCE}} = -\big[y \log p + (1 - y)\log(1 - p)\big], \qquad p = \mathrm{Sigmoid}(x),$$
where $x$ is the network output, $p$ represents the probability of the pixel being the target, and $y$ represents the ground truth of the pixel, among which 0 means the background and 1 means the target. According to the chain rule, the gradient of the BCE loss is
$$\frac{\partial L_{\mathrm{BCE}}}{\partial x} = p - y.$$
We can observe that the gradient of the BCE loss is equal to the prediction error, so positive and negative samples contribute to the gradient equally. In the early training period, the prediction error is relatively large; therefore, the gradient is large and the network parameters are updated quickly. In the later period, as the prediction error decreases, the parameters are updated more slowly and the network gradually converges to a stable state.
Based on the analysis above, we propose the combined BCE and softIoU (CBS) loss, formulated as:
$$L_{\mathrm{CBS}} = L_{\mathrm{softIoU}} + \alpha \ln\big(\beta L_{\mathrm{BCE}} + 1\big),$$
where $\alpha$ and $\beta$ are adjustment coefficients. The CBS loss consists of the softIoU loss and the BCE loss: the softIoU term focuses on the target region and thus mitigates the category imbalance, while the BCE term provides smooth gradient values. Inspired by Libra RCNN [47], we utilize the natural logarithm to balance the values of the two parts. In Section 5.3, we explore different forms and parameters of the combination. Compared with a weighted addition, the natural logarithm can adaptively adjust the two loss terms at different epochs of the training. The gradient of the CBS loss is formulated as
$$\frac{\partial L_{\mathrm{CBS}}}{\partial x} = \frac{\partial L_{\mathrm{softIoU}}}{\partial x} + \frac{\alpha \beta}{\beta L_{\mathrm{BCE}} + 1}\,\frac{\partial L_{\mathrm{BCE}}}{\partial x}.$$
At the early epochs of the training, the value of $L_{\mathrm{BCE}}$ is relatively large. Due to the adjustment coefficients $\alpha$ and $\beta$, the contribution of $L_{\mathrm{BCE}}$ to the gradient is reduced, and the gradient mainly comes from $L_{\mathrm{softIoU}}$, so the network focuses more on the target area. When the training enters the late epochs, the value of $L_{\mathrm{BCE}}$ becomes small, and due to the effect of the adjustment coefficients, the gradient contribution of $L_{\mathrm{BCE}}$ is increased. Even if errors occur, causing $L_{\mathrm{softIoU}}$ to saturate, $L_{\mathrm{BCE}}$ can still supply enough gradient for the network parameters to continue iterating.
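A sketch of such a log-balanced combination, consistent with the gradient analysis above, is given below. The concrete values of $\alpha$ and $\beta$ and the exact combination form are those explored in Section 5.3, so the placeholders used here ($\alpha = \beta = 1$) are illustrative rather than our training configuration:

```python
import torch
import torch.nn.functional as F

def cbs_loss(logits, gt, alpha=1.0, beta=1.0, eps=1e-6):
    """Sketch of a combined BCE + softIoU (CBS) loss: softIoU + alpha * ln(beta * BCE + 1).

    alpha/beta are placeholder adjustment coefficients; the log term damps the BCE
    contribution early in training (large BCE) and restores it once BCE becomes small.
    """
    pred = torch.sigmoid(logits)
    dims = tuple(range(1, pred.dim()))
    inter = (pred * gt).sum(dim=dims)
    union = (pred + gt - pred * gt).sum(dim=dims)
    soft_iou = (1.0 - (inter + eps) / (union + eps)).mean()
    bce = F.binary_cross_entropy_with_logits(logits, gt)
    return soft_iou + alpha * torch.log(beta * bce + 1.0)

# Usage: gradients flow through both terms.
logits = torch.randn(2, 1, 64, 64, requires_grad=True)
gt = (torch.rand(2, 1, 64, 64) > 0.999).float()
cbs_loss(logits, gt).backward()
```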