PSOL for Weakly Supervised Object Localization

By Zekiye E., Vaishak K., Shivender K., Abdul H. D., Mayuresh S., Mohamed E., Salman M. and Arshak N.

What is WSOL?

Object localization and object detection are well-researched computer vision problems. The main task of these methods is to locate instances of a particular object category in an image with tightly cropped bounding boxes centered on the instances. Supervised models trained on richly annotated images achieve very good results. However, obtaining detailed image annotations such as bounding boxes is both expensive and often subjective. Weakly supervised object localization (WSOL) methods are therefore increasingly popular, since they promise to train localization models with only image-level labels (e.g. class labels).

In the seminal WSOL paper, 'Learning Deep Features for Discriminative Localization', the researchers use CNNs with global average pooling to localize objects. They identify which regions of an image are used for discrimination by computing class activation maps, i.e. weighted sums of the last convolutional feature maps. After this work, most research in the field focused on how to expand the attention regions to cover objects more broadly.
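The weighted activation map idea can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a network whose last convolutional layer feeds global average pooling and a single linear classifier. The names `class_activation_map`, `feature_maps` and `fc_weights`, and the array shapes, are illustrative assumptions and are not taken from the paper's code.

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """Weighted sum of the last conv feature maps for one class.

    feature_maps: (K, H, W) activations of the last convolutional layer.
    fc_weights:   (num_classes, K) weights of the linear layer that follows
                  global average pooling (assumed layout).
    """
    weights = fc_weights[class_idx]                     # (K,)
    cam = np.tensordot(weights, feature_maps, axes=1)   # (H, W)
    # Rescale to [0, 1] so the map can be thresholded or visualized.
    cam = cam - cam.min()
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

# Toy input: 8 feature maps of size 7x7 and a 3-class linear layer.
rng = np.random.default_rng(0)
feature_maps = rng.random((8, 7, 7))
fc_weights = rng.normal(size=(3, 8))
cam = class_activation_map(feature_maps, fc_weights, class_idx=1)
```

High values in `cam` mark the regions that contribute most to the chosen class score, which is why such maps tend to highlight only the most discriminative part of an object.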

Here are some papers in benchmarks:

PSOL

The problem with these approaches is that localization is a byproduct of a classification model, so the main objective is indirect. Moreover, the classification model often tries to localize only the most important discriminative part of the object, not the whole object in the image.

We follow the methods in the paper ‘Rethinking the Route Towards Weakly Supervised Object Localization’. They present the pseudo supervised object localization (PSOL) method as a new way to solve WSOL. In general, they divide WSOL into two independent sub-tasks: class-agnostic object localization and object classification.

We chose this method because it achieves 58.00% localization accuracy on ImageNet and 74.97% localization accuracy on CUB-200, a large margin over previous models. (ART methods have 65.22% top-1 localization accuracy on CUB-200.) Moreover, the authors claim that the PSOL method has good localization transferability across different datasets.

The main difference in the PSOL method is that after generating pseudo bounding boxes, they optimize the localization model on these inaccurate boxes by using regression techniques. They generate pseudo bounding boxes by using the Deep Descriptor Transform Method (DDT).
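To make the regression step concrete, here is a minimal sketch of a box regression loss on a pseudo bounding box. It assumes the box is encoded as (x, y, w, h) normalized by the image size, that the network outputs are squashed with a sigmoid, and that the loss is a mean squared error. These choices, and the helper names `normalize_box` and `regression_loss`, are assumptions made for illustration; the authors' implementation may differ.

```python
import math

def normalize_box(box, img_w, img_h):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to (x, y, w, h) in [0, 1]."""
    x_min, y_min, x_max, y_max = box
    return [x_min / img_w, y_min / img_h,
            (x_max - x_min) / img_w, (y_max - y_min) / img_h]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def regression_loss(raw_outputs, pseudo_box):
    """Mean squared error between squashed network outputs and the pseudo box."""
    preds = [sigmoid(z) for z in raw_outputs]
    return sum((p - t) ** 2 for p, t in zip(preds, pseudo_box)) / len(pseudo_box)

# A pseudo box on a 224x224 image and an untrained network that outputs zeros.
target = normalize_box((56, 28, 168, 196), img_w=224, img_h=224)
loss = regression_loss([0.0, 0.0, 0.0, 0.0], target)
```

Because the targets are pseudo boxes rather than ground truth, the loss measures agreement with noisy labels; the regression model is expected to smooth over some of that noise.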

Here is an illustration of the method:

Taken from Rethinking the Route Towards Weakly Supervised Object Localization.

Algorithm:

Input: Training images I_tr with class labels L_tr

Output: Predicted bounding boxes b_te and class labels L_te on testing images I_te

1. Generate pseudo bounding boxes b̃_tr on I_tr using Deep Descriptor Transform

2. Train a localization CNN F_loc on I_tr with b̃_tr

3. Train a classification CNN F_cls on I_tr with L_tr

4. Use F_loc to predict b_te on I_te

5. Use F_cls to predict L_te on I_te

6. Return: b_te, L_te
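The six steps above can be sketched as a single Python function. This is only a skeleton: `generate_pseudo_boxes`, `train_localizer` and `train_classifier` are hypothetical callables supplied by the caller, not functions from the PSOL code base. The point is to show that localization and classification are trained and applied independently.

```python
def psol_pipeline(train_images, train_labels, test_images,
                  generate_pseudo_boxes, train_localizer, train_classifier):
    """Run the six PSOL steps with caller-supplied (hypothetical) components."""
    # 1. Pseudo boxes for the training images (e.g. from DDT). Labels are
    #    passed along in case the generator groups images by class.
    pseudo_boxes = generate_pseudo_boxes(train_images, train_labels)
    # 2. Localization model trained by regression on the pseudo boxes.
    f_loc = train_localizer(train_images, pseudo_boxes)
    # 3. Classification model trained on the image-level labels only.
    f_cls = train_classifier(train_images, train_labels)
    # 4. and 5. The two models predict independently on the test images.
    boxes = [f_loc(img) for img in test_images]
    labels = [f_cls(img) for img in test_images]
    # 6. Return both predictions.
    return boxes, labels

# Toy stand-ins so the skeleton runs end to end.
boxes, labels = psol_pipeline(
    train_images=["a.jpg", "b.jpg"], train_labels=[0, 1], test_images=["c.jpg"],
    generate_pseudo_boxes=lambda imgs, lbls: [(0, 0, 10, 10) for _ in imgs],
    train_localizer=lambda imgs, bxs: (lambda img: bxs[0]),
    train_classifier=lambda imgs, lbls: (lambda img: lbls[0]),
)
```

Since the two branches share nothing but the training images, either model can be swapped for a different architecture without retraining the other.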

PSOL Model performance statistics

Deep Descriptor Transform

Deep Descriptor Transform is a co-localization (a.k.a. unsupervised object discovery) method proposed in the paper 'Deep Descriptor Transforming for Image Co-Localization'. The goal of image co-localization is to locate the common object in a set of unlabelled images. The strength of the DDT algorithm is its simplicity, which also makes it fast compared to other techniques. It works as follows. Given a set of images containing a common object, the set is fed into a pre-trained CNN model. The descriptors (i.e. the activations at each spatial position) of the last convolutional layer are collected from all images, and their mean and covariance matrix are computed. The eigenvector of the covariance matrix with the largest eigenvalue, i.e. the first principal component, captures the direction along which the descriptors vary most across the set. Projecting each mean-centered descriptor onto this eigenvector shows which parts of an image belong to the common object: positive values indicate the object, the rest background. Drawing the minimum bounding box around the largest connected component of positive values gives the bounding box prediction.
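The procedure can be sketched with NumPy as follows. This is a simplified illustration, not the reference implementation: it works directly on the feature-map grid (the real method resizes the projection map to the image size first), and because the sign of an eigenvector is arbitrary, it orients the component so that its entries sum to a positive value, which is an assumption of this sketch. The function names are hypothetical.

```python
import numpy as np
from collections import deque

def ddt_indicator_maps(features):
    """features: (N, H, W, C) last-conv descriptors of N images sharing one object.

    Returns (N, H, W) projections onto the first principal component.
    """
    n, h, w, c = features.shape
    descriptors = features.reshape(-1, c)
    centered = descriptors - descriptors.mean(axis=0)
    cov = centered.T @ centered / centered.shape[0]
    _, eigvecs = np.linalg.eigh(cov)      # eigenvalues come in ascending order
    principal = eigvecs[:, -1]            # eigenvector of the largest eigenvalue
    if principal.sum() < 0:               # sketch-only sign convention
        principal = -principal
    return (centered @ principal).reshape(n, h, w)

def largest_component_box(mask):
    """Box (x_min, y_min, x_max, y_max) of the largest 4-connected True region."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    best = []
    for sy in range(h):
        for sx in range(w):
            if not mask[sy, sx] or seen[sy, sx]:
                continue
            queue, comp = deque([(sy, sx)]), []
            seen[sy, sx] = True
            while queue:                  # breadth-first flood fill
                y, x = queue.popleft()
                comp.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            if len(comp) > len(best):
                best = comp
    if not best:
        return None
    ys, xs = zip(*best)
    return (min(xs), min(ys), max(xs), max(ys))

# Toy data: weak background activations plus a shared "object" pattern.
rng = np.random.default_rng(0)
features = rng.random((4, 6, 6, 16)) * 0.1
features[:, 1:4, 2:5, :8] += 1.0
maps = ddt_indicator_maps(features)
box = largest_component_box(maps[0] > 0)   # box in feature-map coordinates
```

In the toy data the object occupies rows 1 to 3 and columns 2 to 4 of every feature map, so the recovered box covers exactly that region.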

Results

We now show some of the results we achieved using the PSOL model. We ran two different experiments. First, we used the authors' models pretrained on ImageNet to reproduce the results reported in the paper. Second, we trained the model on specific datasets, using the weights of the pretrained model as initial weights. These datasets belong to specific domains, for example fashion, food, or medicine.

Results using pretrained models on ImageNet

In the table below, we show the results achieved by the ImageNet-pretrained models. We tested three pretrained architectures (ResNet50, VGG16, and DenseNet161). On datasets with a single object per image, we achieve an overall accuracy above 85%, which shows that the model generalizes well to these datasets. However, on multi-label/multi-object datasets the accuracy drops significantly. This is because the model can only produce a single bounding box per image, so when the predicted box is compared with more than one ground truth bounding box, the accuracy becomes very low.
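The effect of a single predicted box on multi-object images can be illustrated with a small intersection-over-union (IoU) computation. The sketch below assumes boxes in (x_min, y_min, x_max, y_max) format and the common convention that a prediction counts as correct when its IoU with the ground truth is at least 0.5. The box coordinates are made up for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# One predicted box that spans two separate objects.
predicted = (0, 0, 250, 100)
ground_truths = [(0, 0, 100, 100), (150, 0, 250, 100)]
scores = [iou(predicted, gt) for gt in ground_truths]
correct = [s >= 0.5 for s in scores]
```

Here the predicted box fully contains both objects, yet its IoU with each ground truth box stays below the threshold, so neither object counts as localized.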

Below are some of the images generated with the pretrained models. The yellow or blue box is the prediction and the green box is the ground truth.

Successful predictions:

Unsuccessful predictions:

Results using models trained on different datasets

We followed the PSOL method and trained models on the UECFOOD256 and CUB-200-2011 datasets. Since the classification and localization parts of the PSOL method are separate, we focused only on localization and did not train any classification models. We use average IoU and GT-Loc as accuracy metrics instead of Top-1 Loc and Top-5 Loc. When we tested these models, the accuracy was not as good as that of the model pretrained on ImageNet. A likely reason is that ImageNet is a huge dataset containing many different objects and labels, whereas the two datasets used here contain only images of dishes and of bird species, respectively. We also trained models on datasets without ground truth bounding boxes, such as qmul_toplogo10, but these models were not very successful.
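As a rough sketch of how the two metrics can be computed, the snippet below takes one predicted and one ground truth box per image. It assumes that GT-Loc means the fraction of images whose predicted box reaches an IoU of at least 0.5 with the ground truth box, independent of any class prediction. The function names and the example boxes are illustrative, not taken from our evaluation code.

```python
import numpy as np

def pairwise_iou(pred, gt):
    """Row-wise IoU for (N, 4) arrays of (x_min, y_min, x_max, y_max) boxes.

    Assumes non-degenerate boxes, so every union is positive.
    """
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    inter_w = np.clip(np.minimum(pred[:, 2], gt[:, 2]) - np.maximum(pred[:, 0], gt[:, 0]), 0, None)
    inter_h = np.clip(np.minimum(pred[:, 3], gt[:, 3]) - np.maximum(pred[:, 1], gt[:, 1]), 0, None)
    inter = inter_w * inter_h
    area_pred = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_gt = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    return inter / (area_pred + area_gt - inter)

def localization_metrics(pred, gt, threshold=0.5):
    """Average IoU and the share of boxes with IoU at or above the threshold."""
    ious = pairwise_iou(pred, gt)
    return {"avg_iou": float(ious.mean()),
            "gt_loc": float((ious >= threshold).mean())}

# Three images: a perfect box, a half-overlapping box, and a miss.
pred = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
gt = [(0, 0, 10, 10), (0, 0, 10, 20), (20, 20, 30, 30)]
metrics = localization_metrics(pred, gt)
```

Average IoU rewards partial overlap, while GT-Loc only counts boxes that clear the threshold, so the two numbers can diverge noticeably on hard datasets.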

Conclusion:

In this blog post, we explored the PSOL method for solving the WSOL problem. Our main goal was to construct a model that localizes object instances of one or more given object classes regardless of scale, location, pose and partial occlusion, using only class labels. The paper proposes the PSOL method to address the drawbacks of previous weakly supervised object localization methods. Various experiments show that the method achieves a notable edge over previous methods. For example, the PSOL DenseNet161-Sep model roughly matches fully supervised AlexNet in Top-5 Loc accuracy. Furthermore, the PSOL method has good transferability across different datasets without any training or fine-tuning.
As the resulting visualizations show, PSOL performs almost as well as fully supervised methods when there is just one object in the image, as in CUB-200-2011. When there are multiple ground truth bounding boxes, the model fails because it can only output one bounding box per image. The model also performs poorly when the object is not salient in the image.