Using CycleGAN for Image Translation to Increase the Size of a Fridge Food Types Dataset
By Arshak N., Aishwarya S., Arun S., Sanket T., Lyubomir C. and Mohammed E.
Abstract
CycleGAN models are used for image-to-image translation tasks, such as translating photographs of horses to zebras or apples to oranges, and the reverse.
We investigate model architectures, loss functions and training sets to determine the optimal combination for translating Blender-rendered fridge images into a more realistic version of a fridge. We then compare the original CycleGAN model [5] against a modified version of CycleGAN [4]. Using the modified version, we find that the features of the real in-domain images are extracted better and that the quality and colors of the output images are more realistic.
Introduction
The goal of this project is to develop an object detection model that can detect 20 classes of food items in images captured by in-fridge cameras. The task presents several challenges beyond simple object detection, such as occlusion of objects due to their tight arrangement, low-light conditions, and a fish-eye-like view, since the camera needs to cover a large portion of the fridge interior. All of these can be seen in the sample image in Fig. 1. The algorithm needs to account for occlusion of objects within the images and the other issues mentioned above.
Dataset
Data Sources
• Images from the wild – results of manual search
• Images from various publicly available datasets
• Images collected via web scraping

Annotations
The annotations were then exported and converted to COCO format using our own scripts or RoboFlow, since the DETR model accepts input annotations only in COCO format.
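As an illustration, the sketch below mirrors what such a conversion script produces; the input record layout and helper names are hypothetical, but the output fields follow the COCO detection format.

```python
import json

def to_coco(raw_records, categories, out_path="annotations_coco.json"):
    """Convert simple per-image box records into a COCO-style annotation file.

    raw_records: list of dicts such as
        {"file": "img_001.jpg", "width": 1280, "height": 720,
         "boxes": [{"label": "tomato", "x": 10, "y": 20, "w": 50, "h": 40}]}
    categories: list of class names (the 20 food classes in our case).
    """
    cat_ids = {name: i + 1 for i, name in enumerate(categories)}
    coco = {
        "images": [],
        "annotations": [],
        "categories": [{"id": cid, "name": name} for name, cid in cat_ids.items()],
    }
    ann_id = 1
    for img_id, rec in enumerate(raw_records, start=1):
        coco["images"].append({"id": img_id, "file_name": rec["file"],
                               "width": rec["width"], "height": rec["height"]})
        for b in rec["boxes"]:
            coco["annotations"].append({
                "id": ann_id,
                "image_id": img_id,
                "category_id": cat_ids[b["label"]],
                "bbox": [b["x"], b["y"], b["w"], b["h"]],  # COCO boxes are [x, y, width, height]
                "area": b["w"] * b["h"],
                "iscrowd": 0,
            })
            ann_id += 1
    with open(out_path, "w") as f:
        json.dump(coco, f)
```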
Object detection using DETR
The model was inspired by the DETR paper [1], whose authors also provide an implementation of the architecture in the PyTorch framework.
We decided to go with this model architecture because it is an end-to-end detection model, unlike other models that detect objects in two steps (finding all possible bounding boxes and then classifying them). DETR is therefore ideal for quick prototyping while remaining competitive with traditional object detectors. We ran this model on the data gathered earlier and obtained the results shown in Fig. 4.
We ran it with only one configuration, which resulted in an AP(50) of 46 for 20 classes:
- ResNet-50 backbone (DC5) with standard LR and a scheduled LR drop at epoch 20, trained for 35 epochs
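For reference, the DC5 ResNet-50 variant of DETR used above can be loaded through the torch.hub entry point of the authors' repository; the snippet below is a minimal inference sketch using the COCO-pretrained weights (the sample image path and confidence threshold are illustrative, and fine-tuning on our 20 classes is not shown).

```python
import torch
import torchvision.transforms as T
from PIL import Image

# DC5 ResNet-50 DETR with COCO-pretrained weights from the official repository.
model = torch.hub.load('facebookresearch/detr', 'detr_resnet50_dc5', pretrained=True)
model.eval()

# Standard ImageNet normalisation used by DETR.
transform = T.Compose([
    T.Resize(800),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

img = Image.open('fridge_sample.jpg').convert('RGB')  # hypothetical in-domain image
with torch.no_grad():
    outputs = model(transform(img).unsqueeze(0))

# 'pred_logits' holds per-query class scores, 'pred_boxes' normalised cxcywh boxes.
probs = outputs['pred_logits'].softmax(-1)[0, :, :-1]
keep = probs.max(-1).values > 0.7  # keep confident queries only
print('number of detections:', keep.sum().item())
```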
It is clear that the model does not perform as promised in the references. The biggest reason is the difference in data: the published state-of-the-art results were obtained on datasets with tens of thousands of training images, whereas we managed to gather barely a few thousand images, of which only about 700 are in-domain.
We tried to compensate for the per-class imbalance by adding images of objects outside of a fridge. We added numerous objects (tomatoes, eggs, lemons, etc.), but this did not show any improvement. It suggests that adding a few more (a hundred or two) images will not make much difference unless we add a significant number of images per class, preferably in-domain, and reach at least the tens-of-thousands mark.
Synthetic Data Generation
Even after spending a considerable amount of time and effort on data curation from the above data sources, the dataset was not large enough for satisfactory predictions from the DETR model. This led us to consider augmenting the dataset with synthetic images. Prior work taking this approach includes the following:
- Food Det: Detecting foods in refrigerator with supervised transformer network (Zhu et al.): AP of 74% for 80 classes, 50k images (500k annotations) created manually with a camera and fridge
- Deep Learning Performance in the Presence of Significant Occlusions (Koporec et al.): AP of 83% for 10% fridge fill, 43% for 30% fridge fill (natural images largely in the 30–50% range), 95k synthetic images created using 3D CAD models and the Cycles renderer
Since we needed to generate images by artificial means, we chose Blender because it is an open-source application with a tightly integrated Python API. We had to do a lot of work on top of that API, since it is aimed at designers rather than developers, so we built our own high-level API over it that, for the most part, bridges the gap between developers and Blender experts.
The initial idea was to generate images from actual 3D models and render them in 2D, because data augmentation was not helping much in our project and deep learning image generation models cannot create new scenes on their own. This led us to build our own fridge as a CAD model and render it with different objects placed inside.
In the last step of synthetic data generation, we needed to convert these rendered images into more natural-looking images, as the 3D models, at least the freely available ones, are not the best representation of actual objects in real life. Following research already done by others, we hoped to use style transfer to make the synthetic images look more natural; these experiments are discussed in the section CycleGAN Models and Results.
Blender to render 3D environment
Our Blender automation module imports a fridge 3D model along with food item models that we found for free on open platforms like GrabCAD, Free3D and other free CAD model websites. It then creates images by choosing a different number of objects at random, setting their locations to random points in 3D space, and setting their orientations at random following a uniform probability distribution, subject to the constraints that we configured.
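A minimal sketch of this randomised scene set-up, written directly against Blender's bpy API, is shown below; the model paths, object counts and placement bounds are illustrative rather than our actual configuration.

```python
import random
import bpy

def build_random_scene(food_model_paths, n_min=3, n_max=8, out_path="/tmp/render.png"):
    """Place a random selection of food models inside the fridge scene and render one frame."""
    n_objects = random.randint(n_min, n_max)
    for path in random.sample(food_model_paths, n_objects):
        bpy.ops.import_scene.obj(filepath=path)   # import a free CAD model (.obj, Blender 2.8x/2.9x API)
        obj = bpy.context.selected_objects[0]
        # Random position inside a box roughly matching a shelf (bounds are illustrative).
        obj.location = (random.uniform(-0.3, 0.3),
                        random.uniform(-0.2, 0.2),
                        random.uniform(0.0, 0.1))
        # Uniformly random orientation around all three axes.
        obj.rotation_euler = tuple(random.uniform(0.0, 6.2832) for _ in range(3))
    bpy.context.scene.render.filepath = out_path
    bpy.ops.render.render(write_still=True)       # write the rendered 2D image to disk
```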
We have documented the whole Blender pipeline (described in Fig. 5) on P2. If you want to reproduce it or build your own automation with the help of our high-level API, we recommend the module documentation for the module's internal specifics.
Apart from bounding-box annotations, Blender can also generate segmentation and monocular annotations, which we integrated into our Blender automation module. The output of these annotations is illustrated in Fig. 7.
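For the bounding-box part, one common approach with bpy (sketched below, and only an approximation of what the module actually does; see the module documentation for specifics) is to project each object's vertices into camera space and take the extremes in image coordinates.

```python
import bpy
from bpy_extras.object_utils import world_to_camera_view

def bbox_2d(scene, camera, obj):
    """Return a pixel-space [x, y, w, h] bounding box for obj as seen by camera."""
    res_x, res_y = scene.render.resolution_x, scene.render.resolution_y
    xs, ys = [], []
    for v in obj.data.vertices:
        # Project the vertex into normalised camera coordinates (0..1, origin bottom-left).
        co = world_to_camera_view(scene, camera, obj.matrix_world @ v.co)
        xs.append(co.x)
        ys.append(co.y)
    x_min, x_max = max(min(xs), 0.0), min(max(xs), 1.0)
    y_min, y_max = max(min(ys), 0.0), min(max(ys), 1.0)
    # Flip y because image coordinates put the origin at the top-left.
    return [x_min * res_x, (1.0 - y_max) * res_y,
            (x_max - x_min) * res_x, (y_max - y_min) * res_y]
```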
Synthetic to natural: Style Transfer
Synthetic image generation with Blender was a promising way to expand the dataset in order to achieve acceptable accuracy from the DETR model. The only caveat we noticed, however, was that the Blender output images failed to capture the features of a fridge interior in a photo-realistic manner. Style transfer / domain transfer using GANs on these Blender-generated images therefore seemed the obvious next route for expanding the dataset. The GANs that we experimented with were:
- Style GAN [3]
- Pix2Pix GAN [2]
- CycleGAN [5, 4]
Out of the above three routes, CycleGAN gave better results than the other two, and its pipeline is elaborated here in detail.
Image-to-image translation involves generating a new synthetic version of a given image with a specific modification, such as translating a summer landscape to winter. Training a model for image-to-image translation typically requires a large dataset of paired examples. These datasets can be difficult and expensive to prepare, and in some cases impossible, such as photographs of paintings by long dead artists.
CycleGAN is a technique for automatically training image-to-image translation models without paired examples. The models are trained in an unsupervised manner using a collection of images from the source and target domains that do not need to be related in any way. This simple technique is powerful, achieving visually impressive results on a range of application domains, most notably translating photographs of horses to zebras, and the reverse.
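Concretely, CycleGAN trains two generators (G: A→B and F: B→A) and two discriminators with an adversarial loss plus a cycle-consistency loss that forces F(G(x)) ≈ x and G(F(y)) ≈ y [5]. A minimal PyTorch sketch of the cycle-consistency term is given below (the generator networks themselves are omitted; the weight of 10 follows the original paper).

```python
import torch.nn as nn

l1 = nn.L1Loss()
lambda_cyc = 10.0  # cycle-consistency weight from the original CycleGAN paper [5]

def cycle_consistency_loss(G_ab, G_ba, real_a, real_b):
    """lambda_cyc * (||G_ba(G_ab(a)) - a||_1 + ||G_ab(G_ba(b)) - b||_1)."""
    rec_a = G_ba(G_ab(real_a))   # A -> B -> A reconstruction
    rec_b = G_ab(G_ba(real_b))   # B -> A -> B reconstruction
    return lambda_cyc * (l1(rec_a, real_a) + l1(rec_b, real_b))
```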
In the upcoming sections, we investigate different CycleGAN techniques for unpaired image-to-image translation.
Dataset & Augmentations
The objective of the CycleGAN stage was to convert Blender images (domain A) into more realistic images of fridges (domain B). Some examples of the domain A and domain B datasets can be seen in Figures 8 and 9, respectively.
In total, 534 images were created for domain A and 381 images were collected for domain B. Domain B images were gathered from the internet and cleaned up; they contained fridges stacked with different kinds of food, but also empty and almost-empty fridges. Adding empty-fridge images improved the quality of the output, as the model could extract some of the features of an actual fridge more easily when it was empty. Keep in mind that using only empty-fridge images would result in white output images, which was not the desired output.
The augmentation used during training consisted of resizing all images to a standard size of 512x512 and taking random crops of 128x128. Initially, training was done by inputting the entire image; experiments were run with input sizes of 128, 256 and 512, and other experiments used crop sizes of 64 and 256. However, the augmentation technique mentioned at the beginning showed the most promising results.
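A sketch of this pipeline using torchvision transforms is shown below; the horizontal flip and the normalisation values are assumptions rather than confirmed settings.

```python
import torchvision.transforms as T

# Resize every image to 512x512, then take a fresh random 128x128 crop each iteration.
train_transform = T.Compose([
    T.Resize((512, 512)),
    T.RandomCrop(128),
    T.RandomHorizontalFlip(),                        # assumed; common in CycleGAN training
    T.ToTensor(),
    T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),   # map to [-1, 1] for tanh-output generators
])
```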
CycleGAN Models and Results
In this section, the parameters of the CycleGAN models are discussed, along with the results associated with each model.
Original CycleGAN
Table 1 shows the parameters used for the original CycleGAN implementation. Two different image sizes were used to see whether this would improve output image quality.
Figure 10 below shows some of the examples predicted by the model.
Modified CycleGAN
While the results of the original CycleGAN are great in many applications, the pixel-level cycle consistency can be problematic and cause unrealistic images in certain cases [4]. In our blender-to-fridge example, the output of CycleGAN was not as good as hoped, hence the use of the modified CycleGAN parameters, which can be seen in Table 2.
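One of the changes proposed in [4] is to relax the pixel-level constraint, for instance by decaying the weight of the cycle-consistency loss as training progresses. The sketch below illustrates such a schedule only; the actual schedule and values behind Table 2 are not reproduced here.

```python
def cycle_weight(epoch, total_epochs, start=10.0, end=2.5):
    """Linearly decay the cycle-consistency weight from `start` to `end` over training,
    loosening the pixel-level constraint in later epochs (values are illustrative)."""
    t = min(epoch / max(total_epochs - 1, 1), 1.0)
    return start + t * (end - start)

# Inside the training loop, the total generator loss would then look roughly like:
# loss_G = adversarial_loss + cycle_weight(epoch, n_epochs) * cycle_consistency_loss(...)
```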
Figure 11 below shows some of the examples predicted by the model.
There is a clear improvement in terms of transferring the objects inside the Blender fridge to the output. Despite that improvement, though, the output image still looks closer to a Blender image than to an actual fridge image. This can be improved with further collection of real in-domain fridge images as well as by introducing additional augmentation techniques to the training dataset.
Test new data with DETR
To measure the effectiveness of using a generative adversarial network for domain transfer of the Blender synthetic data, the DETR [1] object detection algorithm was trained and evaluated on the collected data alone and on the collected data combined with the synthetic images. We combined initial sets of synthetic images with the collected in-domain and out-of-domain images to train DETR, while not including any synthetic images in the validation dataset. Training DETR with dilation for 35 epochs, with a learning rate drop at epoch 20, showed a marginal improvement when synthetic images were included in the training set. This lets us conjecture that even a minor domain shift towards our fridge images can positively affect the results.
Domain shift visualization
Domain shift can be visualized by projecting high-level features of images from different domains into 2D space. The visualization in Figure 12 is obtained by extracting visual features from the different domains, reducing their dimensionality with PCA, and then running the t-SNE algorithm on top of that. t-SNE has quadratic time complexity, so reducing the dimensions before running it can considerably reduce computation time for larger datasets. To encode images, we extract visual features from the fc2 layer of VGG-16.
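A sketch of this feature extraction and projection pipeline is given below; the exact layer indexing into torchvision's VGG-16 classifier, the number of PCA components and the t-SNE perplexity are assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# VGG-16 truncated after the second 4096-d fully connected layer ("fc2") and its activation.
vgg = models.vgg16(pretrained=True)
vgg.classifier = nn.Sequential(*list(vgg.classifier.children())[:5])
vgg.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def encode(images):
    """images: list of PIL images -> (N, 4096) feature matrix."""
    with torch.no_grad():
        batch = torch.stack([preprocess(im) for im in images])
        return vgg(batch).numpy()

def project_2d(features):
    """PCA to 50 dimensions first (t-SNE is quadratic and slow in high dimensions),
    then t-SNE down to 2D for plotting."""
    reduced = PCA(n_components=50).fit_transform(features)
    return TSNE(n_components=2, perplexity=30).fit_transform(reduced)
```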
As can be concluded from the projection in Figure 12, the images passed through the style transfer module, in the hope of making them look more realistic, form their own cluster in the visualization. This indicates that they are not similar to the other images (scraped from the web, Arshak's fridge images). Although similar domains should lie closer to each other on the graph, it is difficult to determine whether the domain shift has made the synthetic images more similar to the collected data. What can be observed is that the distribution of GAN-produced images has expanded, arguably moving it closer to that of the original images.
So far, the CycleGAN process described in the last section has failed to make the synthetic images look realistic enough for us to use them for training our object detection model.
Future Work
Further investigations using CycleGAN can be done. As CycleGAN is parameter-sensitive, different combinations of parameters can still be tried out. In addition, further data collection is necessary to improve both the quality and the feature extraction of the CycleGAN model. Roughly an extra 1000 images for each domain could significantly boost the performance of the model; however, proper cleaning is essential to guarantee further improvement. Furthermore, the recently published NVIDIA blog on StyleGAN2 with ADA seems very promising: they use various augmentation techniques to drastically increase the effective size of their data, which improves the performance of GAN models. Finally, adding the output of CycleGAN to DETR's training dataset could potentially improve the detection capabilities of the DETR model. Further adjustments can be made to the CycleGAN hyper-parameters to shift the domain of the output images closer to the real in-domain fridge images.
REFERENCES
[1] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko. End-to-end object detection with transformers, 2020.
[2] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks, 2018.
[3] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks, 2019.
[4] T. Wang and Y. Lin. CycleGAN with better cycles.
[5] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2242–2251, 2017.