A case for custom deep learning solutions
By Arshak N. and Vaishak K.
AWS Rekognition and the Azure Computer Vision module are two popular and widely used deep learning platforms for computer vision. Both can identify a wide range of objects in an image with very good accuracy: natural objects, the human body and face, household items, animals, birds, vehicles, gadgets, food items and more. Their ability to recognize a variety of natural and man-made objects at decent accuracy, with models that work right out of the box without any training, makes them a good choice for many deep learning computer vision applications. But are they good enough for all kinds of real-world applications, or do specialized deep learning solutions have a case?
In this article we try to answer that question in the context of identifying raw food items placed inside an oven. Identifying raw food items is one thing; identifying them inside an oven can be a whole different ball game - artefacts like the oven grill, a peculiar viewing angle and specific lighting conditions all add to the challenges for the classifier.
The dataset used for the evaluation had 665 images of food items placed in an oven, spread over 46 different classes. All the images were collected from the open internet, either through Google image search or from YouTube videos. The dataset was carefully curated to be as representative as possible: each image was selected so that it contains a food item placed inside an oven, under oven lighting and at a consistent viewing angle. A few examples of images in the dataset are shown in figure (i).
Microsoft Azure Computer Vision
The Azure Computer Vision module has an image classifier that can identify multiple objects present in the input image. The model is pre-trained on a huge dataset and can successfully identify a large number of natural and man-made objects in images. It is capable of identifying many food items as well, but it is not clear exactly which food items it can recognize. From our experiments it was evident that Azure's model couldn't identify some of the classes in our dataset. To add to its woes, some of our classes were too specific: for example, our dataset has chicken breast, chicken thighs, chicken wings and whole chicken as different classes, while Azure has just a generic 'chicken' class.
Besides, the model is a multi-label classifier that returns more than one label for each image, ideally one label for each object in the image. In reality the results are far from ideal, as demonstrated in figure (ii).
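To make the evaluation concrete, a minimal sketch of how such a multi-label tagger can be queried and scored is shown below. The endpoint path, key handling and function names are our own illustrative assumptions (based on the v3.2 REST tagging API), not code from the article; the scoring rule is the one we used: an image counts as correct only if its exact class name appears among the returned tags.

```python
def tag_image_azure(image_path: str, endpoint: str, key: str) -> list:
    """Sketch of a call to the Azure Computer Vision tag endpoint
    (assumed v3.2 REST API). Requires a valid endpoint and key;
    not executed in this illustration."""
    import requests  # assumed available; only needed for the live call
    with open(image_path, "rb") as f:
        resp = requests.post(
            f"{endpoint}/vision/v3.2/tag",
            headers={
                "Ocp-Apim-Subscription-Key": key,
                "Content-Type": "application/octet-stream",
            },
            data=f.read(),
        )
    resp.raise_for_status()
    return [tag["name"] for tag in resp.json()["tags"]]

def is_correct(true_class: str, returned_tags: list) -> bool:
    """An image counts as correct only if the exact dataset class name
    appears among the tags the service returned."""
    tags = {t.lower() for t in returned_tags}
    return true_class.lower() in tags
```

This scoring rule also shows why fine-grained classes hurt the off-the-shelf model: `is_correct("chicken breast", ["chicken", "oven"])` is False, even though the generic 'chicken' tag was found.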
For the first image the model returns labels that are quite accurate and, more importantly, it successfully identifies the presence of chicken in the image, which counts as a positive outcome. But for the second image the model completely fails to recognize the chicken, a negative outcome. After feeding all 665 images in the dataset to the model, it returned an accuracy of just 3.9%. Even allowing for the fact that Azure lacks 27 of the 46 labels, 3.9% is very poor accuracy. The precision and recall scores of the model, shown in table (i), further establish its weakness. Although precision is very high for most classes, recall is quite poor for all of them. In practical terms this has a major impact: when the model says an image contains chicken it is very often right, but when there actually is chicken in an image, the model very rarely says so.
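The high-precision, low-recall pattern in table (i) can be reproduced with a short evaluation sketch. This is our own illustration of how per-class precision and recall can be computed from multi-label outputs, under one simplification stated in the docstring; the toy data below is invented for demonstration.

```python
from collections import defaultdict

def per_class_precision_recall(samples):
    """samples: list of (true_class, predicted_labels) pairs from a
    multi-label classifier. Returns {class: (precision, recall)}.
    Simplification: any predicted label other than the image's true
    class counts as a false positive for that label."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for true_class, predicted in samples:
        preds = {p.lower() for p in predicted}
        if true_class.lower() in preds:
            tp[true_class] += 1
        else:
            fn[true_class] += 1
        for p in preds - {true_class.lower()}:
            fp[p] += 1
    scores = {}
    for cls in set(tp) | set(fn):  # score only the dataset's own classes
        precision = tp[cls] / (tp[cls] + fp[cls]) if tp[cls] + fp[cls] else 0.0
        recall = tp[cls] / (tp[cls] + fn[cls]) if tp[cls] + fn[cls] else 0.0
        scores[cls] = (precision, recall)
    return scores

# Toy data mimicking the pattern in table (i): the model is always right
# when it says "chicken", but misses most of the chicken images.
samples = [
    ("chicken", ["chicken", "oven"]),  # hit
    ("chicken", ["meat", "oven"]),     # miss
    ("chicken", ["food", "oven"]),     # miss
]
precision, recall = per_class_precision_recall(samples)["chicken"]
# precision is 1.0 while recall is only 1/3 - exactly the
# "often right, rarely says so" behavior described above.
```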
AWS Rekognition
The AWS Rekognition service is similar to the Azure CV module in many respects: it is a multi-label image classifier pre-trained on a huge dataset, enabling it to identify a large number of real-world objects with decent accuracy, right out of the box and without any model training. In contrast to the Azure model, the Rekognition model covers 31 of our 46 classes, which evidently enabled it to perform much better on our dataset, returning an accuracy of 23% - still quite low for practical applications.
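For comparison, a minimal sketch of querying Rekognition and mapping its labels back onto our class list is shown below. The DetectLabels call is the standard boto3 API, but the helper names and the matching rule are our own illustrative assumptions; the live call requires AWS credentials and is not executed here.

```python
from typing import Optional

def detect_labels_rekognition(image_path: str, max_labels: int = 10) -> list:
    """Sketch of an AWS Rekognition DetectLabels call on a local image.
    Requires boto3 and configured AWS credentials; not executed here."""
    import boto3  # assumed available for the live call
    client = boto3.client("rekognition")
    with open(image_path, "rb") as f:
        response = client.detect_labels(
            Image={"Bytes": f.read()}, MaxLabels=max_labels
        )
    return [label["Name"] for label in response["Labels"]]

def map_to_dataset_class(labels, dataset_classes) -> Optional[str]:
    """Return the first returned label that matches one of our 46
    dataset classes (case-insensitively), or None if nothing matches."""
    for label in labels:
        if label.lower() in dataset_classes:
            return label.lower()
    return None
```

For example, `map_to_dataset_class(["Food", "Pizza", "Oven"], {"pizza", "chicken wings"})` returns `"pizza"`, while a generic `"Chicken"` label still fails to match a fine-grained class like `"chicken wings"`.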
Table (ii) shows the precision and recall scores of the model for the 31 classes it could identify. As with the Azure model, precision is quite high for most classes, but unlike Azure, Rekognition also achieves high recall for many of them, establishing it as the stronger of the two off-the-shelf models for this specific use case.
Conclusion
Popular deep learning solutions like the Microsoft Azure Computer Vision module and the AWS Rekognition service can be a good choice for rapid deployment of many real-world computer vision applications. But certain specialized use cases, like identifying raw food items placed in an oven, can make them sweat. In our experiments the Azure CV module and the AWS Rekognition service scored 3.9% and 23% accuracy, respectively, on our raw-food dataset. In contrast, CNN models that we trained ourselves on food images collected from the internet yielded up to 65% accuracy on the same test set we used to evaluate Azure CV and AWS Rekognition. Going for a custom solution provides the opportunity to better curate the dataset, apply task-specific data augmentations and control other aspects of model training, all of which can substantially improve performance.