Testing the OpenAI CLIP Model for Food Type Recognition with Custom Data
Abstract:
In an effort to better understand recent advancements in computer vision, we use custom datasets to compare three machine learning model architectures on their ability to learn and identify complex visual concepts. Specifically, using custom datasets of different food types, we test the performance of two models (zero-shot and linear probe) that leverage a neural network developed by OpenAI with the Contrastive Language–Image Pre-training (CLIP) method, and compare them against a ResNet50 model. We find that the CLIP linear probe model delivers the most accurate results in each instance, by a notable margin, while the relative performance of CLIP zero-shot and ResNet50 depends on the nature of the dataset and the degree of tuning.
By Manav G; Emmy K; Tyler H. L; Chris M; Pankaj S; Aditi S
Motivation:
Recent innovations in computer vision use novel techniques that suggest greater robustness in image recognition, and show promise for increased utility in a variety of applications. In particular, the OpenAI Contrastive Language–Image Pre-training (CLIP) method uses pre-labeled data from the internet, such as Instagram photos accompanied by a descriptive phrase (for example, “this is a photo of my cat”), to build a model with a zero-shot capability: the ability to classify images without training on a labeled dataset (thus, ‘zero-shot’). Impressive and resourceful though this approach may be, the resulting models have thus far been tested largely on well-curated, publicly available datasets. The question we investigate here is the following: does this new method produce models with the characteristics needed to perform reliably in bespoke, real-world applications?
Overview:
In January 2021, OpenAI released a neural network called CLIP (Contrastive Language–Image Pre-training) that learns visual concepts from natural language supervision with a ‘zero-shot’ capability. In the following three months, a team of Machine Learning Research Fellows conducted a variety of computer vision experiments using CLIP to understand its ability to recognize food types under various implementations (zero-shot and linear probe), and compared the results to those delivered by a ResNet50 model.
The research team curated several custom datasets for the research, in addition to using two large publicly available datasets.¹ The team found that CLIP zero-shot performed as predicted by OpenAI (in the paper that introduced the network, Learning Transferable Visual Models From Natural Language Supervision, henceforth ‘the CLIP paper’, and its website companion: https://openai.com/blog/clip/), while CLIP linear probe delivered consistently better results. The performance of CLIP zero-shot was approximately equivalent to that of a trained ResNet50 model, though CLIP achieved this without any task-specific training.
The CLIP Model
Using 400 million publicly available text-image pairs found on the internet, the CLIP network (henceforth, ‘model’) pre-trains an image encoder and a text encoder to make a zero-shot prediction to match images with text (labels) in a dataset. The model computes the feature embedding of an image and the feature embedding of a set of possible texts using their respective encoders. The cosine similarity of these embeddings is then calculated, scaled by a temperature parameter τ, and normalized into a probability distribution via a softmax. This prediction layer is a multinomial logistic regression classifier with L2-normalized inputs, L2-normalized weights, no bias, and temperature scaling.
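The following is a minimal sketch of that prediction step, assuming the image and text embeddings have already been produced by the respective encoders (the function name and the temperature value shown are illustrative, not taken from the CLIP implementation):

import torch
import torch.nn.functional as F

def zero_shot_probs(image_embedding, text_embeddings, temperature=0.01):
    """Probability distribution over candidate text labels for one image.

    image_embedding: tensor of shape (d,) from the image encoder
    text_embeddings: tensor of shape (num_classes, d) from the text encoder
    """
    # L2-normalize so that the dot product equals cosine similarity
    image_embedding = F.normalize(image_embedding, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)

    # Cosine similarities, scaled by the temperature, normalized via softmax
    logits = (image_embedding @ text_embeddings.T) / temperature
    return logits.softmax(dim=-1)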
Models Included for Analysis
CLIP Zero-Shot: CLIP ViT-B/32 using cosine similarity.² The model is implemented with basic prompt engineering; for example, when running the analysis, this model would use the prompt ‘this is a photo of spaghetti’ for food categorized as ‘spaghetti’.
CLIP Linear Probe: CLIP ViT-B/32 with logistic regression performed on CLIP encoded image features.
ResNet50: This research uses two implementations of ResNet50. The first, which we refer to as the ‘baseline’ ResNet50, is a ResNet50 model with standard ImageNet weights (all layers before the model head are frozen) and a single-layer logistic regression output layer as the model head. We also implement a fine-tuned³ ResNet50, again with a single-layer logistic regression output layer as the model head. Illustrative code sketches of the three model configurations follow.
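For the zero-shot model, a hedged sketch of the classification flow using the open-source clip package (https://github.com/openai/CLIP); the class names and image path are placeholders:

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Basic prompt engineering: wrap each class label in a descriptive phrase
class_names = ["spaghetti", "lasagna", "ravioli"]  # placeholder classes
prompts = [f"this is a photo of {name}" for name in class_names]
text_tokens = clip.tokenize(prompts).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)

# Cosine similarity (after normalization), scaled and softmax-normalized
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(class_names[probs.argmax().item()])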
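For the linear probe, a minimal sketch of the setup, assuming scikit-learn is used for the logistic regression (variable and function names are illustrative):

import clip
import torch
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def encode_images(pil_images):
    """Encode a list of PIL images into CLIP feature vectors (no gradients)."""
    with torch.no_grad():
        batch = torch.stack([preprocess(img) for img in pil_images]).to(device)
        return model.encode_image(batch).cpu().numpy()

def fit_linear_probe(train_images, train_labels, test_images, test_labels):
    """Fit logistic regression on frozen CLIP image features; return test accuracy."""
    clf = LogisticRegression(max_iter=1000)  # regularization strength C should be tuned
    clf.fit(encode_images(train_images), train_labels)
    return clf.score(encode_images(test_images), test_labels)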
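For the ResNet50 baselines, a sketch using torchvision; the only difference between the ‘baseline’ and fine-tuned variants is whether the ImageNet-pretrained backbone weights are frozen or updated during training:

import torch.nn as nn
from torchvision import models

def build_resnet50(num_classes, fine_tune=False):
    model = models.resnet50(pretrained=True)  # standard ImageNet weights
    if not fine_tune:
        # 'Baseline' variant: freeze all layers before the model head
        for param in model.parameters():
            param.requires_grad = False
    # Single-layer head; trained with cross-entropy loss this acts as
    # a multinomial logistic regression classifier over the classes
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model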
Datasets
The research team created six custom datasets for the research (and also ran initial analysis using Food101 & iFood). Each custom dataset focused on particular types of food or possessed unique characteristics to help determine the usefulness of CLIP zero-shot for image classification. The basic characteristics of these datasets are included in the following table:
The Custom A dataset consists of images generated to mirror the characteristics of a real-world industry use-case; it serves as a proxy to test model performance for potential commercial application(s). For each class, the research team conducted a manual Google image search, identified and selected appropriate images, saved the URL for the ‘related images’ tab (of the Google results page) of the chosen images, and then used a custom image scraper to save the ‘related images’ for each URL.
The Custom B dataset consists of images scraped from food blogs and Instagram; the classes were determined by simple brainstorming. Two variants of the dataset were tested: curated and un-curated. The un-curated variant of this dataset included generic images; that is, the images may include people, furniture, tableware, and other objects, in addition to the targeted food type. The curated version of this dataset consists of images that do not contain extraneous objects. The dataset was created using largely the same technique as the Custom A dataset, though using the Bing image scraper rather than a custom scraper.
A similar methodology was employed to construct custom datasets ‘C’, ‘D’ and ‘E’. The largest of the custom datasets (Custom C) was created using TasteAtlas.com, a food website that lists the most popular foods in each country. The researchers chose the 100 most popular foods for 16 countries + Africa as the basis for the dataset.⁴ After identifying the most popular foods, the research team scraped approximately 120 images for each food type. The Custom C dataset holds 24,357 images (after pruning) from 257 classes, and makes a special effort to include foods from regions outside of Europe and North America.
The Custom D dataset consists of images of 14 vegetable types classified according to the distinct manner in which they are cut (for example, diced peppers or sliced tomatoes). The Custom E dataset is a reconfiguration of the Custom D dataset; it re-categorizes each image strictly according to the type or shape of cut (for example, dices, slices, cubes, etc).
The Custom F dataset consists of images of food types subjectively determined (by research fellows) to be difficult to identify; its purpose is to gauge the classification performance of the CLIP models against non-expert human judgment in cases a human would find difficult (thus, ‘Exotic Foods’). The food types were identified using the most popular foods in various countries on TasteAtlas.com. These images were scraped using the Google image scraper.
Other than the un-curated version of the Custom B dataset, the images for each dataset were manually inspected and cleaned to ensure consistency with the respective purpose / use-case. The team reviewed the images manually, and discarded images not suited for the purpose. That is, if the image was not a clear depiction of the target food type, or the image included other food types, words, letters, people, utensils, and so forth, it was discarded.
The researchers also removed special characters from labels, and merged similar classes. For instance, many countries have a version of roast chicken, roast pork, bean stew, and other foods (such as different varieties of quiche or pizza) that look nearly identical. The same is true for mooncakes among countries in Asia, where the same food goes by several different names. These cases were identified and merged into a single class.
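A small, hypothetical sketch of this clean-up step; the merge map entries are illustrative examples rather than the full set of merges performed:

import re
import unicodedata

# Illustrative merges of near-identical classes (not the full mapping used)
MERGE_MAP = {
    "beijing duck": "peking duck",
    "yuebing": "mooncake",
}

def clean_label(raw_label):
    """Normalize a scraped label: strip special characters, merge duplicates."""
    label = unicodedata.normalize("NFKC", raw_label).lower().strip()
    label = re.sub(r"[^\w\s()'\-]", "", label)  # drop special characters, keep letters and digits
    return MERGE_MAP.get(label, label)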
Experiments And Results:
As OpenAI noted, task-agnostic zero-shot classifiers for computer vision have not received significant attention thus far. The research performed here provides an opportunity to better understand the CLIP zero-shot model and its performance characteristics. This section explains experiments conducted to do so.
The research team examined the analytic performance of the CLIP zero-shot and linear probe models for each dataset described in the preceding section. Further, to situate understanding of the results within the analytic landscape, the researchers also compared the results of CLIP zero-shot and linear probe to a performance baseline, using various implementations to fit a classifier on the features of a ResNet50 model.
Top-1 Accuracy Analysis:
The following table shows the analytic results for each of the datasets.
As the table of results (above) indicates, the CLIP linear probe model outperforms both CLIP zero-shot and the two ResNet50 models on every dataset for Top-1 accuracy. Further, the baseline ResNet50 model (with a single-layer head) performs better than CLIP zero-shot on three of the six datasets (Custom C, D, and E), while CLIP zero-shot performs better than the baseline ResNet50 on Custom A, B, and F. The fine-tuned ResNet50 (with a single-layer head) performs better than CLIP zero-shot on four of the six datasets (Custom C, D, E, and F). CLIP zero-shot performed better than both ResNet50 models in only two instances, on custom datasets A and B.
In the CLIP paper, the OpenAI authors describe two limitations relevant to this analytic circumstance that may explain why the two ResNet50 models often perform better than CLIP zero-shot. First, CLIP zero-shot struggles, compared to task-specific models, with very fine-grained classification, such as telling the difference between car models, variants of aircraft, or flower species. Second, CLIP zero-shot generalizes poorly to images not covered in its pre-training dataset, and the OpenAI researchers note that “for novel tasks which are unlikely to be included in CLIP’s pre-training dataset, such as classifying the distance to the nearest car in a photo, CLIP’s performance can be near random.”
Distinguishing food types is potentially a more challenging analytic problem than differentiating between car models, aircraft variants, or flower species; as noted previously, many food types appear quite similar, and the differences between similar types are often subtle. Even with a well-curated dataset (such as the Custom C dataset used in this analysis), where the most frequently confused food types were merged into a single category, it can be difficult for a non-expert human to distinguish among the remaining classes.
It is also possible that some food types in the datasets were not included in CLIP’s pre-training dataset. However, we are unable to confirm this potential explanation (for this project, categorizing food types), as we are not privy to CLIP’s pre-training dataset. For that reason, we are unable to determine whether the food images in these custom datasets are truly out-of-distribution for CLIP.⁵ If they are, however, that may explain why the ResNet50 models performed better than CLIP zero-shot, even though the Custom C dataset has more classes and fewer images per class than large public datasets (thereby posing a greater challenge for a model that requires training).
Separately, one unexpected result was the relatively poor performance of CLIP zero-shot on the Custom E dataset, which asked each model to classify images of vegetables according to the distinct manner in which they were cut (slices, dices, etc). Whereas the accuracy of the fine-tuned ResNet50 model improved by nearly 6% when moving from the Custom D to Custom E datasets, and the baseline ResNet50 model improved 1.5%, the classification accuracy for CLIP zero-shot dropped approximately 27%.
The poor performance of CLIP zero-shot on the Custom E dataset is particularly unexpected because the analysis requires merely the recognition of shapes, albeit from differing vegetable types, whereas the Custom D dataset required the model to classify images according to both type of vegetable and type of cut (for instance, sliced carrots). For CLIP zero-shot to struggle, as evidenced by the worsening accuracy when moving from the Custom D dataset to the Custom E dataset, is surprising because the classification task appears to require a less sophisticated distinction.
The improvement in performance of both ResNet50 models (when moving from the Custom D dataset to the Custom E dataset) suggests that classifying the ‘cut’ (shape) of a vegetable (regardless of type) is an easier task than classifying the type of vegetable combined with the type of cut. Moreover, the fact that the accuracy of CLIP linear probe decreased by only 1% when moving from the Custom D dataset to the Custom E dataset suggests that the CLIP encodings (features) contain sufficient information to distinguish among the categories when performing logistic regression.
It is not clear why CLIP zero-shot struggled more with the Custom E dataset than with Custom D. From the perspective of machine learning, however, recognizing the cuts (shapes) in this instance may require a higher level of abstraction because the same shapes are represented across differing vegetable types (hence, colors and textures).
Top-2 Accuracy Analysis
The following table shows the analytic results for each of the datasets.
Just as in the case of Top-1 accuracy, the CLIP linear probe model outperforms CLIP zero-shot and both ResNet50 models on every dataset for Top-2 accuracy. The results also indicate, as we saw previously, that the performance hierarchy between CLIP zero-shot and the ResNet50 models depends on the dataset.
The baseline ResNet50 model (with a single-layer head) performs better than CLIP zero-shot on the same three of the six datasets (Custom C, D, and E), while the fine-tuned ResNet50 model performs better than CLIP zero-shot on Custom C, D, E, and F. CLIP zero-shot performs better than both ResNet50 models only on custom datasets A and B. The fine-tuned ResNet50 performed better than the baseline ResNet50 in every instance.
This analysis employs the same approach as the Top-1 analysis: in the first configuration, we used a baseline ResNet50 model with a single-layer head, freezing all layers before the model head rather than fine-tuning the ImageNet weights. In the second configuration, we fine-tuned the ImageNet weights of a ResNet50 model and again used a single-layer head for classification. This provided insight into the change in performance that results from fine-tuning.
As expected, the accuracy percentages for the Top-2 analysis are higher than for Top-1, and the variance between models is generally smaller. One exception remains the performance of CLIP zero-shot on the Custom E dataset (56.15%), which is approximately 35% below the next lowest scoring model (91.45%, delivered by the baseline ResNet50). It is not apparent why the accuracy score of each of the other models either improved or held steady when moving from the Custom D dataset to the Custom E dataset, while the performance of CLIP zero-shot worsened by nearly 30%.
To reiterate, the Custom D and Custom E datasets are the same dataset, except with different labels; for instance, whereas ‘sliced carrots’ would be an individual category in the Custom D dataset, those images are re-labeled as simply ‘slices’ in the Custom E dataset, and grouped (by shape) with other types of sliced vegetables. That CLIP zero-shot would struggle with this classification—a seemingly straightforward abstraction—is puzzling, particularly when logistic regression operates on the same encoded data, and doesn’t suffer the same degradation in performance.
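For reference, Top-k accuracy as used throughout this section counts a prediction as correct whenever the true label appears among the model's k highest-probability classes; a minimal sketch:

import torch

def top_k_accuracy(probs, labels, k=2):
    """probs: (num_samples, num_classes) predicted probabilities;
    labels: (num_samples,) integer class indices."""
    top_k = probs.topk(k, dim=-1).indices                  # (num_samples, k)
    correct = (top_k == labels.unsqueeze(-1)).any(dim=-1)  # true label in top k?
    return correct.float().mean().item()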
Accuracy Analysis with Enhanced Prompt Engineering:
The research team also created several variants of the Custom C dataset based on TasteAtlas.com. In an effort to achieve greater analytic accuracy, this dataset was reconfigured and pared down on several occasions after it was created. That is, the research team created different labeling schemes for the food categories to test whether different categorizations would yield better or different results.
The refinement of this dataset included the use of ‘fine’ labels associated with a corresponding broader category. This enabled the use of enhanced prompt engineering (for CLIP zero-shot); that is, associating one type of food with a larger category to which it belongs. For example, the food type labeled ‘lasagna’ would be associated with the broader category ‘pasta’. Rather than seek to have the model classify an image representing the food type ‘lasagna’, CLIP zero-shot would instead use the prompt “this is a photo of a lasagna, a type of pasta”. The results of this analysis are included in the following table.
As the table indicates, the accuracy of the CLIP zero-shot model improves notably with the use of enhanced prompt engineering. The Top-1 accuracy improved approximately 12%, while the Top-2 accuracy improved by roughly 9%. For both the Top-1 and Top-2 analysis, enhanced prompt engineering allowed the CLIP zero-shot model to perform better than the baseline ResNet50 model, though still not as well as the fine-tuned ResNet50.
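A sketch of how such prompts can be constructed; the fine-to-coarse mapping shown is a small hypothetical excerpt, not the full labeling scheme used for the Custom C variants:

# Hypothetical excerpt of the fine-label to broader-category mapping
FINE_TO_COARSE = {
    "lasagna": "pasta",
    "croissant": "pastry",
    "empanadas": "stuffed pastry",
}

def build_prompt(fine_label):
    """Build an enhanced prompt; fall back to the basic prompt if no category exists."""
    coarse = FINE_TO_COARSE.get(fine_label)
    if coarse is None:
        return f"this is a photo of {fine_label}"
    return f"this is a photo of a {fine_label}, a type of {coarse}"

# The resulting prompts are tokenized and encoded exactly as in the basic
# zero-shot setup; only the prompt text changes.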
Concluding Remarks
The results of the experiments conducted are largely consistent with what was predicted by OpenAI. We find that the CLIP linear probe model delivers the most accurate results, while the performance hierarchy of CLIP zero-shot and the ResNet50 models depends on the nature of the dataset and the degree of tuning. While CLIP linear probe delivered consistently better results, the performance of CLIP zero-shot was generally (though not always) less accurate than that of a fine-tuned ResNet50 model. These results suggest mixed analytic success for CLIP zero-shot with respect to image classification, although they reveal promise for the linear probe.
Just as OpenAI found, the performance of CLIP zero-shot is on average competitive with the simple supervised baseline: a ResNet50 model with a logistic regression classifier as the model head. In the CLIP paper, the authors show that CLIP zero-shot outperforms such a ResNet50 baseline on 16 of 27 public datasets, and that the difference in accuracy was as large as 37.1%. In the analysis conducted here, the accuracy differential also varied widely; though the performance of CLIP zero-shot was comparable to the baseline ResNet50 model, the fine-tuned ResNet50 model performed slightly better.
Although using larger ResNet architectures would likely improve performance (of the ResNet models), as would a multi-layer perceptron as the model head, the high accuracy and robust performance of the CLIP linear probe model nonetheless suggests it has favorable characteristics to perform reliably in real-world applications. Particularly for CLIP linear probe, the analysis conducted on custom datasets in this project demonstrated its ability to learn and identify complex visual concepts. The analytic innovations in the CLIP technique could have utility in many image recognition and computer vision applications.
Applications Developed
In addition to performing this analysis, the research team also created and deployed Streamlit apps, as well as a custom-designed app, to employ the methods created and showcase the results.
Appendix A:
Classes for ‘Custom C’ [TasteAtlas] Dataset:
aji de gallina, almendrados, apple pie, arroz con leche, ayam penyet, babi panggang, bacon, bagel, baguette, bakaliaros, baked ziti, bamies, banana bread, basil pie (vasilopita), batagor, beef steak, beef wellington, beets, beggar's chicken, beijing duck (peking duck), biber dolması, bindaetteok, bird's milk cake (ptichye moloko), biscuits, bolillo, bougatsa, bow tie pasta (farfalle), boyoz, bread pudding, brioche, broccoli, brownies, bruschetta, brussel sprouts, burnt cream (crema catalana), cabrito, cachopo, calzone, canelé, cannoli, carrots, cassata, cauliflower, chashu, chateaubriand, chicken and welsh onion skewers (negima yakitori), chicken breast, chicken katsu (tori katsu), chicken nuggets, chicken parmigiana, chicken tenders, chicken wing, chinese bbq pork (char siu), chistorra, chocolate chip cookie, chocolate soufflé (soufflé au chocolat), chouquette, churros, ciabatta, ciambella, clafoutis, cochinita pibil, cod fillet, cookie dough, corn on the cob, cornbread, coulibiac, croissant, croque-monsieur, croquettas (croquetas), crostata, cupcake, dacquoise, dara thong, egg tart, eggplant parmesan, empanadas, enchiladas, ensaïmada de mallorca, escalivada, escargot, etli ekmek (turkish meat flatbread), falafel, fanouropita, fish fry, fish-shaped pastry with red beans (bungeoppang), flamiche, flores de hojaldre, focaccia, french butter cookie (sablé), french fries, fried chicken, fried duck (bebek goreng), fudge, galaktoboureko, galatopita, galette des rois, gaziantep baklavası, giouvetsi, gnocchi, gorengan, graham cracker, gratin, gratin dauphinois, greek green beans (fasolakia), greek rusk (paximadi), green beans, grissini, hawawshi, hot chicken, indonesian fried chicken (ayam goreng), japanese croquettes (korokke), japanese sponge cake (kasutera), kabak tatlısı, kale, kalitsounia, karaage, karydopita, kebab, keftedakia, kemalpaşa, key lime pie, kibbeh, king's bread (roscón de reyes), kleftiko, kolokithopita, kourabiedes, kreatopita, kue putu, kumpir, lamb chops, lamb shank, lasagna, lava cake (molten chocolate cake), little meats (carnitas), longaniza, mac and cheese, macarons, madeleines, manakish, marranitos, mashed potatoes, matbucha, meat balls, meatloaf, medovik, migas, milk tart (melktert), mille feuilles, milopita, mooncake, moroccan flatbread (khobz), mozzarella sticks, muffins, mushrooms, mussels, napolitana de chocolate, new york style cheesecake, new york-style pizza, oncom, pachamanca, pain au chocolat, pain aux raisins, panettone, paris-brest, pastiera, pastilla, pastitsio, patarashca, peanut, pecan pie, penne, peppers, pestil, pinchitos, pirog, pirozhki, pisang goreng, pita bread, pizza, pizzette, pollo asado, popover, porchetta, porterhouse steak, portokalopita, pot pie, potatoes, pozharsky cutlet, prime rib, profiterole (choux), pulled pork, quesadilla, queso fundido, quiche, quiche au fromage, quiche lorraine, rabas, ramadan bread (ramazan pidesi), rasstegai, ratatouille, ravioli, red velvet cake, rendang ayam, revani, ribeye, rice crackers (senbei), risoles, roast pork shoulder (pernil), roast squab, roasted potatoes, rocoto relleno, russian apple cake (sharlotka), salmon, sfincione, shaobing, shrimp saganaki (garides saganaki), sicilian pizza, siu yuk, smetannik, soufflé, soutzoukakia smyrneika, soy sauce chicken, split belly (karnıyarık), steak, stromboli, suckling pig (cochinillo), surf and turf, sushki, sweet potato, taiyaki, tamal, tarta de santiago, tarte flambée (french flatbread), tater tots, tempeh, teriyaki, thot man kung, tilapia fillet, tiropita, torrijas,
tsoureki, tteokgalbi, tuna steak, turkey breast, vatrushka, waffle, whole chicken, whole turkey, wingko, xianbing, yakitori, yemista, yumurtalı pide (turkish flatbread), zucchini, éclair (choux), şekerpare
Classes for ‘Custom F’ [Exotic Foods] Dataset:
‘Aloco’, ‘Barbagiuan’, ‘Bigos’, ‘Borscht’, ‘Farikal’, ‘Feijoada’, ‘Goulash’, ‘Jambalaya’, ‘Kibbeh’, ‘Larb’, ‘Moussaka’, ‘Nihari’, ‘Pavlova’, ‘Pelmeni’, ‘Plov’, ‘Roesti’, ‘Sauerbraten’, ‘Stamppot’, ‘Succotash’, ‘Tagine’
Notes:
1 The two large publicly available datasets are Food101 and iFood.
2 The results included here are produced by CLIP ViT-B/32, a smaller version of the CLIP zero-shot model used by OpenAI in the paper that introduced the CLIP model.
3 ResNet50 model with fine-tuned ImageNet weights.
4 Appendix A includes a list of the classes for the Custom C dataset.
5 Moreover, time and resource limitations prevented determining whether the images in the custom datasets created for this project might also be included in datasets such as iFood, Food101, or ImageNet.