Give free rein to creativity using AI to create images from text
By Matteo Pilotto, Matt K, Anush P, Parth R, Jash R, and Mahzad K
Introduction
On January 5, 2021, our lives were forever changed. We witnessed the proverbial avocado armchair.
What is so spectacular about that image, other than the fact that it’s an avocado-chair hybrid, is that it was not made by man, but by machine. That machine’s name is DALL-E. Type in anything that you wish to see and DALL-E will kindly generate dozens of variants of whatever your wildest imagination can dream up.
DALL-E was created by OpenAI. Despite its name, there is little about this model that is “open”. OpenAI, as of today, has kept this magical machine all to themselves. That’s why today there is a small but active regiment of open-source warriors who are trying to create their own. This is the story of how our small troop of junior AI enthusiasts adopted this cause as our own, our trials and tribulations, and ultimately how we created our very own DALL-E clone.
Finally, before starting, we would like to give a huge shout-out to the whole ML community, because most of our work is built on top of the efforts of men and women far more talented than us who decided to share their creations on the internet, giving newbies like us the chance to take our first steps in this crazy field that advances at the speed of light.
Among these active warriors we would like to directly thank:
- lucidrains, janEbert, afiaka87, robvanvolt, rom1504 and all the other contributors for creating DALLE-pytorch, an open-source implementation of DALL-E
- the research group at Tsinghua University for sharing CogView
- tgisaturday for the work done with dalle-lightning, refactoring of DALLE-pytorch for TPU VM
- Ryan Murdock for conceiving DeepDaze, BigSleep and finally LatentVision
Without their contributions, and those of many others, we would probably not have achieved anything.
Overview of DALL-E
DALL-E is a 12B parameter decoder-only transformer capable of generating original images based on any input text without additional training (zero-shot learning). The model was trained on a massive dataset of 250 million text-image pairs with the objective of autoregressively modelling text and image tokens as a single stream of data.
However, it would be too simplistic not to mention two important additional ingredients that make this model work so well: the dVAE and CLIP.
In the context of DALL-E, the dVAE provides the image tokens: it compresses each 256×256 image into a 32×32 grid of discrete tokens drawn from a vocabulary of 8192 codes. These tokens are not only less burdensome to process than raw pixels, but they also allow the transformer to learn how to draw entire patches of an image at once rather than generating the illustrations pixel by pixel.
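OpenAI released the dVAE weights separately from the rest of the model, so this part can actually be played with. Here is a minimal sketch using the `dall_e` package and the checkpoint URL from OpenAI's published examples (the image path is just a placeholder):

```python
# pip install DALL-E
import torch
import torchvision.transforms as T
from PIL import Image
from dall_e import load_model, map_pixels

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the released dVAE encoder (checkpoint URL from OpenAI's dall_e examples).
enc = load_model("https://cdn.openai.com/dall-e/encoder.pkl", device)

# Resize/crop to 256x256 and map pixel values into the range the dVAE expects.
preprocess = T.Compose([T.Resize(256), T.CenterCrop(256), T.ToTensor()])
img = Image.open("armchair.jpg").convert("RGB")  # placeholder image
x = map_pixels(preprocess(img).unsqueeze(0).to(device))

# The encoder predicts, for every cell of a 32x32 grid, one of 8192 codes.
z_logits = enc(x)
image_tokens = torch.argmax(z_logits, dim=1)  # shape: (1, 32, 32)
print(image_tokens.shape)
```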
CLIP is a pretrained contrastive model that was announced simultaneously with DALL-E but, unlike the latter, was open-sourced. In simple terms, this model measures how well an image and a text match. In the DALL-E pipeline, it is used to rerank the generated samples, keeping the candidates that best represent the input text.
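To make the reranking idea concrete, here is a minimal sketch with OpenAI's open-source `clip` package; the candidate file names are made-up placeholders for generated samples:

```python
# pip install git+https://github.com/openai/CLIP.git
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

prompt = "an armchair in the shape of an avocado"
candidates = ["gen_0.png", "gen_1.png", "gen_2.png"]  # placeholder generated samples

# Embed the prompt and the candidate images in CLIP's shared space.
text = clip.tokenize([prompt]).to(device)
images = torch.stack([preprocess(Image.open(p)) for p in candidates]).to(device)

with torch.no_grad():
    text_emb = model.encode_text(text)
    image_emb = model.encode_image(images)
    # Cosine similarity: the higher the score, the better the match.
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    scores = (image_emb @ text_emb.T).squeeze(1)

print("best match:", candidates[scores.argmax().item()])
```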
DALL-E seems to be extremely “intelligent” and excels in many respects. Among its numerous capabilities, the one we found most interesting (and comical) is its ability to create new images by mixing up elements of unrelated concepts. For example, in the OpenAI blog post we can admire some cool images of a "snail made of harp" and "an armchair in the shape of an avocado", which undoubtedly became two of the most iconic images generated by the model and instilled in many ML practitioners around the world, including us, the burning desire to build an open-source version of it.
We would also be curious to test its talent for creating simple illustrations of animals, objects, and emojis, which in many cases could easily be mistaken for drawings by an imaginative child or even an artist.
We dream of a world where a pretrained DALL-E-like model is easily accessible from any smartphone, and where people of all backgrounds express themselves and their ideas through unique illustrations generated by the model.
During these four months DALL-E has been our north star, but we didn’t disregard alternative solutions. In fact, we started our journey with BigSleep.
First experiments with BigSleep and GeNeVA
When we first started this adventure, we thought the most natural thing to do was to check which solutions were already available out there. One of the first architectures to catch our attention was BigSleep, a model created by Ryan Murdock by combining BigGAN and CLIP. Even though the quality of the generated images was miles away from OpenAI's DALL-E, the results were already quite impressive for a first try. The biggest problem (other than quality) with this method was that it took hours to get to a final result. What we wanted was an implementation capable of generating high-quality images in under a minute, and with BigSleep this just wasn't the case.
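For reference, launching BigSleep through lucidrains' big-sleep package is as simple as the snippet below (the argument names follow the package's README and may change between versions); simple to start, but be prepared to wait:

```python
# pip install big-sleep
from big_sleep import Imagine

# BigSleep optimizes BigGAN's latent vector so that CLIP scores the
# generated image higher and higher against the text prompt.
dream = Imagine(
    text="an armchair in the shape of an avocado",
    lr=0.07,             # learning rate of the latent optimization
    save_every=100,      # save an intermediate image every 100 steps
    save_progress=True,  # keep the intermediate images on disk
)
dream()
```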
Our next stop was GeNeVA, which is more a task of its own than a model. With GeNeVA the authors went beyond single-step generation, creating a model capable of generating images iteratively based on a stream of text inputs. This approach seemed highly promising because it involves ongoing interaction between the user and the model, a solution that we thought could make the long inference time more enjoyable. Probably because most of us were just getting started with coding, getting the model up and running in a Colab notebook was quite a challenge.
After spending almost two weeks dealing with the most diversified carousel of error messages to get the model up and running, our newfound enthusiasm quickly melted away when we saw the awful images generated after training the model for more than 200 epochs. As if that alone wasn't bad enough, we also started realizing how difficult it would have been to come up with a custom dataset that satisfied the iterative structure of text and images required by the model.
First success: Image scraper with style-transfer
For our baseline we were looking for something so simple that it could be implemented and tested within a week. After brainstorming various ideas, we thought: "if we want an avocado armchair, why don't we just google it?" This is how the “scraper style-transfer” idea came to be. Essentially, our architecture is made of three simple elements: an image scraper, CLIP, and a pretrained style-transfer model.
We took a ready-to-use image scraper from fastai to collect images from DuckDuckGo based on any input prompt. Then we passed the resulting images through CLIP to find the ones that best matched the search query. Finally, we applied a style-transfer algorithm with the desired style and presented the stylized image to the user. Despite its simplicity, it works surprisingly well! Check it out for yourself (notebook).
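Sketched as code, with the scraper, the CLIP ranking, and the style-transfer model reduced to hypothetical placeholder functions (the real notebook wires in fastai's DuckDuckGo scraper, the `clip` package, and a pretrained style-transfer network), the whole baseline is just three steps:

```python
from PIL import Image

def scrape_images(query: str, n: int = 50) -> list[Image.Image]:
    """Hypothetical placeholder: download n candidate images for `query`
    from DuckDuckGo (our notebook reuses fastai's ready-made scraper)."""
    raise NotImplementedError

def clip_rank(prompt: str, images: list[Image.Image], top_k: int = 3) -> list[Image.Image]:
    """Hypothetical placeholder: score each image against `prompt` with CLIP
    and keep the top_k best matches."""
    raise NotImplementedError

def style_transfer(image: Image.Image, style: str) -> Image.Image:
    """Hypothetical placeholder: apply a pretrained style-transfer model."""
    raise NotImplementedError

def generate(prompt: str, style: str = "watercolor") -> list[Image.Image]:
    candidates = scrape_images(prompt)                   # 1. scrape
    best = clip_rank(prompt, candidates)                 # 2. rerank with CLIP
    return [style_transfer(im, style) for im in best]    # 3. stylize
```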
After more than a month of hard work, we finally had a working solution that, in its own way, got the job done. It generated cool images to share with the people we care about. Clearly, we were still miles away from the unique "wow factor" that only a model like DALL-E is capable of delivering, but we felt on top of the world 🏔️ because that was our first success.
An unexpected ally: CogView
It was the end of June and we were just starting to brainstorm ideas on how to turn our simple idea into an actual web app, when out of the blue a team of researchers from Tsinghua University released CogView, a smaller but open-source version of DALL-E trained on a large corpus of Chinese text-image pairs.
At that time, we had essentially put aside the idea of working on a "real" text-to-image model. We didn't have access to the infrastructure necessary to train a model with more than 100M parameters, let alone 12B, and we had no experience or expertise in training such a large model. Now, with CogView, we had a pretrained model, which meant we were back on track to get our avocado armchairs.
This model had everything we could have hoped for, except for two small details: it required texts in Chinese, and some of the generated images contained annoying watermarks. To make the model accept English text, we added a simple English-to-Chinese translation step. Undoubtedly, some linguistic nuances were lost in the process, but this simple solution worked well, so we decided to keep it.
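As an illustration of how cheap this step can be (the checkpoint below, `Helsinki-NLP/opus-mt-en-zh` from Hugging Face, is shown only as an example and is not necessarily the exact translator we used):

```python
# pip install transformers sentencepiece
from transformers import pipeline

# Off-the-shelf English-to-Chinese translation model.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-zh")

prompt = "an armchair in the shape of an avocado"
zh_prompt = translator(prompt)[0]["translation_text"]
print(zh_prompt)  # this Chinese prompt is what gets fed to CogView
```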
For the watermarks, we first tried various denoising models to remove the irritating text without destroying the images, but none of the solutions we tested worked well. So we trained a simple binary watermark classifier to filter out images with watermarks. Using fastai, we fine-tuned a resnet50 on a very small dataset (fewer than 500 instances) of images generated by the model itself and... BOOM! 💥 In fewer than 10 lines of code we had a classifier capable of correctly classifying 9 out of 10 images. Again, despite its simplicity, a naïve approach solved our problem.
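For the curious, here is a representative sketch of what those few lines look like with fastai (the folder layout and hyperparameters are illustrative, not the exact ones from our notebook):

```python
from fastai.vision.all import *

# Illustrative layout: generated images manually sorted into
# watermark/ and clean/ subfolders inside cogview_samples/.
path = Path("cogview_samples")
dls = ImageDataLoaders.from_folder(
    path, valid_pct=0.2, seed=42, item_tfms=Resize(224))

# Fine-tune an ImageNet-pretrained resnet50 on the tiny labelled set.
learn = cnn_learner(dls, resnet50, metrics=accuracy)
learn.fine_tune(4)

# Use the classifier to filter a freshly generated image.
pred, _, probs = learn.predict(PILImage.create("new_generation.png"))
print(pred, probs)
```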
If we have aroused your curiosity and imagination, you can play with the demo we prepared for you (notebook).
CogView isn't enough: training our own model
Despite its few drawbacks, CogView was more than we could have asked for, but it wasn't enough anymore. When we started this journey we barely knew how to clone a repo from GitHub and everything seemed like an insurmountable obstacle, but every challenge we faced increased the confidence in our own capabilities. Obviously, in this short period of time we didn't turn into ML experts, but we were far better versions of ourselves than three months ago.
With this boost in confidence and the great results achieved by other programmers, we decided it was time to train our own model. Our original plan was to refactor the CogView code to support TPUs, with the idea of taking advantage of the TPU Research Cloud (TRC) program offered by Google. After two weeks without much progress we realized we had bitten off more than we could chew. Providentially, a few days later we found a TPU implementation of DALL-E. Once again, the contributions of the community kept us moving forward.
A call to arms!... of avocado chairs & other cool resources
We have finally reached the present: these days we are scraping text-image pairs to build our custom dataset. Once that is completed, we plan to dedicate a month or so to training our model using the TRC program. So stay tuned for the eventual release of our open-source (English) proverbial avocado armchair machine.
If you have reached this point, we hope you enjoyed reading this, but more importantly we hope we have sparked in you the desire to join the battle for a world of unlimited fruit-furniture marvels.
Throughout this journey we came across peculiar ideas that we didn't manage to explore in depth or contribute to substantially. Nevertheless, we encourage you to try them out, because they are really cool:
- StyleGAN-NADA by Rinon Gal et al. blends CLIP and StyleGAN2 together to stylize images in an incredible way. We created a notebook with two additional pretrained models (metfaces and cifar10).
- DALL-E mini by borisdayma et al. is one of the most successful replicas of DALL-E so far
- CLIP Guided Diffusion by Katherine Crowson combines OpenAI’s diffusion models and CLIP to generate artistic illustrations. If you need some inspiration, check out her Twitter page.
- Latent Vision by Ryan Murdock is another excellent architecture to generate picturesque images this time leveraging CLIP and VQGAN.
We will leave you with a superb quote that we feel summarizes these four months exceptionally well:
“Those who do not want to imitate anything, produce nothing.”
― Salvador Dalí