Google’s New AI: OpenAI’s DALL-E 2, But 10X Faster!
Today we are going to see progress in text to image research that is so incredible it hardly seems believable. So, what is text to image? Simple – these are AI-based techniques where a text prompt from us goes in, and a beautiful image comes out.
There is already a large set of techniques that can perform text to image really well. For instance, OpenAI’s DALL-E 2 can do it, where we wait approximately 10 seconds for each image. Or, we can even run it on our own hardware with the free and open source Stable Diffusion.
So, are we done? What else is there to invent here? Why write more papers? Well, by the time you finish this video, I hope you will agree with me that the only possible answer is: my goodness, there is so much to be done and so much that has just been improved.
For instance, this new technique from Google that they call Muse can perform mask-free editing too. What is that? To be able to appreciate what that is, let’s look at what mask-based editing looks like. Imagine that we have a wonderful photo here, but we would like to change the location of the background. No matter, let’s just highlight this region, which we will call a mask, and ask for a different background. And, yes, it can repaint the image as if it were in New York, Paris, or San Francisco. Wonderful. This is image inpainting by using a mask and a text prompt.
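To make the role of the mask concrete, here is a minimal sketch of the final compositing step in mask-based editing. It assumes tiny "images" stored as nested lists, and the `generated` grid is only a hypothetical stand-in for whatever the model paints from the text prompt; none of these names come from the Muse paper.

```python
def composite_with_mask(original, generated, mask):
    """Keep the original pixel where the mask is 0, take the generated pixel where it is 1."""
    return [
        [gen if m else orig for orig, gen, m in zip(o_row, g_row, m_row)]
        for o_row, g_row, m_row in zip(original, generated, mask)
    ]

# Toy 2x2 "image": the left column is the subject, the right column is the background.
original = [["cat", "sky"], ["cat", "sky"]]
# Hypothetical model output for the prompt "in Paris" (an assumption for illustration).
generated = [["dog", "Paris"], ["dog", "Paris"]]
# The mask we highlighted: only the background column may be repainted.
mask = [[0, 1], [0, 1]]

result = composite_with_mask(original, generated, mask)
```

Only the highlighted region changes, so the subject stays exactly as it was, no matter what the model proposed there.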
And now, let’s do this, but without a mask! What? How is this even possible? Well, we can tell the AI, come on, you are smart enough to know where these objects are and what they are, so, you do the masking yourself. Automatically. So, can it? Let’s see.
There is a cake in this image, and a coffee latte. But, psst! Don’t tell the AI where they are. Confidential information. Let it find out itself. Now, little AI, please change the cake to a croissant, and, if it even exists, who knows, the latte art should form a flower. And… wow. Look at that. Who could say that this was not the original image? It has done an absolutely amazing job. And, I am a light transport researcher by trade, so I cannot resist mentioning that the specular highlights on the new plate are also excellent. Good job, little AI!
But its mask-free editing capabilities can do even better. Look. We can change our clothes super easily, even with text. Remember, synthesizing text properly was quite difficult for previous techniques, and this just does it easily. So good!
Now, get this, because we can also do this with drawings, and if we do, something amazing happens. Look. We can start out from a crude drawing of a cat, and ask it to morph into other animals. And through this, it can become a dog, or a pig, a raccoon, or even other animals.
This is possible because this new technique is not your usual diffusion-based process like many previous image generators. What does that mean? It means that it does not start out from noise and does not reorganize this noise to get a coherent image. It does not think in terms of noise at all.
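So what does it think in instead? As the Muse paper describes it, the image is a grid of discrete tokens that all start out masked and get filled in over a handful of parallel steps. Here is a toy sketch of that idea. The random `toy_predict` function is a hypothetical stand-in for the real transformer, and the even commit schedule is an assumption made for simplicity, not the schedule from the paper.

```python
import random

MASK = None  # placeholder for a token that has not been decided yet

def toy_predict(tokens, rng):
    """Hypothetical stand-in for the model: propose (token, confidence) for every masked slot."""
    return {
        i: (rng.randrange(16), rng.random())
        for i, tok in enumerate(tokens)
        if tok is MASK
    }

def parallel_decode(num_tokens=16, steps=4, seed=0):
    """Fill a fully masked token list in a few steps, committing the most confident guesses first."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    for step in range(steps):
        proposals = toy_predict(tokens, rng)
        if not proposals:
            break
        # Commit an even share of the remaining slots; the last step commits everything left.
        remaining_steps = steps - step
        keep = -(-len(proposals) // remaining_steps)  # ceiling division
        ranked = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)
        for i, (tok, _confidence) in ranked[:keep]:
            tokens[i] = tok
    return tokens

decoded = parallel_decode()
```

Many tokens are decided in each step, so the loop runs only a few times. That is the intuition behind the short generation times, although the real speedup depends on the actual model.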
But, we are not done yet, not even close! It can also perform image outpainting. That is, taking this part of the image and replacing the entirety of the image around it using a text prompt. Travel around the world with just one text prompt. So cool!
Now, have a look at these images and their prompts. Great works, right? But, there is something that ties them together. Do you know what? Well, hold on to your papers, because all of these images took just approximately one second to generate. That’s right, this is up to 10 times faster than previous techniques! And all this less than a year after DALL-E 2 was published. That is insanity.
And it can also perform things that previous techniques had a great deal of trouble with. What are those? Well, two examples, cardinality and composition. Cardinality means that if we ask for three elephants standing on top of each other, we really get three. If we ask for four bottles of wine, we get four. And if we ask for 10 bottles of wine, wait a minute…yes, apparently not even this technique is perfect.
It also does well when it comes to composition. If we ask for the two baseballs to be to the left of the tennis balls, the AI understands that and thus, they will likely end up being there.
However, I also love how it combines all of these concepts together. For instance, here, we can do mask-free editing while keeping the composition of the original image the same. This way, we can transform our cat into a dog, change a small basketball into an American football, or make our cat yawn, or even change these flowers. And note that we did not need a mask for this, no highlighting regions where the cats and roses are, we just write what we want and the AI does it! Also note once again that the composition of the original image remains intact. I absolutely love this. Such an amazing tool, and now, a really fast one too. One second for each of these? Sign me up right now!
And just imagine that two more papers down the line, perhaps all this will be possible to do in real time. We might be able to create little virtual worlds at the speed of thought. How cool is that! What a time to be alive!
So, what do you think? What would you use this for? Let me know in the comments below!
Thanks for watching and for your generous support, and I’ll see you next time!