I’m sure you’ve seen the Ghibli-style images generated with ChatGPT all over the internet. They are neat, but there are so many other incredible capabilities unlocked by this release. Transformer models understand images in a new way that avoids many of the traps of diffusion models. In this article I compare it with the previous state of the art, investigate how this new model works, and show some new things we can do now.
I have been playing with diffusion models — the model architecture behind all the generative art since Dall-e was released in early 2021 (until last week, anyway). I’ve never used any of the images commercially or in place of paying a human artist; my interest is in the technology of diffusion itself. It’s a fun hobby!
What I find most interesting is how diffusers are trained. Start from a labeled image, then add random color pixels to it until the image is unrecognizable. Once you’ve done that a billion times, the model has learned how to most efficiently remove the information in an image by covering it with “noise.” Because that’s represented in math, you can now do the math backwards: start from random pixels, and remove the noise by correcting each pixel until the image matches what that image’s label (prompt) would have been, had it existed in the training data. Wild!
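If you want that intuition in code form, here is a toy sketch in Python/NumPy. The noise schedule, the update rule, and the `model` denoising network are all simplified stand-ins of my own; real samplers like DDPM and DDIM derive this math much more carefully.

```python
import numpy as np

def add_noise(image, t, num_steps=1000):
    """Forward process (training data): blend the image with Gaussian noise.
    At t=0 the image is untouched; at t=num_steps it is pure noise."""
    alpha = 1.0 - t / num_steps                    # fraction of the original that survives
    noise = np.random.randn(*image.shape)
    noisy = np.sqrt(alpha) * image + np.sqrt(1.0 - alpha) * noise
    return noisy, noise                            # the network learns to predict `noise`

def generate(model, prompt, shape, num_steps=1000):
    """Reverse process (sampling): start from random pixels and gradually
    remove the noise the network predicts, conditioned on the prompt."""
    x = np.random.randn(*shape)
    for t in reversed(range(num_steps)):
        predicted_noise = model(x, t, prompt)      # hypothetical denoising network
        x = x - predicted_noise / num_steps        # small step toward a clean image
        # real samplers use a carefully derived update and re-inject a bit of
        # noise at each step; this loop is only the intuition
    return x
```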
I spent a couple hours using previous diffusion tools to Ghiblify this image of my wife and me.
We’ve had remarkable progress since the first release of Dall-e! The models have gotten much better, and tools have too.
And now even Flux seems terrible in comparison to GPT-4o’s native image generation. It is not a diffuser; let’s explore how it works.
GPT-4o is a multi-modal transformer model. It is the default text model behind ChatGPT, and it’s also what powers Advanced Voice Mode. It could already understand images, and last week OpenAI opened up the capability to generate them. As with diffusers, generating an image is just understanding an image in reverse, but until now no one had managed a practical implementation.
Researchers at Meta developed the Chameleon transformer with image capabilities last May. I used a fine-tune of it called Lumina-mGPT to create this monstrosity, and it took hours:
Diffusers remove noise from all parts of the image simultaneously. GPT-4o instead produces one 8×8 block of pixels at a time: left-to-right, then top-to-bottom. You can think of it as having turned the image into a page of text. Each bit of the image depends on the bits of the image already produced, just as LLMs produce words that depend on the previous words in the conversation. Both bits of images and chunks of words are tokens to the model, and the same transformer works across them both (as well as audio).
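Here is a rough sketch of what that raster-order loop looks like. The 32×32 token grid, the greedy decoding, and the `model` call are my own assumptions for illustration; the actual vocabulary, grid size, and decoder are not public.

```python
import numpy as np

def generate_image_tokens(model, prompt_tokens, grid_h=32, grid_w=32):
    """Sample image tokens one at a time, left-to-right then top-to-bottom,
    exactly like next-word prediction. `model` maps a token sequence to
    logits over a shared text+image vocabulary (a hypothetical stand-in)."""
    sequence = list(prompt_tokens)       # text and image tokens live in one sequence
    grid = np.zeros((grid_h, grid_w), dtype=int)
    for row in range(grid_h):
        for col in range(grid_w):
            logits = model(sequence)              # condition on everything produced so far
            next_token = int(np.argmax(logits))   # greedy pick; real models sample
            sequence.append(next_token)
            grid[row, col] = next_token
    return grid   # each token is later decoded back into a small patch of pixels
```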
Occam’s Razor suggests that we should believe the text on the whiteboard in OpenAI’s first promotional image, which describes the process. The transformer produces a low-resolution version of the image. Then a diffusion model takes over to improve the quality and scale it up. We don’t know yet whether the diffuser applies to each block of pixels or only to the entire image. Some folks have inspected the images ChatGPT sends to the browser and found that the pixels at the top of the image change a bit as the image completes. One last thing we’ve discovered is that it handles transparency by generating an image with the traditional black & white checkerboard background; a background-removal tool is run afterwards.
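We don’t know which background-removal tool OpenAI actually runs, but the post-processing step is easy to picture. Here is a stand-in using the open-source rembg library, with a made-up filename for the checkerboard image the model returned:

```python
from PIL import Image
from rembg import remove   # open-source background removal; a stand-in for whatever OpenAI uses

# Hypothetical filename for the checkerboard image that came back from the model.
generated = Image.open("gpt4o_checkerboard_output.png")

# Cut out the subject and write real alpha in place of the checkerboard background.
transparent = remove(generated)
transparent.save("subject_with_alpha.png")
```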
This was my first try with GPT-4o. Instead of hours of work and dozens of failures, I got nearly perfect results right away. It’s got the correct accessories on each of us, managed to get my beard approximately correct, and even got the pattern of my shirt! You can tell it “understands” what the image is in a way that is deeper than what the pixels are. Maybe someone more skilled with diffusion tools could match this, but it would take them all day.
This is a watershed moment for generative AI, and not only because image generation is another step-function better.
Treating chunks of images like chunks of text marks a significant breakthrough. So much more is possible than the text-to-image and image-to-image that we were using before. Just as OpenAI found emergent behavior when they scaled up GPT to 3.5, we are seeing the same with GPT-4o image generation. The model behaves as though it understands the meaning of the things in the image and their relationships. In the picture of my wife and me, you can see it is thematically correct and gets all the details. However, it has moved things around a little bit; it is not matching pixels!
Diffusers are stuck relating pixels to other pixels. In the training data, a woman and a mirror always show the same reflection, so a diffuser could never do anything else. GPT-4o moves out of this trap to create entirely novel scenes like the one in the header image.
I don’t want to over-hype, so I’ll stick to publicly demonstrated scenarios that were fully impossible before:
Be warned, deepfakes are now also trivial. Any person can be put into any existing image or new image. Or you could send your boss a picture of a flat tire on your car, with a receipt of the repair bill. The unscrupulous will remove watermarks from stock images. These tasks were possible before, but now they are trivial. Don’t trust any picture.
It feels like anything that you want to do visually is now possible. Is that true?
There are some things that GPT-4o cannot do yet. Images with a lot of text will have artifacts. Resolution is limited. There are still places where it cannot go against its training data; I saw someone generate an upside-down poster, but some of the letters didn’t flip and the words were reversed instead. Perhaps we need a Godzilla-sized model?
You can see in these screenshots that there is a selection tool to tell the model what to focus on, although I’m not sure it works yet. I expect GPT-4o is going to get more tools like the background removal that it has now. It could get SVG generation as well as basic editing tools like flip and reverse. Many more tools from open source would be valuable, like inpaint sketching (drawing a simple version of what you want). A more controllable way to do outpainting would be very welcome. And finally there’s latency; when images take only a second or two to generate, whole other use cases open up.
It has never been easier to get incredible results from image generation. With some trial and error, you can create or modify anything you can imagine. There’s never been a better time to get into the hobby. What will you create?