If you haven't heard of it already, here's an extremely brief non-technical explanation of how Stable Diffusion and other so-called "AI" image generation tools work:

  • You get a huge pile of existing art.
  • You label each piece of art in various ways, like "trees; grass; man running; sunny day; Tom Baker; person in foreground; comfy scarf".
  • The computer takes in the art, and organizes it by the labels.
  • Then the computer compares all the art pieces with each other, looking at them through a series of lenses that get progressively blurrier. With the blurriest lens, the pieces are almost identical: big fuzzy blobs. (This "blurry lens" idea is sketched in code just after this list.)
  • The computer remembers all these comparisons by compressing them in a very clever way.
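
To make the "blurry lens" idea concrete, here's a toy sketch in Python. The real system adds noise to images in a compressed "latent" space rather than literally blurring them, so treat this as an illustration of the intuition, not the actual algorithm:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    # A stand-in for one piece of source art: a 64x64 grayscale image.
    art = np.random.rand(64, 64)

    # Look at it through a series of progressively blurrier "lenses".
    for sigma in [0.5, 1, 2, 4, 8, 16]:
        view = gaussian_filter(art, sigma=sigma)
        print(f"blur sigma={sigma:5.1f}  contrast remaining={view.std():.4f}")

Run it and the contrast number shrinks toward zero as the blur goes up: at the far end, every piece of art really has collapsed into the same featureless blob.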

Now, here's what you do with that:

  • You give the computer some labels it's familiar with, like "trees; Tom Baker".
  • The computer then makes a canvas the same size as all the art pieces it's looked at before, and plops a random fuzzy blob onto it.
  • Then, a little bit at a time, it tries to "re-focus" the fuzzy blob into an image by adding random bits of contrast. Each time, it asks itself, "Does this look more like a photo with trees or Tom Baker in it? Or less?" If the answer is more, it keeps that change. If less, it rejects that change and tries another.
  • And so on, for as long as you're willing to let it fiddle with the image. (This keep-or-reject loop is sketched in code below.)
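
Here's that loop as a toy Python sketch. Everything in it is a stand-in: the real system runs a trained neural network for a few dozen carefully scheduled de-noising steps, rather than accepting or rejecting single dabs of contrast, but the shape described above - propose a change, ask "more like the prompt, or less?", keep it only if the answer is "more" - is what the code shows:

    import numpy as np

    rng = np.random.default_rng(0)

    def looks_like_prompt(image):
        # Stand-in for the trained network's judgment of how much the
        # image resembles "trees; Tom Baker". Here we just reward
        # similarity to a fixed gradient so the loop has something to chase.
        target = np.linspace(0, 1, image.size).reshape(image.shape)
        return -np.abs(image - target).mean()

    # Start with a random fuzzy blob, the same size as the training art.
    canvas = rng.random((64, 64))

    # A little bit at a time: try a random dab of contrast, and keep it
    # only if the canvas now looks MORE like the prompt.
    for step in range(10_000):
        y, x = rng.integers(0, 64, size=2)
        trial = canvas.copy()
        trial[y, x] += rng.normal(scale=0.1)
        if looks_like_prompt(trial) > looks_like_prompt(canvas):
            canvas = trial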

Of course, the nature and quality of the results you get out depend heavily on what you've fed into the machine.

More specifically, the results also depend on how much consensus there is among the people who made the labels. Here's an example:

Of the zillions of labeled images fed into Stable Diffusion (or at least the version that generated what you see above), there's an obvious trend in the ones people tagged "afternoon" and/or "golden" and/or "golden afternoon". I never told the program to make images of the outdoors, or of trees, or grass. That came along for the ride because of how the humans labeled the source art, which probably contained lots and lots of photographs taken by people standing around in parks at sunset.

It's important to keep in mind that the four images above do not depict any place in particular. They are not pre-existing pictures that were picked out of the source art based on the keywords; they are constructed images, and the places they appear to show do not really exist. Even partially. The system is not borrowing grass from one image and trees from another and stapling them together. It's making new images that resemble the ones the humans associated with "golden afternoon", including details that may not have been the primary reason the humans labeled them that way.

For example, the image on the lower right appears to have a lake in it. We didn't tell the computer we wanted a lake, but some of the images we fed in under the label "golden afternoon" had lakes in them. So, at some point in the de-blurring process, the computer decided that the image looked more "golden-afternoon-y" if that blob resolved itself into a lake.

That seems sensible. But here's the more interesting bit: This associative power doesn't just apply to things that we've labeled in the images. It also applies to things we didn't label. Even if the computer was never told about lakes at all, it might still put one in the generated image, just because there was sometimes one in the source images.

And even more interesting: This also applies to things that we humans do not even recognize as objects in the images ... and things we may not have the vocabulary to describe. For example, the computer was never told how to apply the back-lit effect of the setting sun to the leaves of trees. It may not even know what leaves are. It has no concept of three-dimensional space, let alone how light moves through it. All it knows is how to ask the question, "Does this little change make it more, or less, overall, like an image with the description I've been given?" And that's it. But that simple question can go a long way.
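
In the real system, the thing answering that question is a model called CLIP, which was trained on millions of image-plus-caption pairs scraped from the web. Stable Diffusion uses CLIP's understanding of the text internally to steer the de-noising rather than literally re-scoring every change, but you can put the question to CLIP directly. A sketch using the publicly released model (the filename "candidate.png" is a made-up placeholder):

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("candidate.png")  # a hypothetical in-progress canvas
    inputs = processor(
        text=["a golden afternoon", "a pile of laundry"],
        images=image,
        return_tensors="pt",
        padding=True,
    )
    with torch.no_grad():
        scores = model(**inputs).logits_per_image
    print(scores)  # higher score = better match for that description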

For example, if you feed a million images into the computer, and a thousand of them are labeled "scary", the computer will get more or less trained to tell the difference between images that are scary and ones that aren't. Especially if you've labeled the incoming images with a ranking, from "kind of scary" all the way up to "extremely scary".
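
In machine-learning terms that's a classifier, and the principle fits in a few lines. This stand-in uses scikit-learn and random numbers in place of real images; the real system learns the same discrimination implicitly, inside one huge network, rather than as a separate program:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)

    # Stand-ins: a thousand images reduced to feature vectors, each
    # with a human-assigned "scary" label. We pretend feature 0
    # secretly correlates with scariness.
    features = rng.normal(size=(1000, 32))
    is_scary = features[:, 0] + rng.normal(size=1000) * 0.5 > 0

    clf = LogisticRegression().fit(features, is_scary)
    print(f"tells scary from not-scary {clf.score(features, is_scary):.0%} of the time")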

It can also learn the extremes automatically, through negative comparison. For example, if you feed it a million images of people, and a thousand of them are of Tom Baker and are labeled as such, the computer will be processing a whole lot of images of people that might look a little like Tom Baker, or even a lot like Tom Baker, but will not actually be Tom Baker. And because of that skew in the data, when you ask the computer to draw a picture with Tom Baker in it, it will use its training to draw a person that looks EXTREMELY TOM BAKER. It will know - without consciously knowing - all the nuances that set Tom Baker's face (and shape and clothing and pose) apart from everyone else's, and it will go for them.

Same for Mr. Bean, assuming images of him were also included.

And that's why, if you tell the computer to draw "Tom Baker as Mr. Bean," you end up with this MONSTROSITY:

This is essentially a drawing that the computer has constructed by iterating on a random blob until it looks more and more and more like it's got Tom Baker or Mr. Bean in it, and unlike a human artist with a sense of proportion, it doesn't know when it's done.
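
If you'd like to produce this kind of monstrosity yourself, the open-source diffusers library will run Stable Diffusion in a few lines of Python. A sketch, assuming a machine with an NVIDIA GPU; guidance_scale is the knob for how hard the model chases the prompt, and cranking it up past the default of 7.5 is a good way to see the "doesn't know when it's done" effect:

    import torch
    from diffusers import StableDiffusionPipeline

    # Downloads a public Stable Diffusion checkpoint (several GB) on first run.
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # Higher guidance_scale = the model over-commits to the prompt.
    image = pipe("Tom Baker as Mr. Bean", guidance_scale=12).images[0]
    image.save("monstrosity.png")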

This total lack of awareness becomes painfully clear when you ask it to render things that contain text. For example, if you hand it the prompt "Dungeon Master", you get stuff that looks like this:

Fun Fact: Most of these gibberish titles are actually the names of streets in Denmark!*

(* This Fun Fact has not been peer-reviewed.)

What's happening here is, the computer doesn't know what is and isn't text in the source images, let alone how to read it. Some of the source images labeled "dungeon master" may actually contain that phrase printed in them, some might not, and some will have other words as well. But the whole point of what the computer is doing is to construct new images that are a synthesis, never an exact copy. And so, a result with the bold title "DUGNGON MASNSEN" might easily be explained as the visual combination of "DUNGEON MASTER" with the single word "DUNGEON" and the single word "MASTER", all trying to occupy the same space, to resemble the most images at once.

The result is indeed similar to what we expect, but we see it as a failure because written words are an all-or-nothing proposition: Either a word is correctly spelled using properly shaped letters, or it's not the word.

Trees, grass, buildings, faces, and almost all other things we would recognize in an image are less complicated - and less narrow in their correctness - than a written word. And words are even harder for the computer because the images in the source set labeled "dungeon master" very likely contain other words too, and it has no opinion whatsoever on which part of the image says "dungeon" versus which part says "master" or "adventure" or "magic", et cetera.

There is sometimes an enormous gap between an image that resembles something and an image that actually is something. One of my favorite demonstrations of this is asking the computer to give you a picture of a maze. It will absolutely look like one from a distance, but it will also absolutely not be a maze you can solve.
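
You can check that claim mechanically: reduce the generated image to black-and-white and flood outward from the entrance; on a generated "maze", the search almost never reaches the exit. A toy checker (the filename and the start/goal pixel coordinates are placeholders you'd pick by eye):

    from collections import deque

    import numpy as np
    from PIL import Image

    def is_solvable(png_path, start, goal):
        # Treat light pixels as corridor and dark pixels as wall.
        grid = np.array(Image.open(png_path).convert("L")) > 128
        queue, seen = deque([start]), {start}
        while queue:
            r, c = queue.popleft()
            if (r, c) == goal:
                return True
            for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if (0 <= nr < grid.shape[0] and 0 <= nc < grid.shape[1]
                        and grid[nr, nc] and (nr, nc) not in seen):
                    seen.add((nr, nc))
                    queue.append((nr, nc))
        return False

    print(is_solvable("maze.png", start=(4, 4), goal=(250, 250)))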

I had a lot of fun throwing in supplementary keywords here, because I just like the visual style of a classic black-and-white maze combined with other things. My favorite was to blend them with variations on "stained glass window" because the idea of finding a maze built into one seems really cool to me. The system can make a pretty convincing stained glass window, so the source art must have had some good examples.

The source artwork must also contain a lot of stuff on the fringes of pop culture. For example, if you add in "skeksis" you get images that incorporate those ugly bird-like antagonists from The Dark Crystal:

Having these pop out of the machine in seemingly endless variation with no effort on my part was inspiring. I immediately wanted to find someone I knew who wasn't already aware of Stable Diffusion and show them an image, so they might be fooled into thinking it was from a real church somewhere. Then I could spin up some ludicrous tale about bird-worshipping Pagans on some tiny coastal island before confessing and explaining that the image was fake.

The lesson from that - aside from the obvious one about how easily this stuff lends itself to forgery - is how well this kind of generated art works for brainstorming: making "concept sketches", fooling around with ideas, or deciding how to compose a drawing. For example:

Any one of these could have been cover art for a techno CD back in the 1980s. And of course, that's thanks to all of the hard work of the artists who made the art that was fed into the machine, as well as all the work people put in assigning labels to that art, including labels that drifted into the subjective.

Which leads to an important and difficult question: What do you do if you're Bernie Wrightson, and people can stick a prompt like this into Stable Diffusion?
