IT 06: AI & Art, Idea-Set-Match

B Some-One II, M. © (2022)

A major recent development in AI research was automated image captioning. Machine learning algorithms could already label objects in images, and now they learned to put those labels into natural-language descriptions.

We can do image to text. Why not try doing text to images?

It was a more difficult task. The researchers didn’t want to retrieve existing images the way a search engine does. They wanted to generate novel scenes that don’t exist in the real world. The first image was a 32 x 32-pixel tile, but it showed the potential of what might become possible.

Generating a novel scene from any combination of text input requires a different approach. That approach relies on huge AI models that learn from every image they are fed. What this means is that we can now create images without having to execute them in paint or with a camera, a pencil, tools, or code. The input is just a simple line of text.

The craft of communicating with these deep learning models has been dubbed “prompt engineering”: if you can find the right words, you can refine the way you talk to the machine. For an image generator to be able to respond to so many different prompts, it needs a massive, diverse training dataset: hundreds of millions of images scraped from the internet, along with their text descriptions.

Latent Space

What do the AI models do with them?

You might think they go through the training data to find related images and then copy over some of those pixels, but that’s not what’s happening. The newly generated image doesn’t come from the training data; it comes from the “latent space” of the deep learning model.

If I gave you two images and told you to match them to two captions, you’d have no problem.

To a machine, though, images are just 1s and 0s: pixel values for red, green, and blue. Shown only those numbers, you’d have to guess, and that’s what the computer does too at first. You could go through thousands of rounds of this and never figure out how to get better at it.

Which grid of 0s and 1s [Image] is related to this other grid of 0s and 1s [Caption]...?

A computer, by contrast, can eventually figure out a method that works. That’s what deep learning does. To understand that one arrangement of pixels is the Black Tree and another arrangement of pixels is a Turtle, it looks for metrics that help separate these images in mathematical space.

How about color?

If we measure the amount of yellow in the image, that puts the tree to the left and the turtle to the right in a one-dimensional space. But once we look at more examples, our yellowness metric on its own isn’t very good at separating turtles from trees. We need a different variable. Let’s add a dimension for roundness.

Now we’ve got a 2D space with the round tree up and the turtle down. But if we look at more data, we may come across a turtle that’s round, and a tree that isn’t. Maybe there’s some way to measure shininess. Turtles are usually shinier than trees, so now we have created a 3D space with three variables. Ideally, when we get a new image, we can measure those three variables and see whether it falls in the turtle region or in the tree region.
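The idea of measuring three variables and checking which region a new image falls in can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the feature values are invented, the helper names (`centroid`, `classify`) are hypothetical, and a "region" is simplified to "whichever class average is closest". A real deep learning model learns its own variables rather than being handed them.

```python
import math

# Invented feature vectors: (yellowness, roundness, shininess), each from 0 to 1.
# These numbers are assumptions made up for illustration, not measurements.
EXAMPLES = {
    "tree": [(0.2, 0.8, 0.1), (0.3, 0.4, 0.2)],    # a round tree and a less round one
    "turtle": [(0.7, 0.3, 0.8), (0.6, 0.9, 0.7)],  # a flat turtle and a round one
}

def centroid(points):
    """Average position of a list of points in feature space."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

# One "region centre" per label.
CENTROIDS = {label: centroid(points) for label, points in EXAMPLES.items()}

def classify(features):
    """Return the label whose centre is nearest to the measured features."""
    return min(CENTROIDS, key=lambda label: math.dist(features, CENTROIDS[label]))

print(classify((0.6, 0.5, 0.9)))  # a shiny, fairly yellow object
print(classify((0.2, 0.7, 0.1)))  # a dull, round, not very yellow object
```

With only three hand-picked variables the two classes separate mainly on shininess, which is exactly why more variables are needed once more kinds of objects enter the picture.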

But what if we want our model to recognize not just trees and turtles, but…all these other things? Yellowness, roundness, and shininess don’t capture what’s distinct about these objects. That’s what deep learning algorithms do as they go through all the training data: they find more and more variables, creating more dimensions, that help improve their performance on the task. In the process, they build a mathematical space with over 500 dimensions.

We as humans are incapable of picturing this multidimensional space, but these AI models use it. This is called latent space. Those 500 dimensions, or axes, represent variables that humans wouldn’t even recognize or have names for, but the result is that the space has meaningful clusters: a region that captures the essence of turtleness.

Another region represents the textures and colors of photos from the 1910s. There is an area for chess and an area for players, and chess players somewhere in between. Any point in this space can be thought of as the recipe for a possible image. The text prompt is what navigates us to that location. But then there’s one more step. Translating a point in that mathematical space into an actual image involves a generative process called diffusion. It starts with just noise and then, over a series of iterations, pixels are arranged into a composition that makes sense to us humans.

Because of some randomness in the process, it will practically never return exactly the same image for the same prompt; every trial generates a slightly different image. And if you enter the prompt into a different model designed by different people and trained on different data, you’ll get a different result, because you’re in a different latent space.

Every AI-generated image sits at certain coordinates within the model it was generated with, so if you have a generated image, you can in principle trace it back to its location in that system.

Prompting Mirrors of the Self

The latent space of these models contains some dark corners that get scarier as outputs become photorealistic. It also holds an untold number of associations that we wouldn’t teach our children but that the AI learned from the internet. If you ask for an image of a boss, it gives you a bald white guy. If you ask for images of nurses, they're all women. We don’t know exactly what’s in the datasets used by these AI companies, but we know the internet is biased toward the English language and Western concepts, with whole cultures not represented at all.

It really is a mirror held up to our society and what we deemed worthy enough to share on the internet in the first place and how we think about what we share.

We are on a voyage here; this is a bigger deal than just the immediate technical consequences. It's a change in the way humans imagine, communicate, work with their own culture.

Part Deux: Running up the Tower of Babel

Babel's Image Archive, Image #8342365033547796

Babel's Image Archive runs an algorithm that creates a randomized landscape of pixels in a 640 by 416-pixel frame using 4096 different colors. The archive contains 4096^266240 unique configurations, one for every way of coloring the 640 x 416 = 266,240 pixels. You can upload any image you have on your computer and get a slightly pixelated version of it in return, along with a string of numbers that corresponds to its location in the archive.
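To get a feel for that number, and for how an image could correspond to a location, here is a small Python sketch. It assumes the simplest possible scheme, reading the pixel colors as the digits of one enormous base-4096 number. That scheme and the function names (`image_to_index`, `index_to_image`) are assumptions made for illustration; the archive's actual algorithm may work quite differently.

```python
import math

WIDTH, HEIGHT, COLORS = 640, 416, 4096
PIXELS = WIDTH * HEIGHT  # 266240 pixels per frame

# Decimal digits needed to write COLORS ** PIXELS, found with logarithms
# so the giant number itself never has to be built.
DIGITS = math.floor(PIXELS * math.log10(COLORS)) + 1

def image_to_index(pixels, colors=COLORS):
    """Read a flat list of color numbers as digits of a base-`colors` integer."""
    index = 0
    for value in pixels:
        if not 0 <= value < colors:
            raise ValueError(f"color {value} is outside 0..{colors - 1}")
        index = index * colors + value
    return index

def index_to_image(index, pixel_count, colors=COLORS):
    """Inverse of image_to_index: recover the color numbers from the integer."""
    pixels = []
    for _ in range(pixel_count):
        index, value = divmod(index, colors)
        pixels.append(value)
    return pixels[::-1]

print(f"{PIXELS} pixels, so the archive size has {DIGITS} decimal digits")

# Tiny demonstration on a 3-pixel "image" with only 4 colors.
tiny = [1, 2, 3]
location = image_to_index(tiny, colors=4)
print(location, index_to_image(location, len(tiny), colors=4))
```

Under this scheme every image has exactly one location and every location exactly one image, which is all that "contains every image" really requires. The size of the archive is a number with nearly a million decimal digits.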

It contains a pixelated picture of the day you were born; it contains an image of every piece of art that currently exists, every piece of art that has ever been created, and every piece of art that could ever be made. It contains every frame of a theoretical low-resolution movie of the universe from the beginning of time to the end, from every conceivable perspective. In fact, it contains every image that could ever exist. You can upload an image and find its location, or if you have 10^961748 years on hand you can click on the universal-slideshow and simply wait for your image to appear.

This web page is a section of another, more famous website called the Library of Babel, which is based on the short story by Jorge Luis Borges, an Argentine writer who often grappled with the idea of infinity.

In the story, the Library of Babel is a seemingly infinite construction of hexagonal walls containing shelves upon shelves of books filled with every possible combination of characters that could fit in 410 pages. It contains everything that could ever be written: Homer, the complete history of the world, a description of your birth as it will occur, as well as many false descriptions of your death. Since there is no filter for meaning, the library is, as you can imagine, overwhelmingly filled with noise.

Monkeys on typewriters

If you have a set of computer programs, or just one computer program, randomly hitting keys on a typewriter for an infinite amount of time, it will almost surely type any given text. It will also almost surely type every piece of text that could ever be written.

The problem is that even with enough computers to fill the entire observable universe, typing away for a period hundreds of thousands of orders of magnitude longer than the age of the universe, the probability of successfully typing the 6 remaining instalments of Edwin Drood is so low it might as well be zero. Technically, though, it isn’t. Despite this, some have attempted experiments, though only under many fundamental constraints.
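How low is "so low"? A rough Python sketch makes the scale concrete. It assumes a keyboard of 26 equally likely keys pressed independently, which is a simplification, and the function names are hypothetical. Because the probabilities are far too small for ordinary floating-point numbers, the code works with base-10 logarithms, that is, with the number of zeros after the decimal point.

```python
import math

def log10_probability(text_length, keys=26):
    """log10 of the chance that `text_length` random keystrokes match one given text.

    Assumes every key is equally likely and every keystroke is independent.
    """
    return -text_length * math.log10(keys)

def log10_expected_attempts(text_length, keys=26):
    """log10 of the average number of attempts before the text appears once."""
    return text_length * math.log10(keys)

# A six-letter word such as "banana": one chance in 26**6.
print(26 ** 6, round(log10_probability(6), 2))

# A text of a hypothetical 100,000 characters (an assumed length, for scale only).
print(round(log10_expected_attempts(100_000)))
```

Every extra character multiplies the number of attempts by 26, so a text of any real length is out of reach long before the universe runs out of time.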

Real monkeys typing pose a different challenge. The simple shape of the keyboard would cause uneven distributions of which keys are hit, also...the monkeys would probably hit a few keys repeatedly, get bored, and then piss on the machine.

The archive is a canvas that, if you are impossibly lucky, could reveal to you the most meaningful images of your life. It could reveal the most powerful work of art you have ever seen. But you will never find anything there; you will just find endless pictures of noise.

You would be extremely lucky to find even a single coherent sentence in the library, or an image in the archive with a concentration of pixels that even approaches looking deliberately made. We could ask a million people to watch their computer screens running the Universal Slideshow just to find something interesting, even just a small group of pixels of the same color, something that isn’t pure random noise.

These libraries house basically every piece of creative work a human being could make, so naturally they call the nature of originality into question. If you ever worry that a story or movie you are planning out isn’t original enough, you’re right: it literally already exists...somewhere.

1st Matter

In the short story by Jorge Luis Borges, a group of “Purifiers” goes around the library and condemns entire walls of books, throwing them down the infinite shaft and getting rid of everything they deem worthless.

But how does one construct a machine to look for meaning? What if there is a text that uses very few real words but is nonetheless extremely moving somehow? How about all the words that haven’t been invented yet? How do you find the truth in a sea of noise? What about abstract artworks?

You may have seen some of the many oddly mesmerizing and musically beeping videos that try to visualize the different ways a computer can sort different elements into an ordered list. Algorithms such as Heap Sort, Quick Sort, and Bubble Sort all have the same goal: to take a bunch of random elements and order them. And then there’s Bogo Sort, easily the most popular sorting algorithm for these videos, not because it’s any good but, quite the opposite, because it is the worst, most useless sorting algorithm. It’s something of a joke.

While Bubble Sort is considered the generic bad algorithm because it is very inefficient, Bogo Sort is as inefficient as someone could possibly get. Bogo Sort takes whatever elements you give it, reshuffles them randomly, checks them, and if they aren’t ordered, reshuffles them again until they are ordered. It’s kind of like someone in the Library of Babel going to a random shelf, picking up a book and expecting to find Homer.
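The shuffle-and-check loop is short enough to write out in full. The Python sketch below is a minimal version: the names `bogo_sort` and `is_sorted`, the optional random generator, and the `max_shuffles` safety limit are additions for illustration. It checks before the first shuffle, so an already sorted list costs zero shuffles.

```python
import random

def is_sorted(items):
    """True if every element is no larger than the one after it."""
    return all(a <= b for a, b in zip(items, items[1:]))

def bogo_sort(items, rng=None, max_shuffles=None):
    """Shuffle at random until the list happens to be sorted.

    Returns the sorted copy and the number of shuffles it took.
    `max_shuffles` is an optional safety limit, because without one
    there is no upper bound on the running time.
    """
    rng = rng or random.Random()
    result = list(items)
    shuffles = 0
    while not is_sorted(result):
        if max_shuffles is not None and shuffles >= max_shuffles:
            raise RuntimeError(f"gave up after {shuffles} shuffles")
        rng.shuffle(result)
        shuffles += 1
    return result, shuffles

# Keep the input tiny: the waiting time explodes with every extra element.
sorted_items, shuffles = bogo_sort([3, 1, 4, 2])
print(sorted_items, "after", shuffles, "shuffles")
```

For a list of n distinct elements, each shuffle has one chance in n! of landing in order, so the average number of shuffles is about n!. That is 24 for four elements and already more than three million for ten.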

The interesting thing, as many have pointed out, is that Bogo Sort is the fastest sorting algorithm of them all…if you are astronomically lucky. It could shuffle the elements perfectly on its first try. Its time to complete a sort ranges anywhere from 0 to infinity. Quantum Bogo Sort comes with an even better guarantee: it generates all possible permutations, one in every universe, and simply destroys every universe except for the one in which the list is sorted. Bogo Sort has a certain allure; people treat it as if it were a person. Some users have uploaded ludicrously long videos, and commenters point out timestamps where Bogo gets so close but ultimately fails. There’s this psychological itch that remains despite all these impossible odds.

Maybe Bogo will work instantly this one time. Maybe the libraries will reveal something to me because I’m special in some way and...well, the odds aren’t technically zero. It’s the same mentality that makes the lottery work.

The great thing about art is that I rarely feel like someone is simply filling in some slot in some universal list, as if we were just computers generating permutation after permutation of every work of art that could be made until we happen to bounce into the “Ultimate Artwork.”

The reality is that the Image Archive of Babel, the Library of Babel, and the Audio Library of Babel all contain everything that could ever be written, imagined, or heard, but you will never find anything useful. You simply can’t. They’re about as useful in finding something of meaning as saying: “the meaning of life exists somewhere”. The only way you find meaningful art in the Library is by compiling it yourself.

Imaginational Theory VI: AI & Art, Idea-Set-Match Compiled by M. Moonen 08.2022 [EDUCATIONAL PURPOSE ONLY] Triple-A Society, M. Production

Critic:  “AI-Image-text-prompt-generators are just machine-learning-tools after all, like a pencil, a camera. Don't forget in the 19th century the most brilliant minds said the camera is going to end art. Well, it didn't.

First you see the picture: that is a boring dog, a replica of an old bird, or an advertisement. The true question is, is it good art or is it bad art? A surrealistic parrot as a variation of pictures of old parrots combined with Dali, Golden Earrings, this is a copy of Vermeer, this is crappy. It has no imagination or creativity. If they just could program something like the end of Edwin Drood, written by Marcel Proust, and it does not come out looking like typical steampunk...”