The “New Alchemy” of Generative Modeling

In the very early days of the study of physics, natural philosophers extended their world-understanding efforts to the pursuit of immortality-granting elixirs, cure-all potions, and the transmutation of base metals into precious metals. In this latter application, the precious targets were typically gold and silver, and the respective base sources were most often lead, copper, or mercury. In alchemical parlance, this conversion was known as chrysopoeia when gold was the target and as argyropoeia when silver was the target. A mythical substance known as the philosophers’ stone was thought to serve as a catalyst to make these conversions possible.

Despite scattered legends to the contrary, it is generally accepted that no such transmutations took place in alchemists’ labs. This was not because the task is impossible, though. It turns out that converting lead into gold is doable, although at such a staggering cost and at such a paltry yield that you’re better off just mining it yourself or buying it from someone else. Despite the considerable experimental skill of the alchemists, one fundamental piece of knowledge was unavailable to them at the time: the different atomic configurations of the source and target metals they were working with. That knowledge gap only closed as natural philosophy matured over the centuries into physics, chemistry, and the modern natural sciences in general.

Within machine learning, there is a new sort of alchemical pursuit, namely that of transmuting plentiful, cheap “base data” into rare, expensive “precious data.” It goes by the name generative modeling, or, more precisely, implicit generative modeling (IGM).1 Recent work in IGM has caught the public’s attention, with varying levels of excitement and concern. IGM can produce realistic images of people who do not exist,2 extremely convincing dialogue in response to user input, high-quality custom images tailored to a user’s wildest descriptive input, and even music that matches a text description. Neighboring applications in current research include various techniques to transform already-coherent signals within the same modality, such as in voice cloning/conversion and face swapping (a.k.a. “deepfaking”).3 Similar though they are, I will consider these methods distinct from the general IGM problem.

In effect, IGM is about converting randomness into order, since the “precious” data most often encountered in current research is coherent text, audio, or imagery. But ultimately, IGM is really about converting one type of randomness into another type of randomness, just configured differently. The most challenging part of appreciating this process is accepting the idea that something so orderly and meaningful to us as natural images or human language can be considered random.

Take images, for example. Readers who are old enough will remember being impressed when digital cameras advanced enough to take megapixel images, which in color translates to 1024 by 1024 pixels, each carrying three values for the red, green, and blue (RGB) channels of the image.4 In digital computers (which include digital cameras), color intensity values are discretized, generally over a range from 0 to 255. An all-black image would have RGB pixel values of [0,0,0] and an all-white image would have values of [255,255,255]. All other colors are somewhere within this $256^3$ cubic-unit box.
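To make that representation concrete, here is a minimal sketch using NumPy; the array shape and `uint8` dtype are standard image-handling conventions, not anything specific to this post:

```python
import numpy as np

# A "megapixel" color image as a height x width x channels array of
# 8-bit intensities, one value per red, green, and blue channel.
black = np.zeros((1024, 1024, 3), dtype=np.uint8)      # every pixel is [0, 0, 0]
white = np.full((1024, 1024, 3), 255, dtype=np.uint8)  # every pixel is [255, 255, 255]

print(black[0, 0])   # pixel at the top-left corner: [0 0 0]
print(white[0, 0])   # [255 255 255]
print(256 ** 3)      # distinct colors a single pixel can take: 16777216
```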

Consider for a moment the space of all possible megapixel color images. The number of all such images is finite but stupendously large,5 yet it includes a megapixel-size representation of everything that can be captured as an image, from honest visual depictions of everything that has ever happened and everything that will ever happen to complete alternate-universe fabrications and fantasies. Every page of every book, spectrograms of any sound possible, scores for unwritten Mozart symphonies … they all exist in this space of possibilities.

We can naïvely assume that all such images are uniformly distributed, meaning that each possibility is equally probable. This is a perfectly valid data distribution, even if it differs from the target we have in mind. We know it at least contains the target we have in mind. And it turns out to be trivial to create images from this distribution. In fact, in just a few lines of code, you can write a routine to exhaustively run through every one of these image possibilities and save the result to your hard drive. The problem, though, is that even if you could generate 1000 of these images per second, it would take you over $10^{10^{6.88}}$ times the age of the universe to cover them all. (That’s a one with over seven and a half million zeros after it.) Then there’s the issue of storage space.
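The arithmetic behind that estimate is easy to check in log space. The age of the universe is taken here to be roughly $4.35 \times 10^{17}$ seconds (about 13.8 billion years), an assumption of this sketch:

```python
import math

# The count of distinct 1024 x 1024 RGB images with 256 levels per channel
# is 256 ** (1024 * 1024 * 3) -- far too large to write out, so work in log10.
log10_images = (1024 * 1024 * 3) * math.log10(256)
print(f"about 10^{log10_images:.0f} images")        # about 10^7575668 images

# At 1000 images per second, how many ages of the universe to enumerate them?
age_of_universe_s = 4.35e17                          # assumed: ~13.8 billion years
log10_ratio = log10_images - math.log10(1000) - math.log10(age_of_universe_s)
print(f"about 10^{log10_ratio:.0f} universe-ages")
```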

Intuitively, we know that the vast, vast majority of these images will be garbage, not corresponding to anything even remotely interesting. In fact, we suspect that the ratio of “interesting” images to all possible images is a number so small that it is practically zero, even if the number of “interesting” images is itself enormous. The number we’re dividing by is just that much more enormous.

The issue is that what we find interesting—what corresponds to objects in reality—has a structure to it that is not reflected in the naïve assumption of equal probability for each possible combination of RGB values. We know that the distribution of interesting images is shaped differently than the solid hypercube suggested by the uniform distribution, but we have no idea what it looks like (independent of our inability to visualize higher dimensions). Nevertheless, it no longer seems so strange to think about natural images as being drawn from probability distributions, even if we don’t know what those distributions are.

IGM is all about learning how to approximate these distributions through clever means without ever being able to explicitly write down what the distributions are. Those “clever means” are essentially philosophers’ stones for the alchemy of generative modeling and typically consist of a variety of metrics for measuring the distance to a distribution or the distance between one distribution and another. The math that takes place under the hood in modern generative models serves to optimize these metrics, catalyzing the process of shrinking that distance to as small as possible. The remarkable thing is that it works.
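As a toy illustration of what such a metric looks like (a generic example, not the loss of any particular model), here is the Kullback–Leibler divergence between two discrete distributions:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as probability vectors."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p_i = 0 contribute nothing to the sum
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

uniform = [0.25, 0.25, 0.25, 0.25]  # the naive "everything equally likely" view
peaked  = [0.70, 0.10, 0.10, 0.10]  # probability concentrated on one outcome

print(kl_divergence(peaked, uniform))   # positive: the distributions differ
print(kl_divergence(uniform, uniform))  # 0.0: zero distance to itself
```

Training a generative model amounts to nudging the model’s distribution so that a quantity of this general kind shrinks toward zero.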

In my next post, I will begin a series of discussions about the “philosophers’ stones” that modern IGM works with. As this series develops, I will discuss some rarely mentioned (or at least underexplored) connections among these quantities and other objects studied elsewhere in the mathematical literature. It will also become evident that many approaches to IGM wind up applying a slightly different flavor of essentially the same technique.

Notes and References
  1. In other uses within statistics, a generative model is an explicit model of the data distribution, in that it assigns a probability (or probability density) to each piece of data. By contrast, an implicit generative model creates data drawn from that distribution but does not necessarily assign a numerical density or probability to it. Intuitively, the difference between the two is the difference between evaluating the probability density of a given data point under the normal distribution versus generating a random sample from the normal distribution.
  2. This work is based on StyleGAN2 from researchers at NVIDIA.
  3. This is a shameless self-citation.
  4. In computer science and digital technology, one typically uses the nearest power of two to the quantity of interest. So although kilo corresponds to one thousand of something, a kilobyte is $2^{10} = 1024$ bytes, not 1000 bytes.
  5. It comes to $256^{1024 \times 1024 \times 3} \approx 7.8 \times 10^{7575667}$, which is finite but might as well be infinite from a practical standpoint.
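The explicit/implicit distinction in note 1 can be sketched with the normal distribution, where the density function plays the explicit role and the sampler the implicit one:

```python
import math
import random

def normal_density(x, mu=0.0, sigma=1.0):
    """Explicit model: assigns a probability density to any given x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def normal_sample(mu=0.0, sigma=1.0):
    """Implicit model: produces draws from the distribution
    without reporting their density."""
    return random.gauss(mu, sigma)

print(normal_density(0.0))  # ~0.3989: the density at the mean
print(normal_sample())      # some draw; no density value attached to it
```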





