The Long Walk Home: Statistical Distances, Part 1

A journey of a thousand miles, says the old Chinese proverb, begins with a single step. Ideally, that first step should be to establish how long the journey will be. For one thing, you’ll want to make sure that you have the right shoes and have packed enough snacks. But perhaps more importantly, knowing the distance you have to travel is also useful when even the direction in which to travel is uncertain.

Hot or Cold

This idea is baked into the childhood game of “Hot or Cold,” in which a target object is hidden from a player, whose only clues as to where it is are given by his or her opponent’s feedback. “You’re getting warmer” means the player is moving toward the object. “You’re getting colder” means the player is moving away from it. Being able to repeatedly check a numerical distance to a target while underway is much like playing “Hot or Cold.” If the distance goes down, you’re getting warmer; if it goes up, you’re getting colder.

In implicit generative modeling (IGM), we also have a target object, namely that of a probability distribution over some sort of interesting data, whether it be images, text, audio, or something else entirely. It will almost always be the case that we have no idea what this distribution is—meaning that we can’t write down a formula for it—and have only a sample of examples drawn from it. For example, if we are interested in modeling the distribution over pictures of cats, we will want to have lots of pictures of cats as our representative sample.

The M in IGM stands for modeling, so our goal is to build a model that can mimic this distribution. These days, this model is very likely to be some sort of deep neural network, which will take some input and produce output whose dimensions match those of our target. By dimensions I mean size here. In my previous post, I gave the example of megapixel images, which consist of $d = 1024 \times 1024 \times 3$ values, so we will want the output of our model to be $d$-dimensional,1 or, more precisely, to lie in $\mathbb{R}^d$, the space of $d$-dimensional real vectors.
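
To make the shapes concrete, here is a minimal sketch in NumPy. The latent dimension and the single random linear map standing in for a real deep network are arbitrary choices for illustration, and we use a tiny $d$ so the snippet runs instantly:

```python
import numpy as np

rng = np.random.default_rng(0)

# For megapixel RGB images, d = 1024 * 1024 * 3 = 3,145,728; we use a tiny
# d here so the sketch runs instantly.
d, z_dim = 16, 4

# A stand-in "model": a single random linear map from latent noise to R^d.
# A real generator would be a deep network; the shapes are the point here.
W = 0.01 * rng.standard_normal((d, z_dim))

def generate(n):
    """Draw n latent vectors and push them through the model."""
    z = rng.standard_normal((n, z_dim))
    return z @ W.T                 # shape (n, d): n samples in R^d

print(generate(5).shape)           # (5, 16)
```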

By design, our model produces output that has the right size, but there is next to no chance that it will match any of the other qualities of the target data right off the bat. It will produce something, and that something will have its own probability distribution, but we suspect that this distribution is pretty “far” from the target. But what do we mean by “far”? This is where the concept of a statistical distance comes in.

The Making of a Distance

Not all statistical distances—and there are a lot of them—are true distances. In mathematics, a distance (or metric) $d(x,y)$ must have the following properties:

  1. The distance should be zero if you’re already at your destination: $d(x,x) = 0$.
  2. The distance should be positive everywhere else: $d(x,y) > 0$ if $x \neq y$.
  3. The distance from $x$ to $y$ is the same as from $y$ to $x$: $d(x,y) = d(y,x)$.
  4. Stopping off somewhere else never shortens the distance: $d(x,y) \leq d(x,z) + d(z,y)$.

Properties (1) and (2) are sometimes referred to as the identity of indiscernibles condition, while (3) is a symmetry condition, and (4) is known as the triangle inequality.
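
As a quick sanity check, here is a short NumPy snippet verifying all four properties for the ordinary Euclidean distance (the points and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

def dist(x, y):
    """Plain Euclidean distance, a bona fide metric on R^d."""
    return np.linalg.norm(x - y)

x, y, z = rng.standard_normal((3, 5))         # three random points in R^5

assert dist(x, x) == 0.0                      # (1) zero at the destination
assert dist(x, y) > 0.0                       # (2) positive elsewhere
assert np.isclose(dist(x, y), dist(y, x))     # (3) symmetry
assert dist(x, y) <= dist(x, z) + dist(z, y)  # (4) triangle inequality
print("All four properties hold.")
```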

Let the (unknown) target probability distribution be represented by $p$. For a data point $\mb{x} \in \mathbb{R}^d$, computing $p(\mb{x})$ would give us the probability density2 of the point $\mb{x}$. Of course, since we don’t know $p$, we can’t actually perform this computation.3 As soon as we initialize it, our not-yet-trained model is producing something, which we will say is distributed according to $q_t$, where the $t$ subscript indicates that we expect (and want) $q_t$ to change over time. We also don’t know what $q_t$ is. What we want to know is how far $q_t$ is from $p$ at any time $t$ so we can figure out if we’re hot or cold and eventually achieve our goal, namely $q_t = p$. But how can we know the distance between two distributions when we don’t even know what either distribution is?

We seem to have no option other than to estimate something from the data. Thinking back to statistics class, we recall something about testing whether two means are equal,4 and this gives us an idea: We’ll define a distance between $p$ and $q_t$ as $$d(p, q_t) = \| \E_{\mb{x} \sim p} [\mb{x}] - \E_{\mb{y} \sim q_t}[\mb{y}] \| = \| \bar{\mb{x}} - \bar{\mb{y}} \|,$$ where $\E$ denotes the expectation (averaging) operator and the double vertical bars denote the norm. In other words, we represent the “distance” between $p$ and $q_t$ by the norm (i.e. the Euclidean distance) of the difference of the averages of the respective data samples. It’s easy to calculate: we compute the average over the sample of target data, getting $\bar{\mb{x}}$, then generate a bunch of samples from our model and average over those, getting $\bar{\mb{y}}$. The distance between these averages is then our distance between the distributions, and the procedure we have carried out is known as a two-sample test. Mission accomplished, right?
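
In code, this naive two-sample distance is just a couple of averages. The sketch below uses synthetic Gaussian samples as stand-ins for the target data and the model output, since we have no real data here:

```python
import numpy as np

rng = np.random.default_rng(2)

def mean_distance(x, y):
    """Norm of the difference of the two sample means."""
    return np.linalg.norm(x.mean(axis=0) - y.mean(axis=0))

x = rng.normal(loc=3.0, size=(10_000, 2))   # stand-in for the target sample ~ p
y = rng.normal(loc=0.0, size=(10_000, 2))   # stand-in for model output ~ q_t

print(mean_distance(x, y))                  # ~ ||(3,3) - (0,0)|| = 4.24...
```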

Not quite. While it’s true that, defined this way, $d(p, q_t) = 0$ if $p = q_t$, it is not true that $d(p, q_t) = 0$ only if $p = q_t$. In fact, there are infinitely many distributions that aren’t $p$ but share its mean. To see this, just replace every generated data point $\mb{y}$ with the translation $\mb{y}' = \mb{y} - \bar{\mb{y}} + \bar{\mb{x}}$ (that is, subtract off the mean of the $\mb{y}$ variables, which makes their mean zero, and then add the mean of the $\mb{x}$ variables), and $\| \bar{\mb{x}} - \bar{\mb{y}}' \|$ will vanish. To be fair, making this simple translation will move $q_t$ closer to $p$ in some sense, but it’s not good enough for our purposes.
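
A quick numerical illustration of the failure, using an exponential sample as a stand-in for $q_t$: after the translation above, our “distance” reads zero even though the two distributions have wildly different shapes:

```python
import numpy as np

rng = np.random.default_rng(3)

x = rng.normal(loc=3.0, scale=1.0, size=100_000)   # target sample ~ p
y = rng.exponential(scale=5.0, size=100_000)       # a very different shape

# The translation from the text: shift y to share x's mean exactly.
y_shifted = y - y.mean() + x.mean()

print(abs(x.mean() - y_shifted.mean()))  # ~0: the naive "distance" vanishes...
print(x.std(), y_shifted.std())          # ...but the spreads differ: ~1.0 vs ~5.0
```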

We also see that we have violated condition (2) of what a distance should be, since our “distance” can vanish even when we’re not at our destination. It turns out, though, that many perfectly acceptable measures of one distribution’s proximity to another violate at least one of the criteria for “distancehood” defined above (ideally not the second condition, though). When that happens, we call them something else, such as discrepancies, divergences, semimetrics, or pseudometrics.

The Principle of Moments

The problem with our naïve definition is that the mean is just one of many—potentially infinitely many—statistical moments, which together describe the shape of a distribution. Moments are just expectations (averages) of functions of the data. Technically, any function will do, but the most common choices are powers of the data (i.e. $\E [ x^k ]$) or powers of the data minus its mean (i.e. $\E[(x - \mu)^k]$), in which case they are referred to as central moments. They tell you everything you need to know about the distribution,5 and matching just one of them won’t cut it: moving a distribution to match another distribution’s first moment (the mean) gives it the right general location, but the shape won’t be right unless the other moments are matched as well.
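
Estimating moments from data is nothing more than averaging. A small NumPy sketch, using a synthetic Gaussian sample:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=2.0, scale=1.5, size=1_000_000)

raw     = [np.mean(x**k) for k in range(1, 5)]               # E[x^k]
central = [np.mean((x - x.mean())**k) for k in range(1, 5)]  # E[(x - mu)^k]

print(raw)      # first entry ~2.0: the mean
print(central)  # second entry ~2.25 (= 1.5^2): the variance; third ~0 by symmetry
```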

If moments are just averages over functions of the data, are there functions that one can use to capture all the moments at once? Yes, there are. Consider the function $\exp ( \langle \mb{t}, \mb{x} \rangle )$, where $\langle \mb{t}, \mb{x} \rangle = \mb{t}^\top \mb{x} = \sum_{i=1}^d t_i x_i$ is the inner product. If we expand out the expectation of this function, we have $$M_p(\mb{t}) = \E_p \left[ \exp ( \langle \mb{t}, \mb{x} \rangle ) \right] = \sum_{k=0}^\infty \frac{\E_p \left[ \langle \mb{t}, \mb{x} \rangle^k \right]}{k!},$$ and we see that expectations of all powers of the data are contained in the expansion, which suggests that the moments might also be encoded in there somehow.

$M_p(\mb{t})$ is known as the moment generating function (MGF) of the distribution $p$. It gets its name because computing the $k$th derivative of $M_p(\mb{t})$ with respect to $\mb{t}$ and evaluating it at $\mb{t} = 0$ yields the $k$th moment of $p$. That is, $$\mu_k = \mb{D}^{(k)}_{\mb{t}} M_p(\mb{t}) \vert_{\mb{t} = 0},$$ where $$\mb{D}^{(k)}_{\mb{t}} = \frac{\partial^k}{\partial t_1^{\iota_1} \partial t_2^{\iota_2} \cdots \partial t_d^{\iota_d}}$$ is the multivariate differential operator with respect to the variable $\mb{t} \in \mathbb{R}^d$, with $\iota_i \in \{ 0,1 \}$ and $\sum_{i=1}^d \iota_i = k$. For $k=1$, this corresponds to the gradient. For $k=2$, it returns the $d \times d$ Hessian.6

The whole trick of moment generation through taking the derivative of the MGF is hardly obvious in the multivariate case, though, especially since for $k > 2$, these derivatives become multidimensional arrays that are difficult to wrap one’s head around, let alone visualize. It is far easier to appreciate this property in the univariate case, since $$M_p(t) = \sum_{k=0}^\infty \frac{t^k \E_p \left[ x^k \right]}{k!},$$ and it becomes much more straightforward to verify that the $k$th derivative evaluated at $t = 0$, $\frac{\D^k}{\D t^k} M_p(t) \vert_{t=0}$, yields $\mu_k = \E_p [x^k]$.
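If you’d like to check this without grinding through derivatives by hand, here is a short SymPy sketch using the standard normal, whose MGF has the well-known closed form $M(t) = e^{t^2/2}$:

```python
import sympy as sp

t = sp.symbols("t")
M = sp.exp(t**2 / 2)   # MGF of the standard normal

# The k-th derivative at t = 0 should reproduce the k-th raw moment:
# 0, 1, 0, 3, 0, 15 for k = 1..6 (odd moments vanish; even ones are (k-1)!!).
print([sp.diff(M, t, k).subs(t, 0) for k in range(1, 7)])   # [0, 1, 0, 3, 0, 15]
```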

From Moments to (Almost) Metrics

The multivariate or joint MGF captures all of the information necessary to uniquely identify a distribution, including any dependencies among the dimensions. Two distributions’ MGFs are equal at all arguments if and only if the distributions are identical. So we may define a distance between distributions by comparing their MGFs at various arguments $\mb{t}$. That is, we might define $$d_{\mb{t}}(p, q_t) = | M_p(\mb{t}) - M_{q_t}(\mb{t})|^2$$ as our measure for a specific value of $\mb{t}$ and then average over many values of $\mb{t}$ to get a global measure: $d(p, q_t) = \E_{\mb{t}} [d_{\mb{t}}(p, q_t)]$. This will vanish if and only if the distributions are equal.
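
As a toy illustration (not something we could do in practice, since it uses the closed-form MGF of the normal, $M(t) = \exp(\mu t + \sigma^2 t^2 / 2)$, and an arbitrary distribution over $t$), here is this MGF-gap measure for a pair of univariate Gaussians:

```python
import numpy as np

rng = np.random.default_rng(5)

def mgf_normal(ts, mu, sigma):
    """Closed-form MGF of N(mu, sigma^2): exp(mu*t + sigma^2 * t^2 / 2)."""
    return np.exp(mu * ts + 0.5 * (sigma * ts) ** 2)

# Average the squared MGF gap over arguments t drawn near the origin,
# where the exponentials stay numerically tame.
ts = rng.uniform(-1.0, 1.0, size=10_000)

def mgf_gap(mu1, s1, mu2, s2):
    return np.mean((mgf_normal(ts, mu1, s1) - mgf_normal(ts, mu2, s2)) ** 2)

print(mgf_gap(0.0, 1.0, 0.0, 1.0))   # identical distributions -> exactly 0
print(mgf_gap(0.0, 1.0, 1.0, 2.0))   # different distributions -> clearly > 0
```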

But let’s not speak too soon: The MGF captures all of the information necessary to uniquely identify a distribution … if the MGF exists. And it doesn’t always exist: for it to, the defining expectation must be finite for all $\mb{t}$ in some neighborhood of the origin (that is, around $\mb{0}$), and plenty of distributions fail this condition. Well-known examples include the log-normal distribution and the Cauchy distribution.

So now what? As it turns out, a simple modification to the function of the data that we use to compute moments yields an expectation that is guaranteed to converge for all arguments and all distributions. The resulting function is known as the characteristic function, and its definition is $$\varphi_p(\mb{t}) = \E_p [\exp ( \iu \langle \mb{t}, \mb{x} \rangle ) ] = \E_p [\cos \langle \mb{t}, \mb{x} \rangle ] + \iu \E_p [\sin \langle \mb{t}, \mb{x} \rangle ],$$ where $\iu = \sqrt{-1}$ is the imaginary unit. The characteristic function yields a complex number with real (the cosine part) and imaginary (the sine part) components. Since each of those components is bounded (the cosine and sine functions never leave the range $[-1, 1]$), the expectation is guaranteed to exist for any argument $\mb{t}$. Like the MGF, the characteristic functions of two distributions are equal at all arguments if and only if the distributions are equal.
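
The empirical characteristic function is just the sample average of $\exp ( \iu \langle \mb{t}, \mb{x} \rangle )$, and it behaves even for the heavy-tailed Cauchy, whose true characteristic function is $e^{-|t|}$. A minimal univariate NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(6)

def ecf(x, ts):
    """Empirical characteristic function of the sample x at arguments ts.

    Each term exp(i*t*x) lies on the unit circle, so the average always
    exists, no matter how heavy-tailed the data are.
    """
    return np.exp(1j * np.outer(ts, x)).mean(axis=1)

x = rng.standard_cauchy(size=100_000)   # no MGF, but its CF is exp(-|t|)

ts = np.array([0.5, 1.0, 2.0])
print(np.abs(ecf(x, ts)))               # ~ exp(-|t|) = [0.607, 0.368, 0.135]
```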

Now a distance like the one defined above for the MGF will work for every distribution. That is, we can define $$d(p, q_t) = \E_{\mb{t}} \left[ | \varphi_p(\mb{t}) - \varphi_{q_t}(\mb{t}) |^2 \right]$$ and once again get a measure that vanishes if and only if $p = q_t$, with the added advantage that we can calculate it for any distributions $p$ and $q_t$. Perhaps most important from a practical standpoint, we can also estimate $d(p, q_t)$ from data samples alone, since it only requires averaging over the data.7 That means that, without even knowing the form of $p$ or $q_t$, we have enough information to play the “Hot or Cold” game and steer $q_t$ toward $p$, at least in principle.
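
Here is a sketch of the whole pipeline, with $\mb{t}$ drawn from a Gaussian (an arbitrary but common choice; see note 7) and Gaussian samples standing in for the target and the model. Sliding the model’s mean toward the target’s shows the “Hot or Cold” behavior we wanted, since the estimated distance shrinks as $q_t$ approaches $p$:

```python
import numpy as np

rng = np.random.default_rng(7)

def cf_distance(x, y, n_t=256, scale=1.0):
    """Estimate E_t |phi_p(t) - phi_q(t)|^2 with t ~ N(0, scale^2 * I).

    x, y: samples of shape (n, d). Everything is computed from data alone;
    no knowledge of the underlying p or q_t is needed.
    """
    d = x.shape[1]
    ts = rng.normal(scale=scale, size=(n_t, d))
    phi_x = np.exp(1j * ts @ x.T).mean(axis=1)   # ECF of x at each t
    phi_y = np.exp(1j * ts @ y.T).mean(axis=1)   # ECF of y at each t
    return np.mean(np.abs(phi_x - phi_y) ** 2)

x = rng.normal(loc=1.0, size=(5_000, 2))         # "target" sample ~ p

# Hot or Cold: as the model's mean slides toward the target's, we get warmer.
for loc in [4.0, 2.0, 1.0]:
    y = rng.normal(loc=loc, size=(5_000, 2))
    print(loc, cf_distance(x, y))
```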

Despite all the legwork we’ve done so far, we have only scratched the surface of statistical distances, and we haven’t yet covered how they directly apply to the task of IGM. In particular, we don’t yet know how to leverage this distance information to decide on the best route to send the distribution $q_t$ toward the distribution $p$. We will expand on this topic in the next post, where we’ll discuss the role of classifiers in establishing distances between distributions, briefly cover the Kullback-Leibler divergence and its cousins, and introduce something we will refer to as the difference in scores, a quantity that will pop its head up over and over again in our exploration of the mechanics of IGM.

Notes and References
  1. You can either think of $d$ as a list containing the height, width, and channel depth of the image or as the single number $d = 3145728$ after multiplying everything out. This latter (and common) choice represents the data as a vector, which is what you would get if you reshaped the image by stacking all the pixels together in a single column.
  2. Probability density is something that comes up with continuous (versus discrete) data. It is equal to probability mass per unit volume. You actually have to integrate the probability density over a certain area/volume to get the probability of that area/volume, just as you would have to know the volume of a substance to calculate its mass from its density. But since the density is proportional to the probability, for intuition’s sake, it may be best to just think of it as a probability.
  3. If I could make the calculation, I would have a generative model in the old statistical sense of the word. See my previous post.
  4. More specifically, not “significantly” far apart.
  5. Technically, on unbounded intervals, the collection of moments does not always uniquely determine the distribution, an indeterminacy associated with the Hamburger moment problem. However, this is of no real practical concern to us, since other methods of collecting the moments, which are introduced in this and subsequent posts, do uniquely identify a distribution.
  6. For $k \geq 3$, this operator returns a $\overbrace{d \times d \times \cdots \times d}^{k}$ tensor, and here things start to get more complicated. The indices $\iota_i$ tell you which components of $\mb{t}$ you are taking the derivative with respect to. For example, if $k = 3$, you have a $d \times d \times d$ tensor whose $(i,j,l)$ entry is $\frac{\partial^3 M_p(\mb{t})}{\partial t_i \partial t_j \partial t_l}$, which is found by setting $\iota_i = \iota_j = \iota_l = 1$ and every other index to zero. (Confused yet?) At any rate, all of the data moments are indeed encoded in the MGF, even if they are a painful chore to recover via differentiation.
  7. In fact, if we draw $\mb{t}$ from a normal distribution, the quantity $d(p, q_t)$ defined using the characteristic function is equivalent to something called the maximum mean discrepancy (MMD), whose derivation comes from the functional analysis literature. We will explore the MMD in a future post.
