Although I run the risk of abusing the metaphor when it comes to describing implicit generative modeling (IGM) as alchemy, I remain fond of this viewpoint of using IGM to transform digital “lead” (the cheap and plentiful source data that we can easily sample) into digital “gold” (the precious target data that we want to create more of). The various loss functions being optimized in different IGM implementations—GANs, VAEs, diffusion models, etc.—play the role of the *philosophers’ stone* whose purpose is to catalyze this transformation.

In this post, I will discuss what I think are several interesting insights into IGM alchemy that are contained in a paper I recently published.^{1} I will package and present them a bit differently here, taking advantage of the looser format that a blog permits. The main focus, both here and in the paper, is on the *score difference* between the distribution you want your data to follow and the distribution your data *currently* follow. The score difference defines the best direction for your source data to head in order to get to the target.

I show that you can entirely circumvent working with the original source and target distributions in favor of working with convenient proxy distributions that result from adding noise to the data, since aligning these proxies aligns the original distributions essentially for free. I also discuss theoretical connections between the score difference and other IGM approaches—such as diffusion models, GANs, and kernel-based variational inference methods—each of which individually addresses different challenges of the so-called *generative modeling trilemma* but address them *all* when taken together.

\( \newcommand{\b}{\mathbf} \newcommand{\D}{\mathrm{d}} \newcommand{\R}{\mathbb{R}} \)

### Pushing Digital Lead Toward Digital Gold

The most straightforward way to consider lead-into-gold alchemy is to work directly with the raw materials. That is, we have a lump of lead, and we want to perform some type of work on it to turn it into gold. Let’s say that our “lead” is represented by $\b{y} \sim q.$^{2} Here $q$ is some source distribution that is easy to sample from, such as the standard Gaussian or the uniform distribution in $\R^d.$ Our “gold” is represented by $\b{x} \sim p$, where $p$ is some distribution whose form is typically unknown to us. It could be the distribution of certain types of images, for instance, whose only representation is in the finite set of examples we have been able to collect.

Sampling from $p$ is easy in a sense, in that all we have to do is randomly draw one of the examples we collected. However, we only have so many of these examples, and we’ll eventually exhaust our supply and start drawing duplicates. What we want is to create an *inexhaustible* supply of samples from $p$, despite our limited set of examples. To do that, we want to sample from something truly inexhaustible, such as random Gaussian or uniform noise, and work some magic on it to make it look like it came from $p.$

There is a process that is guaranteed to produce these samples, but it requires a bit of patience. You start by sampling $\b{y}_0 \sim q$ (the cheap stuff) and “push” that sample in the direction of $p$ (the good stuff): $$\b{y}_{t+1} = \b{y}_t + \frac{\gamma(t)^2}{2} \nabla_{\b{y}_t} \log p(\b{y}_t) + \gamma(t) \b{\epsilon}.$$ Here $\nabla_{\b{y}_t} \log p(\b{y}_t)$ is the gradient of the log density of $p$ evaluated at the current location of the sample, $\b{y}_t.$ This is also known as the *score* function of the density $p$,^{3} and it points in the direction of the greatest increase in the probability density $p$ from any input point within its support.^{4} The term $\gamma(t)$ is a positive weighting term that should vanish as $t \to \infty$, and $\b{\epsilon} \sim \mathcal{N}(\b{0},\b{I})$ is standard spherical Gaussian noise.
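This update is easy to sketch in code for a toy target whose score we know in closed form. Below is a minimal NumPy illustration with a standard Gaussian target (so the score is simply $-\b{y}$); the decaying step schedule is an assumed choice for the demo, not something prescribed by LMC itself.

```python
import numpy as np

def lmc_sample(score, y0, n_steps=2000, gamma0=0.3, decay=1e-3):
    """Langevin Monte Carlo: y_{t+1} = y_t + (gamma_t^2 / 2) * score(y_t) + gamma_t * eps."""
    rng = np.random.default_rng(0)
    y = np.array(y0, dtype=float)
    for t in range(n_steps):
        gamma = gamma0 / (1.0 + decay * t)  # assumed slowly decaying step schedule
        y = y + 0.5 * gamma**2 * score(y) + gamma * rng.normal(size=y.shape)
    return y

# Toy target: standard Gaussian in 1-D, whose score is -y.
# Start from "cheap" uniform noise and push it toward the target.
samples = lmc_sample(score=lambda y: -y,
                     y0=np.random.default_rng(1).uniform(-5.0, 5.0, 5000))
```

After enough steps, the pushed samples are statistically indistinguishable (up to discretization error) from draws from the target.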

The process described above is known as *Langevin Monte Carlo* (LMC). It is the discrete version of a more general continuous *Itô diffusion* given by the stochastic differential equation (SDE) $$\D \b{y} = \b{\mu}(\b{y},t) \D t + \gamma(t) \D \b{\omega},$$ where $\b{\mu}: \R^d \to \R^d$ is a deterministic *drift* term and $\gamma(t)$ is a *diffusion* coefficient that controls the random influence of a standard Wiener process (Brownian motion) represented by $\b{\omega}.$

LMC sets the drift term to $\b{\mu}(\b{y}_t,t) = \frac{\gamma(t)^2}{2} \nabla_{\b{y}_t} \log p(\b{y}_t)$, which causes a sample to drift in the direction of the highest increase in the probability under the target distribution, $p.$ The diffusion term keeps the sample from collapsing to local maxima (i.e. *modes*) in the probability density by introducing enough jitter to explore the full support of the distribution.^{5}

The process of pushing samples from $q$ until they become samples from $p$ is *reversible*. That is, one can take samples from $p$ and turn them into samples from $q.$ This is perhaps not surprising if we consider simply running the LMC process again, this time using $q$ as the target and drawing the initial sample $\b{x}_0$ from $p.$ So if $q$ is the standard Gaussian, then the process described above becomes $$\b{x}_{t+1} = \b{x}_t - \frac{\gamma(t)^2}{2} \b{x}_t + \gamma(t) \b{\epsilon} = \left(1 - \frac{\gamma(t)^2}{2} \right) \b{x}_t + \gamma(t) \b{\epsilon},$$ since $\nabla_{\b{x}_t} \log \mathcal{N}(\b{0},\b{I}) = -\b{x}_t.$ (See the next section before you get too sold on this procedure, even though it works.)

Recalling that $\gamma(t)$ is decreasing and vanishes as $t \to \infty$, we see that we are gradually turning up a “fader switch” on $\b{x}_t$ and fading out the additive Gaussian noise over time. This makes sense, as $\b{x}_t$ is becoming “more Gaussian” over time and should therefore have greater influence as the process unfolds. However, for those familiar with diffusion models, this process might look strange. After all, the forward noising process in diffusion models, which destroys the structured data and leaves Gaussian noise in its place, features a *decreasing* role for $\b{x}_t$ over time, eventually fading it out completely: $$\b{x}_{t+1} = \sqrt{1 - \beta_t} \b{x}_t + \sqrt{\beta_t} \b{\epsilon},$$ where $\beta_t$ is monotonically *increasing* in $[0,1].$ But this process also makes sense, because we are fading out the data and fading in random noise from the Gaussian distribution that serves as our target.
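The forward noising recursion takes only a few lines to simulate. In the sketch below, the linear $\beta_t$ schedule is an assumed choice (schedules vary across the diffusion literature), and the bimodal “data” is just a stand-in for structured samples that look nothing like a Gaussian.

```python
import numpy as np

def forward_noise(x0, n_steps=1000, beta_min=1e-4, beta_max=0.02, seed=0):
    """Diffusion forward process: x_{t+1} = sqrt(1 - beta_t) x_t + sqrt(beta_t) eps.
    Since beta_t increases, the data fades out and Gaussian noise fades in."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(beta_min, beta_max, n_steps)  # assumed linear schedule
    x = np.array(x0, dtype=float)
    for beta in betas:
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.normal(size=x.shape)
    return x

# Structured "data": a bimodal mixture, far from a standard Gaussian.
rng = np.random.default_rng(1)
x0 = np.concatenate([rng.normal(-4, 0.2, 4000), rng.normal(4, 0.2, 4000)])
xT = forward_noise(x0)  # by the final step, approximately standard Gaussian
```

By the end of the schedule, essentially no trace of the bimodal structure remains.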

In other words, there are at least two valid ways of getting from $p$ to $q$, each one with nearly the opposite strategy of the other. The first turns up the data and turns down the noise, and the second turns down the data and turns up the noise. And both processes arrive at their target destination, the standard Gaussian distribution.

### Enter the Score-Difference Flow

There are even more ways to get from $p$ to $q$ and to get from $q$ to $p.$ The difference is what the *intermediate* distributions look like along the way. If we are pushing $p$ toward $q$, we would indicate the intermediate stage of this evolution by $p_t$, and $p_{\infty} = q.$ Similarly, if we are pushing $q$ toward $p$, we would write $q_t$ for the intermediate distributions, and $q_{\infty} = p.$

It turns out that the naïve approach of running the LMC procedure using $q$ as the target and $p$ as the source, while effective, is not the same as truly *reversing* the LMC procedure with $q$ as the source and $p$ as the target. That is, if you compare the intermediate distributions at various times $t$ along the way, they won’t be the same in both procedures. You might ask why we should care about the intermediate distributions as long as we reach the desired target. That’s a good question, but considering a true reversal process winds up having some theoretical advantages, which we’ll discover below.

There is indeed a procedure that reverses the forward LMC process and maintains the same marginal intermediate distributions along the way, like playing the movie of the forward process in reverse.^{6} This process is described by the SDE $$\D \b{x} = \left[ \b{\mu}(\b{x},t) - \gamma(t)^2 \nabla_{\b{x}} \log q_t(\b{x}) \right] \D t + \gamma(t) \D \hat{\b{\omega}},$$ where now time is running in *reverse* (just flip the sign to keep the clock running in the usual direction) and $\hat{\b{\omega}}$ indicates a reversed-time Wiener process. The details are less important than recognizing that the reversal of a diffusion is *another* diffusion.

A remarkable fact is that there is a *deterministic* process that maintains the same intermediate distributions. It looks pretty much the same as the stochastic version with the exception of shaving off the diffusion term and altering the weight of the score of the intermediate distribution by a factor of two. It is given by the *probability flow* ordinary differential equation $$\D \b{x} = \left[ \b{\mu}(\b{x},t) - \frac{\gamma(t)^2}{2} \nabla_{\b{x}} \log q_t(\b{x}) \right] \D t.$$

Recalling from earlier that in LMC we have $\b{\mu}(\b{x},t) = \frac{\gamma(t)^2}{2} \nabla_{\b{x}} \log p(\b{x})$,^{7} we have $$\D \b{x} = \frac{\gamma(t)^2}{2} \left[ \nabla_{\b{x}} \log p(\b{x}) - \nabla_{\b{x}} \log q_t(\b{x}) \right] \D t.$$ This defines the *score-difference* (SD) flow between $p$ and $q$ in reverse time and the SD flow between $q$ and $p$ in forward time. This is a *deterministic* process for aligning a source distribution with a target distribution. Despite lacking a diffusion term, it avoids collapsing into the modes of the target distribution by leveraging the score of the intermediate distribution. Intuitively, it works by balancing the goals of moving *toward* the target and moving *away* from the source.
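To see the SD flow in action, here is a toy particle simulation in which both scores are available in closed form. The target is Gaussian, and the intermediate score $\nabla \log q_t$ is approximated by fitting a Gaussian to the current particles, which is reasonable in this example because the flow map stays affine; the step size, particle count, and target parameters are assumed demo choices.

```python
import numpy as np

def sd_flow_gaussian(y, target_mean, target_var, n_steps=500, step=0.05):
    """Deterministic SD flow dy = c * [score_p(y) - score_qt(y)] dt for a
    1-D Gaussian target, estimating score_qt by a Gaussian fit to the particles."""
    y = np.array(y, dtype=float)
    for _ in range(n_steps):
        m_t, v_t = y.mean(), y.var()
        score_p = -(y - target_mean) / target_var  # target score (closed form)
        score_q = -(y - m_t) / v_t                 # intermediate score (Gaussian fit)
        y = y + step * (score_p - score_q)
    return y

# Push an N(0, 1) source toward an N(3, 0.25) target -- no noise injection needed.
y = sd_flow_gaussian(np.random.default_rng(0).normal(0.0, 1.0, 5000),
                     target_mean=3.0, target_var=0.25)
```

Note that the particles neither collapse to the target mode nor require any added jitter: the repulsive $-\nabla \log q_t$ term preserves the spread.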

### The Best Way from ~~A to B~~ $q$ to $p$

Is there a *best* way to move one distribution toward another? When we’re traveling from one place to another, we often consider the *fastest* route to be the best one. That means that we are covering the most distance per unit time. Recall that when we are talking about probability distributions, we can leverage the concept of a *statistical distance* to describe their separation. So we’re ultimately looking for a route that reduces a statistical distance between the source and target distributions as fast as possible.

The Kullback-Leibler (KL) divergence is perhaps the most famous of the statistical distances (despite not being a true *distance* in the mathematical sense). I won’t go into the details here, as they are in the paper, but it turns out that the SD flow defines a path that optimally reduces the KL divergence between $q_t$ and $p$ at each step and is guaranteed not to increase the KL divergence.

*The SD flow solves the Schrödinger bridge problem. Unlike diffusion models, which require a Gaussian prior distribution, the SD flow can interpolate between any two probability distributions.*

*Here the interpolation is between the “question mark” distribution and the “Swiss roll” distribution in $\R^3.$*

The SD flow also solves something known as the *Schrödinger bridge problem*. A Schrödinger bridge defines the most likely evolution between two probability distributions given a reference evolution. We can think of this reference evolution as the forward process of destroying data to produce pure noise such as in the case of diffusion models, with our ultimate goal being to reverse this process. However, unlike diffusion models, Schrödinger bridges define evolutions between *arbitrary* distributions, such as in the example shown above, with no restrictions placed on either the source or target distribution.

The literature on Schrödinger bridges and optimal transport is fairly complicated, even in the applied IGM setting. However, showing that the SD flow solves the Schrödinger bridge problem is relatively easy. The details are in the paper, but the basic takeaway is that solutions to the Schrödinger bridge problem follow SDEs that look just like the diffusion-reversal SDEs given above, albeit with *potentials* in the place of the intermediate densities. But if you swap the intermediate densities in for the potentials—which you can, since the densities meet the conditions required for the potentials—then the SD flow naturally emerges from the resulting dynamics.

### Proxies Are Enough

So how does one *apply* SD flow? Earlier, I mentioned that a disadvantage of Langevin Monte Carlo is the requirement that we have access to the score of the target distribution, which we almost never do in any interesting IGM cases, such as modeling natural images. And now SD flow asks not only for *two* scores to be specified but also that one of them (corresponding to the intermediate distribution) be constantly changing over time. How is that situation any better?

A variety of methods exist in the literature for estimating score functions from data, most of them flavors of *score matching*. A clever derivation shows that a parametric score function can indeed be directly learned from data, although its practical use in high dimensions is limited by the need to compute the Hessian of the log density. Other methods exploit a connection between score matching and denoising autoencoders. We will follow the denoising approach, but with one crucial difference.

When you add independent random noise to data from a distribution $p$, the noisy data that results follows a new distribution that is the *convolution* of the data distribution and the noise distribution. So, if we add Gaussian noise with variance $\sigma^2$ to data $\b{x} \sim p$, we get a new variable $\b{z} = \b{x} + \sigma \b{\epsilon}$, where $\b{\epsilon} \sim \mathcal{N}(\b{0},\b{I}).$ This variable is distributed as $$p(\b{z}; \sigma) = \int_{\R^d} p(\b{x}) \mathcal{N}(\b{z}; \b{x}, \sigma^2 \b{I}) \D \b{x} = \mathbb{E}_{\b{x} \sim p} \left[ \mathcal{N}(\b{z}; \b{x}, \sigma^2 \b{I}) \right].$$ We will call this distribution over $\b{z}$ that results from corrupting the distribution over $\b{x}$ a *proxy distribution* for the target distribution $p.$ We can define a proxy distribution for the source distribution $q$ in exactly the same way: $$q(\b{z}; \sigma) = \mathbb{E}_{\b{y} \sim q} \left[ \mathcal{N}(\b{z}; \b{y}, \sigma^2 \b{I}) \right].$$

Specifying the score as the gradient with respect to the “proxy” variable $\b{z}$ is straightforward, since we can pass the gradient inside the expectation over the original distributions. When we go through the necessary calculations for $p$, we get $$\nabla_{\b{z}} \log p(\b{z}; \sigma) = \frac{\nabla_{\b{z}} p(\b{z}; \sigma)}{p(\b{z}; \sigma)} = \frac{1}{\sigma^2} \left(\frac{\mathbb{E}_{\b{x} \sim p} \left[ \mathcal{N}(\b{z}; \b{x}, \sigma^2 \b{I}) \b{x} \right]}{\mathbb{E}_{\b{x} \sim p} \left[ \mathcal{N}(\b{z}; \b{x}, \sigma^2 \b{I}) \right]} - \b{z} \right).$$ When we recognize the first term in the parentheses as $\mathbb{E}[\b{x} | \b{z}]$, we have $$\nabla_{\b{z}} \log p(\b{z}; \sigma) = \frac{1}{\sigma^2} \left( \mathbb{E}[\b{x} | \b{z}] - \b{z} \right),$$ which is simply a rearrangement of Tweedie’s formula. The score of $q(\b{z}; \sigma)$ is derived identically, and when we calculate the score difference, the $\b{z}$ terms cancel, and we get $$\nabla_{\b{z}} \log p(\b{z}; \sigma) - \nabla_{\b{z}} \log q(\b{z}; \sigma) = \frac{1}{\sigma^2} \left( \mathbb{E}[\b{x} | \b{z}] - \mathbb{E}[\b{y} | \b{z}] \right).$$ The expression inside the parentheses above represents the difference of the *expected clean signals given the dirty signals* under the target and source distributions, respectively. In other words, it is the difference of two *denoisers*, one optimized for $p$ and the other optimized for $q_t$: $$\nabla_{\b{z}} \log p(\b{z}; \sigma) - \nabla_{\b{z}} \log q(\b{z}; \sigma) = \frac{1}{\sigma^2} \left[ D_p(\b{z}; \sigma) - D_{q_t}(\b{z}; \sigma) \right].$$
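Tweedie’s formula can be sanity-checked in closed form when $p$ itself is Gaussian, since then both the smoothed score and the posterior mean $\mathbb{E}[\b{x} \mid \b{z}]$ have explicit expressions. The parameter values below are arbitrary toy choices.

```python
import numpy as np

# Tweedie check for p = N(mu, s2) corrupted by N(0, sigma2) noise (all closed form).
mu, s2, sigma2 = 1.5, 0.8, 0.4  # assumed toy parameters
z = np.linspace(-4.0, 6.0, 101)

# The proxy is p(z; sigma) = N(mu, s2 + sigma2), so its score is:
score = -(z - mu) / (s2 + sigma2)

# Gaussian posterior mean of the clean signal given the noisy observation:
posterior_mean = (sigma2 * mu + s2 * z) / (s2 + sigma2)

# Tweedie's formula: score of the proxy = (E[x | z] - z) / sigma2.
tweedie = (posterior_mean - z) / sigma2
```

The two expressions agree exactly, confirming that the proxy score is just a rescaled denoising residual.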

We now have the ingredients needed to align the proxy distributions $p(\b{z}; \sigma)$ and $q(\b{z}; \sigma)$ via the SD flow. But isn’t our goal to align the *original* distributions, $p$ and $q$? How do the proxies help us? One insight of this section is reflected in its heading: *proxies are enough* to do the job. I’ll next describe why that’s the case.

Recall from above that when you add independent random noise to data (Gaussian in our case), you form a new proxy distribution that is the convolution of the noise and data distributions: $\tilde{p} = p \ast \mathcal{N}.$ And the characteristic function of that new distribution will be the *product* of the characteristic functions of the noise and data distributions: $\varphi_{\tilde{p}}(\b{t}) = \varphi_{p}(\b{t}) \varphi_{\mathcal{N}}(\b{t}).$ If the characteristic functions of two distributions are equal at every argument, then the distributions are equal: $\varphi_p(\b{t}) = \varphi_q(\b{t}) \; \forall \b{t} \iff p = q.$ Since the characteristic function of the Gaussian distribution never vanishes for finite arguments, we can cancel it from each side, meaning $$\varphi_{\tilde{p}}(\b{t}) = \varphi_{\tilde{q}}(\b{t}) \iff \varphi_{p}(\b{t}) = \varphi_{q}(\b{t})$$ or $\tilde{p} = \tilde{q} \iff p = q.$ That is, **aligning the proxies is equivalent to aligning the original source and target distributions**.
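The factorization of the characteristic function is easy to check by Monte Carlo: the empirical characteristic function of the noisy samples should match that of the clean samples multiplied by the Gaussian characteristic function $\exp(-\sigma^2 t^2 / 2)$. The exponential “data” distribution and the test arguments below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(1.0, 200_000)         # clean data from an assumed toy p
sigma = 0.7
z = x + sigma * rng.normal(size=x.shape)  # proxy samples from p-tilde = p * N(0, sigma^2)

t = np.array([0.5, 1.0, 2.0])             # a few characteristic-function arguments

def phi(samples):
    """Empirical characteristic function E[exp(i t X)] at each t."""
    return np.exp(1j * np.outer(t, samples)).mean(axis=1)

phi_z = phi(z)
phi_product = phi(x) * np.exp(-0.5 * sigma**2 * t**2)  # phi_p(t) * phi_N(t)
```

Up to Monte Carlo error, the two agree, which is the empirical face of the cancellation argument above.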

An additional insight is that we can directly apply the dynamics prescribed for the proxy variable $\b{z}$ to the clean variable $\b{y}$ from the source distribution to move it closer to the target. But why?

The SD flow on the proxy distributions prescribes a direction $\Delta \b{z}_t$ to move $\b{z}$ at time $t$: $\b{z}_t + \Delta \b{z}_t = \b{z}_{t+1} \sim q_{t+1}(\b{z}; \sigma).$ But there is a *second* way to get to $\b{z}_{t+1}$, which is to first update the *clean* data $\b{y}_t$ to form $\b{y}_{t+1}$ and then add noise to it: $$\b{z}_{t+1} = \b{y}_{t+1} + \sigma \b{\epsilon} = \b{y}_t + \Delta \b{y}_t + \sigma \b{\epsilon}.$$ But this relies on the adjustment to the clean data, $\Delta \b{y}_t$, which we don’t know.

We made $\b{z}_t$ by adding noise to $\b{y}_t$ at time $t.$ If we knew the update $\Delta \b{y}_t$, we could have *saved* this noise for use at time $t+1.$ Think of the noise as coming from a queue, like numbers written on a shuffled deck of cards. We either draw a card at time $t$ and use it to make $\b{z}_t$ from $\b{y}_t$ *before* applying the update $\Delta \b{z}_t$ or we save it and draw it at time $t+1$ to make $\b{z}_{t+1}$ from $\b{y}_{t+1}$ *after* applying the update $\Delta \b{y}_t.$ But it’s the same card from the top of the deck in both cases, with $\b{\epsilon}$ written on it.

Putting it all together, we have two equivalent expressions for $\b{z}_{t+1}$, $$\b{y}_t + \Delta \b{y}_t + \sigma \b{\epsilon} = \b{y}_t + \sigma \b{\epsilon} + \Delta \b{z}_t,$$ which, when you cancel like terms, shows that $\Delta \b{y}_t = \Delta \b{z}_t.$ That is, even when we focus on aligning the proxy data, we align the original source and target data with no additional effort.
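The card-deck argument is trivial to verify numerically: updating the noisy copy, or updating the clean sample first and then adding the *same* noise, lands on the same $\b{z}_{t+1}.$ Here `delta` is just a stand-in for whatever update the SD flow would prescribe.

```python
import numpy as np

rng = np.random.default_rng(0)
y_t = rng.normal(size=1000)          # clean source samples
eps = rng.normal(size=1000)          # the "card" drawn from the noise deck
sigma = 0.5
delta = 0.1 * rng.normal(size=1000)  # stand-in for the SD-flow update

z_next_via_proxy = (y_t + sigma * eps) + delta  # update the noisy copy z_t
z_next_via_clean = (y_t + delta) + sigma * eps  # update y_t, then add the saved noise
```

Both routes produce the identical $\b{z}_{t+1}$, which is why the update computed on the proxies can be pushed directly into the clean samples.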

From a practical standpoint, our representations of the original target and (intermediate) source distributions, $p$ and $q_t$, are given by a finite set of samples. Our proxy distributions are represented by working copies of these samples, to which we add noise to form representations of $\tilde{p}$ and $\tilde{q}_t.$ We use those working copies to calculate the update direction to the source data prescribed by the SD flow and push that update into our *clean* samples from $q_t.$^{8} We then discard the working copies and repeat the process in step $t+1.$

### Connections to Other IGM Methods

#### Diffusion Models

The fact that the SD flow is equivalent to the difference in the outputs of two denoisers immediately suggests a connection to denoising diffusion models. Of the two denoisers, one is trained on the target data from the distribution $p$, while the other is trained on the *intermediate* distribution $q_t.$ That intermediate denoiser would, in principle, need to be retrained at every time step $t.$ This is not necessarily any more onerous than the iterative retraining of a discriminator-generator pair in GAN training, but it does require some effort.

We might instead recognize that in forming the variable $\b{z}_t = \b{y}_t + \sigma \b{\epsilon}$, we actually observe the clean signal $\b{y}_t$ before corrupting it. We can then leverage this information to form the approximation $D_{q_t}(\b{z}_t, \sigma) \approx \b{y}_t.$^{9} I show in the paper that when one makes this substitution, the approximate version of SD flow that results is equivalent to the reverse process in denoising diffusion.^{10}

#### Generative Adversarial Networks (GANs)

In a previous post, I presented an alternative perspective on GANs, which showed that the training of the GAN generator breaks into two subproblems,^{11} one of which implicitly induces dynamics on the generated data to create new targets for the generator, while the other fits the generator to the dynamically perturbed points via regression. This is a flow induced by the loss, such that if $\b{y} = g_{\theta}(\b{\epsilon})$ is the output of a parametric generator that takes random Gaussian noise as input, then $$\D \b{y} = -\nabla_{\b{y}} \mathcal{L},$$ where $\mathcal{L}$ is the generator loss.

For a variety of generator losses, including the widely used non-saturating loss, the induced flow is in the direction of the score difference … assuming the discriminator is optimal, which it essentially never is in practice. In other words, the SD flow emerges under often unrealistic ideal conditions in GANs but has a much easier specification under the denoiser-based definition. Relative to the GAN discriminator’s task of learning the exact density ratio, the denoising problem is comparatively easy to solve, since the ground truth is better specified in the training data.

The key point is that SD flow provides an alternative target-creation process relative to the discriminator-mediated process in traditional GANs. That is, during generator training, one can swap in the diffusion-like SD flow process to update the generator with improved targets at each step, which will cover the target distribution better than is possible when using a discriminator to learn the distribution. Once completely trained, the generator would then be able to create output in a single inference step. This leveraging of SD flow potentially addresses all three challenges of the so-called *generative modeling trilemma*: high sample quality, mode coverage, and fast sampling.

#### MMD Gradient Flow, Stein Variational Gradient Descent, and Other Kernel-Based Methods

As derived in the paper, SD flow also admits a kernel-based definition: $$\nabla_{\b{z}} \log \frac{p(\b{z}; \sigma)}{q_t(\b{z}; \sigma)} = \frac{1}{\sigma^2} \left(\frac{\mathbb{E}_{\b{x} \sim p} \left[ K_{\sigma}(\b{z}, \b{x}) \b{x} \right]}{\mathbb{E}_{\b{x} \sim p} \left[ K_{\sigma}(\b{z}, \b{x}) \right]} - \frac{\mathbb{E}_{\b{y} \sim q_t} \left[ K_{\sigma}(\b{z}, \b{y}) \b{y} \right]}{\mathbb{E}_{\b{y} \sim q_t} \left[ K_{\sigma}(\b{z}, \b{y}) \right]} \right),$$ where $K_{\sigma}$ represents the Gaussian kernel, $$K_{\sigma}(\b{z}, \b{x}) = \exp \left( -\frac{\| \b{z} - \b{x} \|^2}{2 \sigma^2} \right).$$ This definition has some similarities with both maximum mean discrepancy (MMD) gradient flow and Stein variational gradient descent (SVGD), both of which are compared against SD flow in the paper.^{12}
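Here is a self-contained particle sketch of this kernel form in one dimension: each noisy proxy point is moved by the difference of two kernel-weighted means (self-normalized estimates of $\mathbb{E}[\b{x} \mid \b{z}]$ and $\mathbb{E}[\b{y} \mid \b{z}]$), and per the noise-reuse argument the update is applied to the clean particles. The bandwidth, step size, and iteration count are assumed demo choices, not values from the paper.

```python
import numpy as np

def sd_flow_kernel(y, x_target, sigma=1.0, n_steps=200, step=0.5, seed=0):
    """Kernel SD flow: Delta y = step * (E[x|z] - E[y|z]) with Gaussian-kernel
    weighted means over the target samples and the current source particles."""
    rng = np.random.default_rng(seed)
    y = np.array(y, dtype=float)

    def kernel_mean(z, pts):
        # Self-normalized weighted mean: sum K(z, p) p / sum K(z, p), row-wise in z.
        logw = -0.5 * (z[:, None] - pts[None, :])**2 / sigma**2
        w = np.exp(logw - logw.max(axis=1, keepdims=True))  # numerically stabilized
        return (w * pts).sum(axis=1) / w.sum(axis=1)

    for _ in range(n_steps):
        z = y + sigma * rng.normal(size=y.shape)  # noisy proxy copies (discarded each step)
        y = y + step * (kernel_mean(z, x_target) - kernel_mean(z, y))
    return y

# Push N(0, 1) source particles toward samples from an N(3, 1) target.
rng = np.random.default_rng(1)
y_final = sd_flow_kernel(rng.normal(0.0, 1.0, 400), rng.normal(3.0, 1.0, 400))
```

Note that the target samples are never altered; only the source particles move, and the noisy working copies are regenerated at every step.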

Although not directly addressed in the paper, the kernel-based definition of SD flow also invites comparison to kernel-inducing generative modeling approaches that have theoretical duality with certain physical processes. Among these are the Poisson flow generative model and its variants, which approach the IGM problem from an electrostatics perspective via the dynamics induced by Coulomb’s law extended to higher dimensions. This law is described by a form of Poisson’s equation, a partial differential equation that also describes Newton’s law of gravitation.^{13} Poisson’s equation induces a kernel $K$ (known as the Green’s function of the partial differential equation) that weights the contribution of data points inversely proportional to a power of their distance: $\log K(\b{x},\b{y}) \propto -\log \| \b{x} - \b{y} \|.$

The Gaussian kernel, which we use in the kernel-based definition of SD flow, is induced by a different partial differential equation, the heat equation,^{14} which creates a different physical interpretation of the prescribed dynamics. Nevertheless, there may very well be deeper connections to these physically inspired IGM methods, which could be an interesting avenue for future work.

### Summary

- It is possible to understand the process of implicit generative modeling (IGM) by considering the *dynamics* imposed—either explicitly or implicitly—on the source data that you are trying to transform into target data.
- The *score-difference* (SD) flow prescribes optimal dynamics for pushing a source distribution toward a target distribution.
- It is possible to align any two distributions via the *Schrödinger bridge* prescribed by the SD flow.
- Alignment of source and target distributions is possible by working entirely with *proxy distributions* that are easy to create and analyze.
- SD flow has key connections to various other IGM methods, but typically when those methods are operating under unrealistically ideal conditions. By contrast, SD flow is easy to apply right out of the box.
- SD flow has the potential to unify various IGM approaches and address the challenges of the *generative modeling trilemma*.

##### Notes and References

1. Weber, R.M. (2023). “The Score-Difference Flow for Implicit Generative Modeling.” *Transactions on Machine Learning Research* (TMLR). ↩︎
2. The tilde here means “distributed as” or “sampled from.” ↩︎
3. The Stein score, to be precise, which differs a bit from how the score is defined in other statistical literature, where it is known as the Fisher score. The difference is that the Stein score is the gradient of the log density with respect to the vector in the space of the distribution’s support, while the Fisher score is the gradient of the log density (likelihood) with respect to the distribution’s parameter vector. ↩︎
4. The *support* of a distribution $p$ is all $\b{x}$ such that $p(\b{x}) > 0.$ ↩︎
5. Consider a unimodal distribution such as the standard normal. This distribution has a single maximum, which the score will point toward from any point. Without the diffusion term, all samples would drift toward this single maximum. For more general multimodal distributions, the samples would drift toward whichever mode they are closest to. ↩︎
6. But not exactly. The reverse diffusion process is still stochastic, so while the intermediate distributions will be the same at all times $t$ in the forward and backward processes in a *global* sense, on a point-by-point basis there will be differences. There is a deterministic version of this process, introduced shortly, that makes this analogy more accurate. ↩︎
7. The variable was originally denoted by $\b{y}_t$, which is a placeholder that we are free to exchange with any other variable we choose. ↩︎
8. The samples from $p$, which provide the representation of the target distribution, are never altered. ↩︎
9. This is just an approximation, though, which will be more accurate for smaller values of $\sigma.$ For large values of $\sigma$, the actual value of $D_{q_t}(\b{z}_t, \sigma)$ approaches the mean, $\bar{\b{y}}.$ ↩︎
10. In the literature on diffusion models, the reverse process is run in *reverse time*, which means that the time index $t$ is getting smaller during sampling. By contrast, SD flow runs this process in *forward* time. It makes no practical difference, but it could be a potential source of confusion. ↩︎
11. This idea is also developed in this preprint of mine. ↩︎
12. While MMD gradient flow and SVGD failed to converge in a number of experimental conditions, SD flow converged in every case. ↩︎
13. This suggests that the physical inspiration could just as well have been gravity instead of electrostatics. ↩︎
14. However, the steady-state heat equation for a volume containing a source of heat is equivalent to Poisson’s equation. ↩︎
