Reverse engineering the Gumbel Max Trick
Intro
I recently learned about the so-called Gumbel-Max and Gumbel-Softmax tricks. Essentially, the Gumbel-Max trick says that if we have a categorical distribution $\vec{\pi} = (\pi_1, \ldots, \pi_K)$ and i.i.d. $\mathrm{Gumbel}(0, 1)$-distributed random variables $G_i, \; 1\le i\le K$, then $$ \forall k \quad \mathbb{P}(G_k + \log(\pi_k) = \max\{G_i + \log(\pi_i) \colon 1 \le i \le K\}) = \pi_k. $$ The Gumbel-Softmax trick is then a continuous relaxation of the above, which uses the fact that letting the temperature of a softmax go to $0$ gives the (one-hot encoding of the) argmax function.
Note that $\mathrm{Gumbel}(0, 1)$ is the distribution of $-\log(-\log(u))$, where $u \sim \mathcal{U}([0, 1])$ is uniformly distributed. So, instead of sampling from $\vec{\pi}$ directly, we can sample $K$ uniforms $u_1, \ldots, u_K$, compute $-\log(-\log(u_k))$ for each, add the log-probabilities (actually, it's enough to add the log-potentials, but I'll get to that) and take the argmax. This seems like more work than simply sampling from $\vec{\pi}$ (which is normally done by sampling a single $u \sim \mathcal{U}([0, 1])$ and checking between which two consecutive entries of $\operatorname{cumsum}(\vec{\pi})$ it lies). And indeed it is, if we want just one sample from $\vec{\pi}$. But this method of sampling has other benefits, like the ability to pre-compute the Gumbel samples, to avoid normalising $\vec{\pi}$ at every forward pass, or to perform reservoir sampling. But I don't want to focus on these comparisons now (if you're interested, I found some nice concise explanations in this blogpost).
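To make the comparison concrete, here is a minimal Julia sketch of both procedures (a toy example of my own; the hard-coded probs and the variable names are just for illustration):

# One draw from a categorical distribution, two ways.
probs = [0.1, 0.2, 0.3, 0.4]

# Standard inverse-CDF sampling: a single uniform compared against cumsum(probs).
u = rand()
sample_cdf = searchsortedfirst(cumsum(probs), u)

# Gumbel-Max sampling: K uniforms -> K Gumbels, add the log-probs, take the argmax.
gumbels = -log.(-log.(rand(length(probs))))
sample_gumbel = argmax(gumbels .+ log.(probs))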
These tricks are really neat, but I wondered how people came up with them. So I tried to reverse-engineer the Gumbel-Max trick and then (in a future post) write down some of the things I've learned about softmax and temperature annealing. All these things are well-known to most people who do probability theory for a living, but I thought a couple of blogposts might be useful for me to put my thoughts in order, and potentially for readers to follow along, so here goes.
Reverse-Engineering Gumbel-Max
Suppose we are given a categorical distribution $\vec{\pi} = (\pi_1, \ldots, \pi_K)$. We want to find independent, continuous real random variables $\vec{X} = (X_1, \ldots, X_K)$ which satisfy the condition
$$ \forall k \quad \mathbb{P}(X_k = \max\{X_i \colon 1 \le i \le K\}) = \pi_k. $$ One quick observation we can make is that if $\{X_1, \ldots, X_K\}$ satisfy this condition and $f$ is any strictly increasing function, then $\{f(X_1), \ldots, f(X_K)\}$ will also satisfy it, since applying the same strictly increasing function to all the variables doesn't change which one is largest. So we already see this is a really flexible condition. But anyway, let's actually find some $X$'s that do what we want.
Let $F_i$ denote the CDF of $X_i$, that is $F_i(t) = \mathbb{P}(X_i \le t)$, and let $f_i = F_i'$ denote the pdf (I will deliberately ignore mathematical subtleties around differentiability, continuity etc. in this post). Then, the above equations can be rewritten as: $$ \begin{aligned} \pi_k &= \mathbb{P}(X_k = \max\{X_i \colon 1 \le i \le K\}) \\ &= \int_{\mathbb{R}} \mathbb{P}(s \ge X_1, \ldots, s \ge X_K \vert X_k=s)f_k(s) \,\mathrm{d}s \\ &= \int_{\mathbb{R}}f_k(s)\prod_{i \neq k} F_i(s) \,\mathrm{d}s\,, \end{aligned} $$ where in the last equality we used the independence assumption.
Notice that on the right-hand side, we are integrating a positive function of $s$ from $- \infty$ to $\infty$ and getting the number $\pi_k$ as a result. If instead we integrated from $-\infty$ to some $t \in \mathbb{R}$, we would get $H_k(t)\pi_k$, where $H_k(t)$ is some number between $0$ and $1$ which increases with $t$ (by the positivity of the integrand), approaches $0$ as $t \rightarrow -\infty$ and approaches $1$ as $t \rightarrow \infty$. In other words, the function $H_k$ is a CDF for some continuous distribution.
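In symbols, the function we have just described is $$H_k(t) \colon= \frac{1}{\pi_k}\int_{-\infty}^{t} f_k(s)\prod_{i \neq k} F_i(s)\,\mathrm{d}s.$$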
At this point, in order to find concrete solutions to the above set of $K$ equations, we make the simplifying assumption that all these CDFs are equal, i.e. $H_1 = H_2 = \cdots = H_K =\colon H$. So we get the following integral equations: $$\forall k \quad \forall t \quad \int_{-\infty}^t\prod_{i \neq k} F_i(s)f_k(s) \,\mathrm{d}s = H(t) \pi_k.$$ Now, summing these over $1 \le k \le K$ and using the product rule $\sum_{k} f_k(s)\prod_{i \neq k} F_i(s) = \frac{\mathrm{d}}{\mathrm{d}s}\prod_{1\le i \le K} F_i(s)$ (together with $\sum_k \pi_k = 1$), we obtain: $$\int_{-\infty}^{t}\frac{\mathrm{d}}{\mathrm{d}s}\prod_{1\le i \le K} F_i(s) \,\mathrm{d}s = H(t),$$ from which we obtain $\prod_{1\le i \le K} F_i(t) = H(t)$ (the constant of integration is $0$: let $t \to \infty$ and use the fact that all functions involved are CDFs, so both sides tend to $1$).
On the other hand, differentiating both sides of the integral equations with respect to $t$ gives the functional equations (writing $h = H'$ for the density of $H$): $$\forall k \quad \forall t \quad \prod_{i \neq k} F_i(t)f_k(t) = h(t) \pi_k.$$
Substituting the fact that $\prod_{1\le i \le K} F_i(t) = H(t)$ into these, we obtain: $$ \begin{align} \frac{H(t)f_k(t)}{F_k(t)} & = h(t) \pi_k \\ \Leftrightarrow \quad\quad\quad\frac{f_k(t)}{F_k(t)} & = \frac{h(t)}{H(t)} \pi_k \\ \Leftrightarrow \quad\quad\quad\frac{\mathrm{d}}{\mathrm{d}t}\log(F_k(t)) & = \frac{\mathrm{d}}{\mathrm{d}t}\log(H(t)) \pi_k \\ \end{align} $$
Integrating both sides we find $$\log(F_k(t)) = \log(H(t))\pi_k \quad \forall k$$ (constant of integration is again $0$ by considerations at $\infty$), and so we have: $$\forall 1\le k\le K \quad F_k(t) = H(t)^{\pi_k}.$$ Thus, for any choice of CDF $H$, we obtain independent random variables $X_k \sim F_k = H^{\pi_k}$ which satisfy our requirement.
We can now use Inverse Transform Sampling to sample from these distributions. That is, we know that if $U_k \sim \mathcal{U}([0, 1])$ then $F_k^{-1}(U_k) = H^{-1}\left(U_k^{\frac{1}{\pi_k}}\right)$ is distributed according to $F_k$.
But, actually, notice that we are really free in our choice of $H^{-1}$: all we need is for it to be a strictly increasing function defined on the unit interval! So, we can stop worrying about $H$ or $h$ and we can just say:
Given a categorical distribution $\vec{\pi} = (\pi_1, \ldots, \pi_K)$, a strictly increasing function $f \colon (0, 1) \rightarrow \mathbb{R}$, and independent $U_k \sim \mathcal{U}((0, 1)), \; 1\le k \le K$, the random variables $X_k \colon= f\left(U_k^{\frac{1}{\pi_k}}\right)$ satisfy $$\mathbb{P}\left(\mathrm{argmax}\left(\vec{X}\right) = k\right) = \pi_k, \quad \forall \, 1\le k \le K.$$
One obvious choice is to take $f$ to be the identity function, giving us $$X_k = U_k^{\frac{1}{\pi_k}}.$$
Another option is to take $f = \log$, giving us $$X_k = \frac{\log(U_k)}{\pi_k}.$$ Arguably, the most popular choice is to take $f(t) = -\log(-\log(t))$, which gives: $$X_k = -\log\left(-\log\left(U_k^{\frac{1}{\pi_k}}\right)\right) = -\log(-\log(U_k)) + \log(\pi_k) = G_k + \log(\pi_k),$$ where $G_k \sim \mathrm{Gumbel}(0, 1)$.
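Just to emphasise how much freedom we have in choosing $f$, here is a quick Monte-Carlo check (a toy sketch of my own, not something you would use in practice) with yet another strictly increasing function, the logit $f(t) = \log\frac{t}{1-t}$:

# Sanity check of the claim with f = logit.
logit_f(t) = log(t / (1 - t))
probs = [0.1, 0.2, 0.3, 0.4]
counts = zeros(Int, length(probs))
for _ in 1:100_000
    x = logit_f.(rand(length(probs)) .^ (1 ./ probs))  # X_k = f(U_k^(1/π_k))
    counts[argmax(x)] += 1
end
println(counts ./ sum(counts))  # should be close to probs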
And so, with the choice $f(t) = -\log(-\log(t))$, we've "rediscovered" the Gumbel-Max trick... almost. The last piece is to notice that the function $f$ can itself depend on $\vec{\pi}$, and this can be quite useful if, for example, the distribution $\vec{\pi}$ is computed by some neural network in an unnormalised form. That is, suppose our network outputs the log-potentials $\vec{\ell} = (\ell_1, \ldots, \ell_K) \in \mathbb{R}^K$, so that the probabilities are $\pi_k \propto \exp(\ell_k)$; to compute them we would need the normalising constant $Z(\vec{\ell}) = \sum_{i=1}^K \exp(\ell_i)$. The cool thing is, we don't actually need to compute the normalising constant to generate variables $X_k$ with the desired property. For example, we could set $f(t) = -\log(-\log(t)) + \mathrm{LSE}(\vec{\ell})$, where $\mathrm{LSE}$ is "LogSumExp", i.e. $\mathrm{LSE}(\vec{\ell}) \colon= \log\left(\sum_{i=1}^K \exp(\ell_i)\right)$. Then, since $\log(\pi_k) = \ell_k - \mathrm{LSE}(\vec{\ell})$, we get $$X_k = G_k + \log(\pi_k) + \mathrm{LSE}(\vec{\ell}) = G_k + \ell_k,$$ which is the more general form of the Gumbel-Max trick: it lets us sample from a categorical distribution knowing only its log-potentials.
I’ll end with some Julia code showing a few simulations which I used as a sanity check that it all works. In the next post I’ll talk about the Gumbel-Softmax trick and how I think about temperature annealing. See you then!
import Pkg; Pkg.add("StatsPlots")  # install StatsPlots if it isn't already available
import Random
import StatsPlots
# Functions that operate on u and pi
f_1 = (u, pi) -> u .^ (1 ./ pi) # identity
f_2 = (u, pi) -> log.(u) ./ pi # log
f_3 = (u, pi) -> -log.(-log.(u)) .+ log.(pi) # Gumbel
# Functions that only use the log-potentials
g_1 = (u, logit) -> u .^ (1 ./ exp.(logit)) # identity
g_2 = (u, logit) -> log.(u) ./ exp.(logit) # log
g_3 = (u, logit) -> -log.(-log.(u)) .+ logit # Gumbel
K = 12
logits = randn(Float64, (1, K))               # random log-potentials
unnormalized_pi = exp.(logits)
pi = unnormalized_pi ./ sum(unnormalized_pi)  # note: this shadows Julia's built-in `pi`
# Compute x = f(u, pi), take the argmax of each row, and return the empirical
# distribution of the winning indices. `pi` holds either the probabilities or
# the log-potentials, depending on which f is passed in.
function get_empirical_dist_from_argmaxes(f, u, pi)
    x = f.(u, pi)
    argmaxes = mapslices(argmax, x, dims=2)
    counts = [0 for i in pi]
    for val in argmaxes
        counts[val] += 1
    end
    return counts ./ sum(counts)
end
reps = 10000
dists = Dict("π"=>pi)
for (name_f, name_g, f, g) in zip(["f₁", "f₂", "f₃"], ["g₁", "g₂", "g₃"], [f_1, f_2, f_3], [g_1, g_2, g_3])
    dists[name_f] = get_empirical_dist_from_argmaxes(f, Random.rand(Float64, (reps, K)), pi)
    dists[name_g] = get_empirical_dist_from_argmaxes(g, Random.rand(Float64, (reps, K)), logits)
end
names_to_plot = ["π", "f₁", "f₂", "f₃", "g₁", "g₂", "g₃"]
dists_to_plot = vcat([dists[name] for name in names_to_plot]...)
bar_chart = StatsPlots.groupedbar(
    vec(dists_to_plot),
    group=repeat(names_to_plot, outer=K),
    xlabel="class"
)
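As an extra numerical sanity check (on top of the bar chart), we can also print how far each empirical distribution is from the true probabilities; the deviations should be small, roughly on the order of $1/\sqrt{\mathrm{reps}}$ (the "π" entry is trivially zero):

# Maximum absolute deviation of each empirical distribution from pi.
for name in names_to_plot
    println(name, ": ", maximum(abs.(vec(dists[name]) .- vec(pi))))
end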