Probabilistic reasoning with kernel embeddings

Following the notations of Song et al. (2013), we still consider two random variables $X$ and $Y$ with joint distribution $P(X, Y)$ and, additionally, a prior distribution $\pi$ on $Y$.

Kernel sum rule

The marginal distribution of $X$ can be computed by integrating out $Y$ from the joint density, i.e.:

$$ Q(X) \,=\, \int P(X|Y)\,\mathrm{d}\pi(Y) \,=\, \mathbb{E}_{Y\sim\pi}\big[P(X|Y)\big]. $$

Embedding it, we have

$$ \mu_X^\pi \,:=\, \mathbb{E}_{X\sim Q}[k_X] \,=\, \mathbb{E}_{Y\sim\pi}\big[\mathbb{E}_{X|Y}[k_X]\big], $$

which leads to the kernel sum rule.

Let $X$ and $Y$ denote two random variables and $\pi$ a prior on $Y$; the kernel sum rule then reads

$$ \mu_X^\pi \,=\, \mathcal{C}_{X|Y}\,\mu_Y^\pi. $$

This is straightforward to prove using the definition of the conditional embedding (see part 1).

$$ \begin{aligned} \mu_X^\pi \,=\, \mathbb{E}_{Y\sim\pi}\big[\mathbb{E}_{X|Y}[k_X]\big] \,&=\, \mathbb{E}_{Y\sim\pi}[\mathcal{C}_{X|Y}\,k_Y]\\ &=\, \mathcal{C}_{X|Y}\,\mathbb{E}_{Y\sim\pi}[k_Y] \,=\, \mathcal{C}_{X|Y}\,\mu_Y^\pi. \end{aligned} $$

The kernel sum rule shows that the conditional embedding operator $\mathcal{C}_{X|Y}$ maps the embedding of $\pi(Y)$ to that of $Q(X)$.
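Before turning to estimators, a quick sanity check of the rule (this example is not in the paper): in the discrete case with one-hot feature maps, mean embeddings are probability vectors and $\mathcal{C}_{X|Y}$ is the conditional probability table, so the kernel sum rule reduces to ordinary marginalisation.

```python
import numpy as np

# Hypothetical discrete sanity check: with one-hot features, mu_Y^pi is the
# prior probability vector, C_{X|Y} is the table P(X|Y), and C_{X|Y} mu_Y^pi
# is just the marginal Q(X).
rng = np.random.default_rng(0)

P_X_given_Y = rng.random((4, 3))
P_X_given_Y /= P_X_given_Y.sum(axis=0)      # each column is a distribution P(X | Y=y)
prior = np.array([0.2, 0.5, 0.3])           # pi(Y)

mu_X = P_X_given_Y @ prior                  # kernel sum rule: C_{X|Y} mu_Y^pi
Q_X = sum(prior[j] * P_X_given_Y[:, j] for j in range(3))   # direct marginalisation
assert np.allclose(mu_X, Q_X)
```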

In practice, an estimator $\hat\mu_Y^\pi$ is given in the form $\sum_{i=1}^n \alpha_i k_{\tilde y_i} = \widetilde\Phi\alpha$, based on samples $\{\tilde y_i\}_{i=1}^n$. Let's also assume that the conditional embedding operator has been estimated from a sample $\{(x_i, y_i)\}_{i=1}^m$ drawn from the joint distribution, with $\widehat{\mathcal{C}}_{X|Y} = \Upsilon(G + \lambda I)^{-1}\Phi^t$, where $\Upsilon = (k_{x_i})_{i=1}^m$, $\Phi = (k_{y_i})_{i=1}^m$, $G_{ij} = k(y_i, y_j)$ and $\widetilde G_{ij} = k(y_i, \tilde y_j)$.

The kernel sum rule in the finite sample case has the following form:

$$ \hat\mu_X^\pi \,=\, \widehat{\mathcal{C}}_{X|Y}\,\hat\mu_Y^\pi \,=\, \Upsilon(G + \lambda I)^{-1}\widetilde G\,\alpha. $$
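As an illustration (the synthetic data, Gaussian kernel and helper function below are mine, not the paper's), here is a minimal numpy sketch of this finite-sample sum rule in one dimension:

```python
import numpy as np

def gauss_kernel(A, B, s=1.0):
    """Gaussian kernel matrix k(a_i, b_j) for 1-d samples (hypothetical helper)."""
    D = A[:, None] - B[None, :]
    return np.exp(-D**2 / (2 * s**2))

rng = np.random.default_rng(1)
m, n, lam = 200, 50, 1e-3

# joint sample {(x_i, y_i)} used to estimate C_{X|Y}
y = rng.normal(size=m)
x = y + 0.1 * rng.normal(size=m)

# prior embedding mu_Y^pi given as weights alpha on samples {tilde y_j}
y_tilde = rng.normal(loc=0.5, size=n)
alpha = np.full(n, 1.0 / n)

G = gauss_kernel(y, y)                  # G_ij  = k(y_i, y_j)
G_tilde = gauss_kernel(y, y_tilde)      # G~_ij = k(y_i, tilde y_j)

# kernel sum rule: mu_X^pi = Upsilon (G + lam I)^{-1} G~ alpha,
# i.e. a weighted sum of the features k_{x_i} with weights beta
beta = np.linalg.solve(G + lam * np.eye(m), G_tilde @ alpha)

# evaluating the embedding at test points t amounts to <mu_X^pi, k_t>
t = np.linspace(-3, 3, 5)
print(gauss_kernel(t, x) @ beta)
```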

Kernel chain rule

A joint distribution $Q$ can be factorised into a product of a conditional and a marginal: $Q(X, Y) = P(X|Y)\,\pi(Y)$.

The kernel chain rule reads

$$ \mathcal{C}_{XY}^\pi \,=\, \mathcal{C}_{X|Y}\,\mathcal{C}_{YY}^\pi. $$

This is straightforward to prove:

$$ \begin{aligned} \mathcal{C}_{XY}^\pi \,&=\, \mathbb{E}_{(X,Y)\sim Q}[k_X \otimes k_Y] \,=\, \mathbb{E}_{Y\sim\pi}\big[\mathbb{E}_{X|Y}[k_X] \otimes k_Y\big]\\ &=\, \mathcal{C}_{X|Y}\,\mathbb{E}_{Y\sim\pi}[k_Y \otimes k_Y] \,=\, \mathcal{C}_{X|Y}\,\mathcal{C}_{YY}^\pi. \end{aligned} $$

The kernel chain rule in the finite sample case has the following form:

$$ \widehat{\mathcal{C}}_{XY}^\pi \,=\, \widehat{\mathcal{C}}_{X|Y}\,\widehat{\mathcal{C}}_{YY}^\pi \,=\, \Upsilon(G + \lambda I)^{-1}\widetilde G\,\mathrm{diag}(\alpha)\,\widetilde\Phi^t, $$

using $\widehat{\mathcal{C}}_{YY}^\pi = \widetilde\Phi\,\mathrm{diag}(\alpha)\,\widetilde\Phi^t$ and $\widehat{\mathcal{C}}_{X|Y} = \Upsilon(G + \lambda I)^{-1}\Phi^t$.
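Similarly (again only a sketch with synthetic data, not the paper's code), the chain-rule estimator just adds a $\mathrm{diag}(\alpha)$ factor, so $\widehat{\mathcal{C}}_{XY}^\pi = \Upsilon W \widetilde\Phi^t$ is fully described by an $m \times n$ weight matrix $W$:

```python
import numpy as np

def gauss_kernel(A, B, s=1.0):
    """Gaussian kernel matrix k(a_i, b_j) for 1-d samples (hypothetical helper)."""
    D = A[:, None] - B[None, :]
    return np.exp(-D**2 / (2 * s**2))

rng = np.random.default_rng(1)
m, n, lam = 200, 50, 1e-3
y = rng.normal(size=m)                      # joint sample {(x_i, y_i)}
x = y + 0.1 * rng.normal(size=m)
y_tilde = rng.normal(loc=0.5, size=n)       # prior samples {tilde y_j}
alpha = np.full(n, 1.0 / n)                 # prior weights

G = gauss_kernel(y, y)
G_tilde = gauss_kernel(y, y_tilde)

# C^pi_XY = Upsilon (G + lam I)^{-1} G~ diag(alpha) Phi~^t = Upsilon W Phi~^t
W = np.linalg.solve(G + lam * np.eye(m), G_tilde * alpha)   # m x n weight matrix

# applying the operator to a feature k_{y0} gives weights c on the k_{x_i}
y0 = np.array([0.3])
c = W @ gauss_kernel(y_tilde, y0)[:, 0]
```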

Kernel Bayes rule

A posterior distribution can be expressed in terms of a prior and a likelihood as

$$ Q(Y|x) \,=\, \frac{P(x|Y)\,\pi(Y)}{Q(x)}, $$

where $Q(x)$ is the relevant normalisation factor. We seek to construct the conditional embedding operator $\mathcal{C}_{Y|X}^\pi$.

The kernel Bayes rule reads

$$ \mu_{Y|x}^\pi \,=\, \mathcal{C}_{Y|X}^\pi\, k_x \,=\, \mathcal{C}_{YX}^\pi\,(\mathcal{C}_{XX}^\pi)^{-1} k_x, $$

so that $\mathcal{C}_{Y|X}^\pi = \mathcal{C}_{YX}^\pi(\mathcal{C}_{XX}^\pi)^{-1}$.

Using the sum rule, $\mathcal{C}_{XX}^\pi = \mathcal{C}_{(XX)|Y}\,\mu_Y^\pi$ and, using the chain rule, $\mathcal{C}_{YX}^\pi = (\mathcal{C}_{X|Y}\,\mathcal{C}_{YY}^\pi)^t$. The finite-sample case can also be obtained (and is a bit messy).
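The operator composition itself is easy to check numerically when everything is finite dimensional. The sketch below is my own illustration (not the paper's Gram-matrix estimator): it uses a discrete model with one-hot features, so that $\mathcal{C}_{XX}^\pi$ and $\mathcal{C}_{YX}^\pi$ are small matrices, the inverse is a (lightly regularised) matrix inverse, and the resulting $\mu_{Y|x}^\pi$ can be compared with the exact posterior.

```python
import numpy as np

rng = np.random.default_rng(2)
nY, nX = 3, 4

prior = np.array([0.2, 0.5, 0.3])             # pi(Y)
lik = rng.random((nX, nY))
lik /= lik.sum(axis=0)                        # columns: P(X | Y=y)

# population operators under Q(X, Y) = P(X|Y) pi(Y), with one-hot features
joint = lik * prior                           # joint[x, y] = Q(x, y)
C_XX = np.diag(joint.sum(axis=1))             # E_Q[k_X tensor k_X]
C_YX = joint.T                                # E_Q[k_Y tensor k_X]

# kernel Bayes rule: mu_{Y|x}^pi = C^pi_YX (C^pi_XX + eps I)^{-1} k_x
eps = 1e-9
x_obs = 2
k_x = np.eye(nX)[x_obs]
mu_post = C_YX @ np.linalg.solve(C_XX + eps * np.eye(nX), k_x)

# with one-hot features this recovers the exact posterior Q(Y | X = x_obs)
exact_post = joint[x_obs] / joint[x_obs].sum()
assert np.allclose(mu_post, exact_post, atol=1e-6)
```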

Kernel Bayesian average and posterior decoding

Say we're interested in evaluating the expected value of a function $g \in \mathcal{H}$ with respect to the posterior $Q(Y|x)$, or in decoding the point $y^\star$ most typical of the posterior. Assume that the embedding $\hat\mu_{Y|x}^\pi$ is given as $\sum_{i=1}^n \beta_i(x)\,k_{\tilde y_i}$ and that $g = \sum_{i=1}^m \alpha_i k_{y_i}$; then

the kernel Bayes average reads

$$ \big\langle g,\, \hat\mu_{Y|x}^\pi \big\rangle_{\mathcal{H}} \,=\, \sum_{ij} \alpha_i\,\beta_j(x)\,k(y_i, \tilde y_j) \,=\, \alpha^t\,\widetilde G\,\beta(x), $$

and the kernel Bayes posterior decoding reads

$$ y^\star \,=\, \arg\min_{y}\; -2\sum_{j} \beta_j(x)\,k(\tilde y_j, y) + k(y, y). $$

The second expression comes from minimising $\big\|\hat\mu_{Y|x}^\pi - k_y\big\|_{\mathcal{H}}^2$ over $y$ (the term $\|\hat\mu_{Y|x}^\pi\|^2$ does not depend on $y$ and can be dropped).

In general, the optimisation problem is difficult to solve. It corresponds to the so-called "pre-image" problem in kernel methods.
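To make the last two formulas concrete, here is a small sketch assuming one-dimensional outputs, a Gaussian kernel, and placeholder posterior weights $\beta$ (rather than weights actually produced by the kernel Bayes rule); the crude grid search is just one simple way around the pre-image problem in this setting.

```python
import numpy as np

def gauss_kernel(A, B, s=1.0):
    """Gaussian kernel matrix k(a_i, b_j) for 1-d samples (hypothetical helper)."""
    D = A[:, None] - B[None, :]
    return np.exp(-D**2 / (2 * s**2))

rng = np.random.default_rng(3)
m, n = 100, 50
y = rng.normal(size=m)                        # samples carrying g = sum_i alpha_i k_{y_i}
alpha = rng.random(m)
alpha /= alpha.sum()
y_tilde = rng.normal(size=n)                  # samples carrying the posterior embedding
beta = rng.random(n)
beta /= beta.sum()                            # placeholder for beta(x)

# kernel Bayes average: <g, mu_{Y|x}^pi> = sum_ij alpha_i beta_j(x) k(y_i, y~_j)
G_tilde = gauss_kernel(y, y_tilde)
bayes_avg = alpha @ G_tilde @ beta

# posterior decoding: minimise -2 sum_j beta_j(x) k(y~_j, y) + k(y, y) over y;
# for a Gaussian kernel k(y, y) = 1, so a simple 1-d grid search will do here
grid = np.linspace(-3, 3, 601)
objective = -2 * gauss_kernel(grid, y_tilde) @ beta + 1.0
y_star = grid[np.argmin(objective)]
print(bayes_avg, y_star)
```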

References

  1. Song, Fukumizu and Gretton, Kernel embeddings of conditional distributions, 2013. The paper these notes are mainly based on.

See also the references given in the first part.