Following the notation of Song et al. (2013), we still consider two random variables $X$ and $Y$ with joint distribution $P(X,Y)$ and, additionally, a prior distribution $\pi$ on $Y$.
The kernel sum rule states that the conditional embedding operator $C_{X|Y}$ maps the embedding of the prior $\pi(Y)$ to that of $Q(X)$: $\mu_X^\pi = C_{X|Y}\,\mu_Y^\pi$.
In practice, an estimator $\hat\mu_Y^\pi$ is given in the form $\sum_{i=1}^n \alpha_i k_{\tilde y_i} = \tilde\Phi\alpha$ based on samples $\{\tilde y_i\}_{i=1}^n$, with $\tilde\Phi = (k_{\tilde y_i})_{i=1}^n$. Let's also assume that the conditional embedding operator has been estimated from a sample $\{(x_i, y_i)\}_{i=1}^m$ drawn from the joint distribution, with $\hat C_{X|Y} = \Upsilon(G + \lambda I)^{-1}\Phi^t$ where $\Upsilon = (k_{x_i})_{i=1}^m$, $\Phi = (k_{y_i})_{i=1}^m$, $G_{ij} = k(y_i, y_j)$ and $\tilde G_{ij} = k(y_i, \tilde y_j)$.
The kernel sum rule in the finite-sample case then has the following form:

$$ \hat\mu_X^\pi \;=\; \hat C_{X|Y}\,\hat\mu_Y^\pi \;=\; \Upsilon(G + \lambda I)^{-1}\Phi^t\tilde\Phi\alpha \;=\; \Upsilon(G + \lambda I)^{-1}\tilde G\alpha. $$
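To make this concrete, here is a minimal numpy sketch of the finite-sample sum rule; the Gaussian kernel, the toy data and the value of $\lambda$ are illustrative choices of mine, not part of Song et al.'s prescription.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix with entries exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
m, n, lam = 200, 50, 1e-3

# Joint sample {(x_i, y_i)}_{i=1}^m and a weighted prior sample {(alpha_j, ytilde_j)}_{j=1}^n.
Y = rng.normal(size=(m, 1))
X = Y + 0.1 * rng.normal(size=(m, 1))
Ytilde = rng.normal(loc=0.5, size=(n, 1))
alpha = np.full(n, 1.0 / n)          # weights of hat{mu}_Y^pi = sum_j alpha_j k_{ytilde_j}

G  = gaussian_kernel(Y, Y)           # G_ij      = k(y_i, y_j)
Gt = gaussian_kernel(Y, Ytilde)      # Gtilde_ij = k(y_i, ytilde_j)

# Kernel sum rule, finite-sample form: hat{mu}_X^pi = sum_i mu_i k_{x_i}
# with mu = (G + lam I)^{-1} Gtilde alpha.
mu = np.linalg.solve(G + lam * np.eye(m), Gt @ alpha)
```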
A posterior distribution can be expressed in terms of a prior and a likelihood as
$$ Q(Y|x) \;=\; \frac{P(x|Y)\,\pi(Y)}{Q(x)}, $$
where $Q(x)$ is the relevant normalisation factor. We seek to construct the conditional embedding operator $C_{Y|X}^\pi$.
The kernel Bayes rule reads
$$ \mu_{Y|x}^\pi \;=\; C_{Y|X}^\pi\,k_x \;=\; C_{YX}^\pi\,(C_{XX}^\pi)^{-1} k_x, $$

so that $C_{Y|X}^\pi = C_{YX}^\pi(C_{XX}^\pi)^{-1}$.
Using the sum rule, $C_{XX}^\pi = C_{(XX)|Y}\,\mu_Y^\pi$ and, using the chain rule, $C_{YX}^\pi = (C_{X|Y}\,C_{YY}^\pi)^t$. The finite-sample case can also be obtained (and is a bit messy).
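To give an idea of what that looks like, here is a continuation of the numpy sketch above. The matrix forms $\hat C_{XX}^\pi = \Upsilon\,\mathrm{diag}(\mu)\,\Upsilon^t$ and $\hat C_{YX}^\pi = \tilde\Phi\,\mathrm{diag}(\alpha)\,\tilde G^t(G+\lambda I)^{-1}\Upsilon^t$ are my own spelling-out of the plug-in estimators, and the inverse is a plain Tikhonov regularisation $(\hat C_{XX}^\pi + \delta I)^{-1}$ rather than the squared-regularisation variant used by Fukumizu et al., so treat this as a sketch rather than the reference estimator.

```python
# Finite-sample kernel Bayes rule (sketch, continuing the snippet above), using
#   hat{C}_{XX}^pi = Upsilon diag(mu) Upsilon^t,
#   hat{C}_{YX}^pi = PhiTilde diag(alpha) Gtilde^t (G + lam I)^{-1} Upsilon^t,
# and a plain Tikhonov inverse (hat{C}_{XX}^pi + delta I)^{-1}; delta is a placeholder value.
delta = 1e-3
Kx = gaussian_kernel(X, X)                                        # (Kx)_ij = k(x_i, x_j)
A  = np.linalg.solve(G + lam * np.eye(m), Gt) @ np.diag(alpha)    # m x n

def posterior_weights(x):
    """beta(x) such that hat{mu}_{Y|x}^pi = sum_j beta_j(x) k_{ytilde_j}."""
    kx = gaussian_kernel(X, x.reshape(1, -1)).ravel()             # (k(x_i, x))_i
    return A.T @ np.linalg.solve(Kx @ np.diag(mu) + delta * np.eye(m), kx)

beta = posterior_weights(np.array([0.3]))
```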
Say we're interested in evaluating the expected value of a function $g \in H$ with respect to the posterior $Q(Y|x)$, or in decoding the point $y^\star$ most typical of the posterior. Assume that the embedding $\mu_{Y|x}^\pi$ is given as $\sum_{i=1}^n \beta_i(x) k_{\tilde y_i}$ and that $g = \sum_{i=1}^m \alpha_i k_{y_i}$; then
the kernel Bayes average reads
$$ \langle g, \mu_{Y|x}^\pi\rangle_H \;=\; \alpha^t \tilde G\,\beta(x) \;=\; \sum_{ij} \alpha_i\,\beta_j(x)\,k(y_i, \tilde y_j), $$
and the kernel Bayes posterior decoding reads
$$ y^\star \;=\; \arg\min_y\; -2\,\beta(x)^t\,\tilde k_y + k(y, y), \qquad \text{with } (\tilde k_y)_i = k(\tilde y_i, y). $$
The second expression comes from the minimisation $\min_y \|\mu_{Y|x}^\pi - k_y\|_H^2$.
In general, the optimisation problem is difficult to solve. It corresponds to the so-called "pre-image" problem in kernel methods.
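Continuing the same sketch, the kernel Bayes average is a single bilinear form in the two weight vectors, and one cheap stand-in for the pre-image step (an illustrative heuristic, not a prescription from the post) is to restrict the arg-min to a finite candidate set, for instance the prior sample points themselves.

```python
# Kernel Bayes average: <g, mu_{Y|x}^pi> = sum_ij a_i beta_j(x) k(y_i, ytilde_j)
# for some g = sum_i a_i k_{y_i}; here a is just a random placeholder coefficient vector.
a = rng.normal(size=m)
kba = a @ Gt @ beta

# Posterior decoding: minimise k(y, y) - 2 sum_j beta_j(x) k(ytilde_j, y).
# Crude stand-in for the pre-image problem: search over a finite candidate set
# (here the prior sample points themselves) instead of solving the continuous problem.
candidates = Ytilde
Kc = gaussian_kernel(Ytilde, candidates)                  # k(ytilde_j, candidate_l)
objective = np.diag(gaussian_kernel(candidates, candidates)) - 2 * beta @ Kc
y_star = candidates[np.argmin(objective)]
```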