Introduction to RKHS

These notes were prepared with help from Dino Sejdinovic for a talk to a kernel methods reading group given in Oxford in July 2015. The notes are fairly dry; I intend to come back and add more comments on the intuition in the future (in particular for the second part).

Basic definitions

A Hilbert space $\mathcal H$ of functions $f:\mathcal X\to \mathbb R$ defined on a non-empty set $\mathcal X$ is said to be a reproducing kernel Hilbert space (RKHS) if the evaluation functionals $\delta_x: f\mapsto f(x)$ are continuous for all $x\in\mathcal X$.

If we consider an RKHS then, by the Riesz representation theorem, since $\delta_x$ is a continuous linear functional, it has a representer in $\mathcal H$, which we denote $k_x$, such that

$$\left\langle f, k_{x}\right\rangle_{\mathcal H} = \delta_x(f) = f(x).$$

We can then define a symmetric, positive-definite function $k:\mathcal X\times\mathcal X \to \mathbb R$ by $k(x, x'):=\left\langle k_{x}, k_{x'}\right\rangle_{\mathcal H}$. This is known as the reproducing kernel of $\mathcal H$; we will also write $k_{x} = k(\cdot, x)$.

There is then an important theorem linking the two (which we will not prove, though a detailed proof can be found here), reproduced in a simplified form below:

(Moore-Aronszajn) Every symmetric, positive-definite function $k$ is the reproducing kernel of a unique Hilbert space $\mathcal H_k$.

When the kernel is clear from the context, we will simply write $\mathcal H$ for the RKHS and $k$ for its reproducing kernel.

In essence, we consider spaces of functions that can be characterised by a kernel function. As will be seen, this kernel encapsulates important properties of the functions found in the space (e.g., how smooth they are).
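As a concrete (and purely illustrative) example, the following Python sketch builds the Gram matrix of the Gaussian RBF kernel $k(x, x') = \exp(-\sigma\|x - x'\|_2^2)$ used later in these notes, and checks numerically that it is symmetric and positive semi-definite; the function names are mine, not from any particular library.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel k(x, x') = exp(-sigma * ||x - x'||^2)."""
    diff = np.atleast_1d(x) - np.atleast_1d(y)
    return np.exp(-sigma * np.dot(diff, diff))

def gram_matrix(points, kernel):
    """Gram matrix K with entries K[i, j] = k(x_i, x_j)."""
    n = len(points)
    return np.array([[kernel(points[i], points[j]) for j in range(n)] for i in range(n)])

points = np.random.randn(5, 2)                    # five points in R^2
K = gram_matrix(points, rbf_kernel)
print(np.allclose(K, K.T))                        # symmetric: True
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))    # positive semi-definite: True
```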

Kernel embedding of a distribution

A classical way to try to represent points in a space $\mathcal X$ is to embed them in $\mathbb R^s$ using an $s$-dimensional feature map $\Phi:\mathcal X\to \mathbb R^s$ with
$$x \mapsto \Phi(x) = (\varphi_1(x), \varphi_2(x), \dots, \varphi_s(x)).$$
Instead, we can now consider embedding points in an RKHS with the (typically infinite-dimensional) feature map $x\mapsto k_{x}$. In that case, we still have an easily computable inner product between points:

$$\left\langle k_{x}, k_{y}\right\rangle_{\mathcal H} = \left\langle k(\cdot, x), k(\cdot, y)\right\rangle_{\mathcal H} = k(x, y).$$

Recall that an inner product is a measure of alignment, so the kernel automatically gives us a measure of similarity between points.

When the corresponding embedding of probability distributions (presented below) is injective (i.e., different distributions are mapped to different points in the RKHS), the kernel is said to be characteristic (this is often the case for standard kernels).
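To make the link between explicit finite-dimensional feature maps and kernel evaluations concrete, here is a minimal sketch using the degree-2 homogeneous polynomial kernel on $\mathbb R^2$ (a standard textbook example, not taken from these notes); the inner product of the explicit features coincides with a single kernel evaluation.

```python
import numpy as np

def phi(x):
    """Explicit feature map for k(x, y) = (<x, y>)^2 on R^2:
    Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(x, y):
    """Same inner product, computed without building the features."""
    return np.dot(x, y) ** 2

x, y = np.random.randn(2), np.random.randn(2)
print(np.isclose(np.dot(phi(x), phi(y)), poly2_kernel(x, y)))  # True
```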

Mean embedding

Probability distributions are an example of objects we can embed in an RKHS. Each distribution is then considered as a point, which we embed through its mean embedding.

Let $P$ denote a distribution in a set $\mathcal P(\Omega)$ of distributions over some space $\Omega$; the mean embedding is defined as follows:

$$P \mapsto \mu(P; k) := \mathbb E_{X\sim P}[k(\cdot, X)] = \mathbb E_{X\sim P}[k_{X}],$$

where $\mathbb E_{X\sim P}$ denotes the expectation with respect to the random variable $X$ distributed as $P$. Naturally, $\mu(P; k)\in\mathcal H$. When the kernel and distribution are clear from the context, we will simply write $\mu(P; k)=\mathbb E_X[k_X]=:\mu_{X}$ where, implicitly, $X\sim P$.

Observe that we now have this RKHS embedding instead of a finite-dimensional embedding that we could have considered, such as $P\mapsto (\mathbb E[\varphi_1(X)], \dots, \mathbb E[\varphi_s(X)])$. Also, as before, we inherit a notion of similarity between points (here, probability distributions) by considering the inner product on the RKHS:
$$\left\langle \mu(P; k), \mu(Q; k)\right\rangle_{\mathcal H} = \mathbb E_{X Y}[k(X, Y)],$$
where $X\sim P$ and $Y\sim Q$ are independent; this can easily be estimated if we have samples from both $P$ and $Q$.

Note finally that $\mu_{X}$ represents expectations with respect to $P$, i.e., for any $f\in\mathcal H$,
$$\left\langle f, \mu_{X}\right\rangle_{\mathcal H} = \mathbb E_X\left[\left\langle f, k_{X}\right\rangle_{\mathcal H}\right] = \mathbb E_X[f(X)].$$
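In practice these objects are handled through samples (the finite-sample estimators are made precise later in these notes). As a preview, here is a small Python sketch, with names chosen for illustration, of the empirical mean embedding evaluated at a point and of the plug-in estimate of the inner product $\left\langle \mu_P, \mu_Q\right\rangle_{\mathcal H} = \mathbb E[k(X, Y)]$.

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    return np.exp(-sigma * np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def empirical_mean_embedding(samples, kernel):
    """Return the function t -> (1/n) sum_i k(t, x_i), i.e. the empirical mu_X
    evaluated pointwise (its value at t equals <k_t, mu_hat_X>)."""
    return lambda t: np.mean([kernel(t, x) for x in samples])

def embedding_inner_product(xs, ys, kernel):
    """Plug-in estimate of <mu_P, mu_Q> = E[k(X, Y)] with X ~ P and Y ~ Q independent."""
    return np.mean([[kernel(x, y) for y in ys] for x in xs])

xs = np.random.randn(200)           # samples from P
ys = 0.5 + np.random.randn(300)     # samples from Q
mu_hat = empirical_mean_embedding(xs, rbf)
print(mu_hat(0.0))                              # value of mu_hat_X at t = 0
print(embedding_inner_product(xs, ys, rbf))     # estimate of <mu_P, mu_Q>
```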

Joint embedding

The generalisation to joint distributions is straightforward using tensor product feature spaces.

Let $X, Y$ be jointly distributed according to some distribution $P$; the joint embedding of $P$ is defined as
$$P \mapsto \mathcal C_{XY}(P) := \mathbb E_{XY}[k_{X}\otimes k_{Y}],$$
assuming we use the same kernel for both variables.

The tensor product satisfies $\left\langle k_{x} \otimes k_{y}, k_{x'} \otimes k_{y'}\right\rangle_{\mathcal H\otimes \mathcal H} = k(x, x')\,k(y, y')$.

In the same way that $\mu_{X}$ represents the expectation operator, the joint embedding $\mathcal C_{XY}$ can be viewed as the uncentered cross-covariance operator: for any two functions $f, g \in \mathcal H$, their uncentered covariance is given by

$$\mathbb E_{XY}[f(X)g(Y)] = \left\langle f\otimes g, \mathcal C_{XY}\right\rangle_{\mathcal H\otimes \mathcal H} = \left\langle f, \mathcal C_{XY}\, g\right\rangle_{\mathcal H}.$$

Following the same reasoning, we can define the auto-covariance operators $\mathcal C_{XX}$ and $\mathcal C_{YY}$. Note also that, where $\mu_{X}$ represents expectations with respect to $P$ (the distribution of $X$), these operators represent cross-covariances/auto-covariances with respect to $P$.

Remark: we have assumed that both random variables share the same kernel, but this need not be the case: we can consider a second kernel $k'$ with RKHS $\mathcal H'$; the cross-covariance operator then belongs to the product space $\mathcal H\otimes \mathcal H'$ (which is also an RKHS).
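To illustrate the covariance interpretation numerically (a minimal sketch; here $f$ and $g$ are arbitrary test functions chosen for illustration, whereas in the notes they are assumed to lie in $\mathcal H$), the empirical version of $\left\langle f\otimes g, \mathcal C_{XY}\right\rangle$ is simply the sample average of $f(X)g(Y)$:

```python
import numpy as np

def empirical_cross_covariance(f, g, xs, ys):
    """<f (x) g, C_hat_XY> = (1/n) sum_i f(x_i) g(y_i), the uncentered covariance."""
    return np.mean([f(x) * g(y) for x, y in zip(xs, ys)])

rng = np.random.default_rng(0)
xs = rng.normal(size=1000)
ys = xs + 0.1 * rng.normal(size=1000)      # Y strongly dependent on X
print(empirical_cross_covariance(np.sin, np.cos, xs, ys))  # estimate of E[sin(X) cos(Y)]
```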

MMD and HSIC

We have now seen two ways of embedding a distribution into an RKHS; let us now see why this can be useful. When considering a characteristic kernel (such as, for example, the Gaussian RBF kernel with $k(x, x')=\exp(-\sigma\|x-x'\|^2_2)$), the RKHS embedding of distributions is injective. In that case, we can use the distance in the RKHS (induced by the inner product) as a proxy for how close two distributions are. This can be used in the two-sample test (to test whether two random variables are distributed according to the same distribution) or when testing for independence between random variables.

In the kernel two-sample test (Gretton et al. (2012a)), the test statistic is the squared distance between the embeddings of the two distributions:

The kernel Maximum Mean Discrepancy (MMD) is defined for two distributions $P$ and $Q$ by
$$\mathrm{MMD}(P, Q) := \|\mu_{X} - \mu_{Y}\|_{\mathcal H},$$
where $X\sim P$ and $Y\sim Q$; the squared distance mentioned above is then $\mathrm{MMD}^2(P, Q)$.

It is then possible to study the asymptotic properties of both the MMD and the MMD² and to build hypothesis tests at a given level, independently of the distributions considered (see Gretton et al. (2012a), Corollaries 9 and 11).
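In practice, one common way to calibrate such a test (a generic sketch, not the specific constructions of the corollaries cited above; all names are illustrative) is a permutation test: pool the two samples, repeatedly re-split them at random, and compare the observed statistic with its permutation distribution.

```python
import numpy as np

def mmd2(xs, ys, kernel):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    Kxx = np.array([[kernel(a, b) for b in xs] for a in xs])
    Kyy = np.array([[kernel(a, b) for b in ys] for a in ys])
    Kxy = np.array([[kernel(a, b) for b in ys] for a in xs])
    return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()

def permutation_test(xs, ys, kernel, n_perm=200, rng=None):
    """Approximate p-value for H0: P = Q using the permutation distribution of MMD^2."""
    if rng is None:
        rng = np.random.default_rng()
    pooled, n = np.concatenate([xs, ys]), len(xs)
    observed = mmd2(xs, ys, kernel)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        exceed += mmd2(perm[:n], perm[n:], kernel) >= observed
    return (exceed + 1) / (n_perm + 1)   # small p-value => reject P = Q

rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))
print(permutation_test(np.random.randn(50), np.random.randn(50) + 1.0, rbf, n_perm=100))
```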

When testing for independence, the test statistic can be the squared distance between the embedding of the joint distribution and that of the product of its marginals, which leads to the kernel independence test (Gretton et al. (2007)):

The Hilbert-Schmidt Independence Criterion (HSIC) is defined, for a pair $(X, Y)$ with joint embedding $\mathcal C_{XY}$ and marginal mean embeddings $\mu_{X}$ and $\mu_{Y}$, by
$$\mathrm{HSIC}(X, Y) := \|\mathcal C_{XY} - \mu_{X} \otimes \mu_{Y}\|_{\mathcal H\otimes\mathcal H}^2.$$

Again, it is then possible to study the asymptotic properties of the HSIC and build a hypothesis test at a given level, independently of the distributions considered (cf. Gretton et al. (2007)).
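A commonly used biased empirical estimator of HSIC is $n^{-2}\,\mathrm{tr}(KHLH)$, where $K$ and $L$ are the Gram matrices of the two samples and $H = I - n^{-1}\mathbf 1\mathbf 1^t$ is the centring matrix (normalisation conventions vary across papers, e.g. $(n-1)^{-2}$; the sketch below is illustrative, not the exact statistic of the cited paper):

```python
import numpy as np

def hsic_biased(xs, ys, kernel_x, kernel_y):
    """Biased empirical HSIC: n^{-2} tr(K H L H), with H = I - (1/n) 11^t."""
    n = len(xs)
    K = np.array([[kernel_x(a, b) for b in xs] for a in xs])
    L = np.array([[kernel_y(a, b) for b in ys] for a in ys])
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n ** 2

rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))
rng = np.random.default_rng(0)
xs = rng.normal(size=200)
ys = xs ** 2 + 0.1 * rng.normal(size=200)   # strongly dependent on xs
zs = rng.normal(size=200)                   # independent of xs
print(hsic_biased(xs, ys, rbf, rbf))        # relatively large
print(hsic_biased(xs, zs, rbf, rbf))        # close to zero
```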

Finite sample embeddings

All of the embeddings defined above can readily be estimated using samples drawn from the distributions of interest. Let $\{x_1, \dots, x_n\}$ be an iid draw from $P$; the empirical kernel mean embedding is defined as

$$\widehat{\mu}_{X} := n^{-1}\sum_{i=1}^n k_{x_i}.$$

As for standard Monte Carlo estimators, the rate of convergence is $\mathcal O(1/\sqrt{n})$ (and hence does not depend upon the dimensionality of the underlying space). Similarly, for an iid draw of pairs $\{(x_1,y_1), \dots, (x_n, y_n)\}$, the empirical cross-covariance operator is defined as

$$\widehat{\mathcal C}_{XY} = n^{-1}\sum_{i=1}^n k_{x_i} \otimes k_{y_i} = n^{-1}\Upsilon\Phi^t,$$
where $\Upsilon := (k_{x_1}, \dots, k_{x_n})$ and $\Phi:=(k_{y_1}, \dots, k_{y_n})$ are the feature matrices.

To conclude, it is straightforward to obtain empirical estimators for the MMD and HSIC criteria by considering the kernel elements $k(x_i, x_j)$, $k(y_i, y_j)$ and $k(x_i, y_j)$. In the case of the squared MMD, for instance, one has (assuming, for simplicity, $n$ samples from each distribution):

$$\widehat{\mathrm{MMD}^2}(P, Q) = \frac{1}{n^{2}}\sum_{i,j=1}^n \left(k(x_{i},x_{j})+k(y_{i},y_{j})-2k(x_{i},y_{j})\right).$$
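In code, this estimator is a direct transcription of the formula above (a vectorised sketch with a Gaussian kernel; the function and variable names are illustrative):

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    """Gram matrix [exp(-sigma * ||a_i - b_j||^2)] for samples stored as rows."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sigma * sq_dists)

def mmd2_hat(X, Y, sigma=1.0):
    """(1/n^2) sum_{i,j} [k(x_i,x_j) + k(y_i,y_j) - 2 k(x_i,y_j)]; assumes equal sample sizes."""
    n = len(X)
    S = gaussian_gram(X, X, sigma) + gaussian_gram(Y, Y, sigma) - 2.0 * gaussian_gram(X, Y, sigma)
    return S.sum() / n ** 2

X  = np.random.randn(100, 2)          # samples from P
X2 = np.random.randn(100, 2)          # an independent draw from the same P
Y  = np.random.randn(100, 2) + 1.0    # samples from a shifted distribution Q
print(mmd2_hat(X, X2))                # close to zero
print(mmd2_hat(X, Y))                 # clearly positive
```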

Kernel embeddings of conditional distributions

Pointwise definition

In line with the definitions presented earlier, the kernel embedding of a conditional distribution $P(Y|X)$ is defined as

$$\mu_{Y|x} = \mathbb E_{Y|x}[k_{Y}],$$

and the conditional expectation of a function $g\in \mathcal H$ can be expressed as

$$\mathbb E_{Y|x}[g(Y)] = \left\langle g, \mu_{Y|x}\right\rangle_{\mathcal H}.$$

Note that we now have a family of points in the RKHS indexed by $x$, the value upon which we condition.

Conditional operator

We can also define an operator $\mathcal C_{Y|X}$ such that

$$\mu_{Y|x} = \mathcal C_{Y|X}\, k_x.$$

To do so, we must first introduce a result for which we will provide a partial proof (the full proof can be found in Fukumizu et al. (2004)).

The following identity holds (under mild technical conditions, viewing $x\mapsto \mathbb E_{Y|X=x}[g(Y)]$ as an element of $\mathcal H$):

$$\mathcal C_{XX}\, \mathbb E_{Y|X}[g(Y)] = \mathcal C_{XY}\, g.$$

To prove this, note that for $f \in \mathcal H$, using the definition of the joint embedding, we have

$$\begin{aligned} \left\langle f, \mathcal C_{XX}\, \mathbb E_{Y|X}[g(Y)]\right\rangle_{\mathcal H} &= \mathbb E_X\big[f(X)\,\mathbb E_{Y|X}[g(Y)]\big] \\ &= \mathbb E_X\big[\mathbb E_{Y|X}[f(X)g(Y)]\big] = \mathbb E_{XY}[f(X)g(Y)],\end{aligned}$$
where at the last equality we used the tower property of expectations. Comparing with the defining property of the cross-covariance operator $\mathcal C_{XY}$ given earlier, we obtain the identity.

With this, we can show another nice identity. Using the definition of $\mu_{Y|x}$, we have
$$\left\langle g, \mu_{Y|x}\right\rangle_{\mathcal H} = \mathbb E_{Y|x}[g(Y)] = \left\langle \mathbb E_{Y|X}[g(Y)], k_x\right\rangle_{\mathcal H}.$$
But, using the identity above, we have
$$\left\langle \mathbb E_{Y|X}[g(Y)], k_x\right\rangle_{\mathcal H} = \left\langle \mathcal C_{XX}^{-1}\mathcal C_{XY}\, g, k_x\right\rangle_{\mathcal H} = \left\langle g, \mathcal C_{YX}\,\mathcal C_{XX}^{-1} k_x\right\rangle_{\mathcal H},$$
where at the last equality we took the adjoint of $\mathcal C_{XX}^{-1}\mathcal C_{XY}$, which allows us to introduce the definition that follows.

The conditional embedding operator is defined as

$$\mathcal C_{Y|X} := \mathcal C_{YX}\,\mathcal C_{XX}^{-1}.$$

In practice, $\mathcal C_{XX}$ is a compact operator, which means that its eigenvalues go to zero and hence its inverse is not a bounded operator. So the definition of $\mathcal C_{Y|X}$ given above is a slight abuse of notation. The inversion of $\mathcal C_{XX}$ can be replaced by the regularised inverse $(\mathcal C_{XX}+\lambda I)^{-1}$, where $\lambda$ is a positive factor that can be determined by cross-validation.

Finite sample kernel estimator

If we consider an iid draw $\{(x_i, y_i)\}_{i=1}^n$ from a joint distribution $P$, we know that the empirical estimators for $\mathcal C_{YX}$ and $\mathcal C_{XX}$ can be written as

$$\widehat{\mathcal C}_{YX} = {1\over n}\Phi\Upsilon^t\quad\text{and}\quad \widehat{\mathcal C}_{XX} = {1\over n}\Upsilon\Upsilon^t,$$

where $\Phi$ and $\Upsilon$ are the feature matrices defined as before. Using a trick from linear algebra for the regularised inverse (a push-through identity that can be obtained from Woodbury's formula), we have:

$$\widehat{\mathcal C}_{Y|X} = n^{-1}\Phi\,\Upsilon^t\left(\lambda I + n^{-1}\Upsilon\Upsilon^t\right)^{-1} = \Phi\left(\Upsilon^t\Upsilon + n\lambda I\right)^{-1}\Upsilon^t,$$

which, absorbing the factor $n$ into the regularisation parameter $\lambda$, gives the following estimator for the conditional embedding operator.

A finite sample estimator for the conditional embedding operator is

$$\widehat{\mathcal C}_{Y|X} = \Phi[K+\lambda I]^{-1}\Upsilon^t,$$

where $K:=\Upsilon^t\Upsilon$ is the Gram matrix, with entries $K_{ij}=k(x_i,x_j)$.

The regularisation parameter $\lambda$ helps to control for overfitting. The resulting kernel embedding is then

$$\widehat\mu_{Y|x} = \Phi\,\beta(x),$$

where $\beta(x):= (K+\lambda I)^{-1}K_{:x}$ and $K_{:x}:=[k(x, x_i)]_{i=1}^n$. It is thus a weighted sum of the samples of $Y$ in the feature space, with weights depending on the conditioning value $x$.
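The following sketch (with illustrative names and a Gaussian kernel as an example) computes the weights $\beta(x)$ from the Gram matrix and uses them to estimate a conditional expectation via $\mathbb E[g(Y)\mid X=x] \approx \left\langle g, \widehat\mu_{Y|x}\right\rangle = \sum_i \beta_i(x)\, g(y_i)$:

```python
import numpy as np

def conditional_weights(x, xs, kernel, lam=1e-3):
    """beta(x) = (K + lam*I)^{-1} K_{:x}, where K is the Gram matrix of the x_i."""
    n = len(xs)
    K = np.array([[kernel(a, b) for b in xs] for a in xs])
    K_x = np.array([kernel(x, xi) for xi in xs])
    return np.linalg.solve(K + lam * np.eye(n), K_x)

def conditional_expectation(g, x, xs, ys, kernel, lam=1e-3):
    """Estimate E[g(Y) | X = x] as sum_i beta_i(x) g(y_i)."""
    beta = conditional_weights(x, xs, kernel, lam)
    return beta @ np.array([g(y) for y in ys])

rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))
rng = np.random.default_rng(0)
xs = rng.uniform(-2, 2, size=300)
ys = np.sin(xs) + 0.1 * rng.normal(size=300)        # Y = sin(X) + noise
print(conditional_expectation(lambda y: y, 1.0, xs, ys, rbf))   # roughly sin(1.0) ~ 0.84
```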

References

  1. Fukumizu, Bach and Jordan, Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces, 2004. A key paper in the RKHS literature.

  2. Gretton, Fukumizu, Hui Teo, Le Song, Schölkopf, Smola, A kernel statistical test of independence, 2007. Paper describing how to perform an independence test using the HSIC.

  3. Gretton, Borgwardt, Rasch, Schölkopf, Smola, A kernel two-sample test, 2012. Paper describing how to perform a two-sample test using the MMD.

  4. Gretton, Sriperumbudur, Sejdinovic, Strathmann, Balakrishnan, Pontil and Fukumizu, Optimal kernel choice for large-scale two-sample tests, 2012. Another paper describing how to perform a two-sample test using the MMD focusing on computational perspectives.

  5. Song, Fukumizu and Gretton, Kernel embeddings of conditional distributions, 2013. The paper these notes are mainly based on.