Introduction to RKHS

These notes were prepared with help from Dino Sejdinovic for a talk to a kernel methods reading group given in Oxford in July 2015. The notes are fairly dry; I intend to come back and add more comments on the intuition in the future (in particular for the second part).

Basic definitions

A Hilbert space $\mathcal H$ of functions $f:\mathcal X\to \mathbb R$ defined on a non-empty set $\mathcal X$ is said to be a reproducing kernel Hilbert space (RKHS) if the evaluation functionals $\delta_x: f\mapsto f(x)$ are continuous for all $x\in\mathcal X$.

If we consider an RKHS then, by the Riesz representation theorem, since $\delta_x$ is a continuous linear functional, it has a representer in $\mathcal H$, which we denote $k_x$, such that

$$\left\langle f, k_{x}\right\rangle_{\mathcal H} = \delta_x(f) = f(x).$$

We can then define a symmetric, positive-definite function $k:\mathcal X\times\mathcal X \to \mathbb R$ by $k(x, x'):=\left\langle k_{x}, k_{x'}\right\rangle_{\mathcal H}$. This is known as the reproducing kernel of $\mathcal H$; we will also write $k_{x} = k(\cdot, x)$.

There is then an important theorem linking the two (which we will not prove, though a detailed proof can be found here), reproduced in a simplified form below:

(Moore-Aronszajn) Every symmetric, positive-definite function $k$ is the reproducing kernel of a unique Hilbert space $\mathcal H_k$.

When the kernel is clear from the context, we will simply write $\mathcal H$ for the RKHS and $k$ for its reproducing kernel.

In essence, we consider spaces of functions that can be characterised by a kernel function. As will be seen, this kernel encapsulates important properties of the functions found in the space (e.g., how smooth they are).
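As a concrete (and purely illustrative) example, the following Python sketch builds the Gram matrix of the Gaussian RBF kernel $k(x, x') = \exp(-\sigma\|x - x'\|_2^2)$ used later in these notes, and checks numerically that it is symmetric and positive semi-definite; the function names are mine, not from any particular library.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel k(x, x') = exp(-sigma * ||x - x'||^2)."""
    diff = np.atleast_1d(x) - np.atleast_1d(y)
    return np.exp(-sigma * np.dot(diff, diff))

def gram_matrix(points, kernel):
    """Gram matrix K with entries K[i, j] = k(x_i, x_j)."""
    n = len(points)
    return np.array([[kernel(points[i], points[j]) for j in range(n)] for i in range(n)])

points = np.random.randn(5, 2)                    # five points in R^2
K = gram_matrix(points, rbf_kernel)
print(np.allclose(K, K.T))                        # symmetric: True
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))    # positive semi-definite: True
```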

Kernel embedding of a distribution

A classical way to try to represent points in a space $\mathcal X$ is to embed them in $\mathbb R^s$ using an $s$-dimensional feature map $\Phi:\mathcal X\to \mathbb R^s$ with
$$x \mapsto \Phi(x) = (\varphi_1(x), \varphi_2(x), \dots, \varphi_s(x)).$$
Instead, we can now consider embedding points in an RKHS with the (typically infinite-dimensional) feature map $x\mapsto k_{x}$. In that case, we still have an easily computable inner product between points:

$$\left\langle k_{x}, k_{y}\right\rangle_{\mathcal H} = \left\langle k(\cdot, x), k(\cdot, y)\right\rangle_{\mathcal H} = k(x, y).$$

Recall that an inner product is a measure of alignment, so the kernel automatically gives us a measure of similarity between points.

When the corresponding embedding of probability distributions (presented below) is injective (i.e., different distributions are mapped to different points in the RKHS), the kernel is said to be characteristic (this is often the case for standard kernels).
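To make the link between explicit finite-dimensional feature maps and kernel evaluations concrete, here is a minimal sketch using the degree-2 homogeneous polynomial kernel on $\mathbb R^2$ (a standard textbook example, not taken from these notes); the inner product of the explicit features coincides with a single kernel evaluation.

```python
import numpy as np

def phi(x):
    """Explicit feature map for k(x, y) = (<x, y>)^2 on R^2:
    Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(x, y):
    """Same inner product, computed without building the features."""
    return np.dot(x, y) ** 2

x, y = np.random.randn(2), np.random.randn(2)
print(np.isclose(np.dot(phi(x), phi(y)), poly2_kernel(x, y)))  # True
```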

Mean embedding

Probability distributions are an example of objects we can embed in an RKHS. Each distribution is then considered as a point, which we embed through its mean embedding.

Let $P$ denote a distribution in a set $\mathcal P(\Omega)$ of distributions over some space $\Omega$; the mean embedding is defined as follows:

$$P \mapsto \mu(P; k) := \mathbb E_{X\sim P}[k(\cdot, X)] = \mathbb E_{X\sim P}[k_{X}],$$

where $\mathbb E_{X\sim P}$ denotes the expectation with respect to the random variable $X$ distributed as $P$. Naturally, $\mu(P; k)\in\mathcal H$. When the kernel and distribution are clear from the context, we will simply write $\mu(P; k)=\mathbb E_X[k_X]=:\mu_{X}$ where, implicitly, $X\sim P$.

Observe that we now have this RKHS embedding instead of a finite-dimensional embedding that we could have considered, such as $P\mapsto (\mathbb E[\varphi_1(X)], \dots, \mathbb E[\varphi_s(X)])$. Also, as before, we inherit a notion of similarity between points (here, probability distributions) by considering the inner product on the RKHS:
$$\left\langle \mu(P; k), \mu(Q; k)\right\rangle_{\mathcal H} = \mathbb E_{X Y}[k(X, Y)],$$
where $X\sim P$ and $Y\sim Q$ are independent; this can easily be estimated if we have samples from both $P$ and $Q$.

Note finally that $\mu_{X}$ represents expectations with respect to $P$, i.e., for any $f\in\mathcal H$,
$$\left\langle f, \mu_{X}\right\rangle_{\mathcal H} = \mathbb E_X\left[\left\langle f, k_{X}\right\rangle_{\mathcal H}\right] = \mathbb E_X[f(X)].$$
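In practice these objects are handled through samples (the finite-sample estimators are made precise later in these notes). As a preview, here is a small Python sketch, with names chosen for illustration, of the empirical mean embedding evaluated at a point and of the plug-in estimate of the inner product $\left\langle \mu_P, \mu_Q\right\rangle_{\mathcal H} = \mathbb E[k(X, Y)]$.

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    return np.exp(-sigma * np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def empirical_mean_embedding(samples, kernel):
    """Return the function t -> (1/n) sum_i k(t, x_i), i.e. the empirical mu_X
    evaluated pointwise (its value at t equals <k_t, mu_hat_X>)."""
    return lambda t: np.mean([kernel(t, x) for x in samples])

def embedding_inner_product(xs, ys, kernel):
    """Plug-in estimate of <mu_P, mu_Q> = E[k(X, Y)] with X ~ P and Y ~ Q independent."""
    return np.mean([[kernel(x, y) for y in ys] for x in xs])

xs = np.random.randn(200)           # samples from P
ys = 0.5 + np.random.randn(300)     # samples from Q
mu_hat = empirical_mean_embedding(xs, rbf)
print(mu_hat(0.0))                              # value of mu_hat_X at t = 0
print(embedding_inner_product(xs, ys, rbf))     # estimate of <mu_P, mu_Q>
```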

Joint embedding

The generalisation to joint distributions is straightforward using tensor product feature spaces.

Let $X, Y$ be jointly distributed according to some distribution $P$; the joint embedding of $P$ is defined as
$$P \mapsto \mathcal C_{XY}(P) := \mathbb E_{XY}[k_{X}\otimes k_{Y}],$$
assuming we use the same kernel for both variables.

The tensor product satisfies $\left\langle k_{x} \otimes k_{y}, k_{x'} \otimes k_{y'}\right\rangle_{\mathcal H\otimes \mathcal H} = k(x, x')\,k(y, y')$.

In the same way that $\mu_{X}$ represents the expectation operator, the joint embedding $\mathcal C_{XY}$ can be viewed as the uncentered cross-covariance operator: for any two functions $f, g \in \mathcal H$, their uncentered covariance is given by

$$\mathbb E_{XY}[f(X)g(Y)] = \left\langle f\otimes g, \mathcal C_{XY}\right\rangle_{\mathcal H\otimes \mathcal H} = \left\langle f, \mathcal C_{XY}\, g\right\rangle_{\mathcal H}.$$

Following the same reasoning, we can define the auto-covariance operators $\mathcal C_{XX}$ and $\mathcal C_{YY}$. Note also that, where $\mu_{X}$ represents expectations with respect to $P$ (the distribution of $X$), these operators represent cross-covariances/auto-covariances with respect to $P$.

Remark: we have assumed that both random variables share the same kernel, but this need not be the case: we can consider a second kernel $k'$ with RKHS $\mathcal H'$; the cross-covariance operator then belongs to the product space $\mathcal H\otimes \mathcal H'$ (which is also an RKHS).
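To illustrate the covariance interpretation numerically (a minimal sketch; here $f$ and $g$ are arbitrary test functions chosen for illustration, whereas in the notes they are assumed to lie in $\mathcal H$), the empirical version of $\left\langle f\otimes g, \mathcal C_{XY}\right\rangle$ is simply the sample average of $f(X)g(Y)$:

```python
import numpy as np

def empirical_cross_covariance(f, g, xs, ys):
    """<f (x) g, C_hat_XY> = (1/n) sum_i f(x_i) g(y_i), the uncentered covariance."""
    return np.mean([f(x) * g(y) for x, y in zip(xs, ys)])

rng = np.random.default_rng(0)
xs = rng.normal(size=1000)
ys = xs + 0.1 * rng.normal(size=1000)      # Y strongly dependent on X
print(empirical_cross_covariance(np.sin, np.cos, xs, ys))  # estimate of E[sin(X) cos(Y)]
```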

MMD and HSIC

We have now seen two ways of embedding a distribution into an RKHS; let us now see why this can be useful. When considering a characteristic kernel (such as, for example, the Gaussian RBF kernel with $k(x, x')=\exp(-\sigma\|x-x'\|^2_2)$), the RKHS embedding of distributions is injective. In that case, we can use the distance in the RKHS (induced by the inner product) as a proxy for how close two distributions are. This can be used in the two-sample test (to test whether two random variables are distributed according to the same distribution) or when testing for independence between random variables.

In the kernel two-sample test (Gretton et al. (2012a)), the test statistic is the squared distance between the embeddings of the two distributions:

The kernel Maximum Mean Discrepancy (MMD) is defined for two distributions $P$ and $Q$ by
$$\mathrm{MMD}(P, Q) := \|\mu_{X} - \mu_{Y}\|_{\mathcal H},$$
where $X\sim P$ and $Y\sim Q$; the squared distance mentioned above is then $\mathrm{MMD}^2(P, Q)$.

It is then possible to study the asymptotic properties of both the MMD and the MMD² and to build hypothesis tests at a given level, independently of the distributions considered (see Gretton et al. (2012a), Corollaries 9 and 11).
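In practice, one common way to calibrate such a test (a generic sketch, not the specific constructions of the corollaries cited above; all names are illustrative) is a permutation test: pool the two samples, repeatedly re-split them at random, and compare the observed statistic with its permutation distribution.

```python
import numpy as np

def mmd2(xs, ys, kernel):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    Kxx = np.array([[kernel(a, b) for b in xs] for a in xs])
    Kyy = np.array([[kernel(a, b) for b in ys] for a in ys])
    Kxy = np.array([[kernel(a, b) for b in ys] for a in xs])
    return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()

def permutation_test(xs, ys, kernel, n_perm=200, rng=None):
    """Approximate p-value for H0: P = Q using the permutation distribution of MMD^2."""
    if rng is None:
        rng = np.random.default_rng()
    pooled, n = np.concatenate([xs, ys]), len(xs)
    observed = mmd2(xs, ys, kernel)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        exceed += mmd2(perm[:n], perm[n:], kernel) >= observed
    return (exceed + 1) / (n_perm + 1)   # small p-value => reject P = Q

rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))
print(permutation_test(np.random.randn(50), np.random.randn(50) + 1.0, rbf, n_perm=100))
```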

When testing for independence, the test statistic can be the squared distance between the embedding of the joint distribution and that of the product of its marginals, which leads to the kernel independence test (Gretton et al. (2007)):

The Hilbert-Schmidt Independence Criterion (HSIC) is defined, for a pair $(X, Y)$ with joint embedding $\mathcal C_{XY}$ and marginal mean embeddings $\mu_{X}$ and $\mu_{Y}$, by
$$\mathrm{HSIC}(X, Y) := \|\mathcal C_{XY} - \mu_{X} \otimes \mu_{Y}\|_{\mathcal H\otimes\mathcal H}^2.$$

Again, it is then possible to study the asymptotic properties of the HSIC and build a hypothesis test at a given level, independently of the distributions considered (cf. Gretton et al. (2007)).
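A commonly used biased empirical estimator of HSIC is $n^{-2}\,\mathrm{tr}(KHLH)$, where $K$ and $L$ are the Gram matrices of the two samples and $H = I - n^{-1}\mathbf 1\mathbf 1^t$ is the centring matrix (normalisation conventions vary across papers, e.g. $(n-1)^{-2}$; the sketch below is illustrative, not the exact statistic of the cited paper):

```python
import numpy as np

def hsic_biased(xs, ys, kernel_x, kernel_y):
    """Biased empirical HSIC: n^{-2} tr(K H L H), with H = I - (1/n) 11^t."""
    n = len(xs)
    K = np.array([[kernel_x(a, b) for b in xs] for a in xs])
    L = np.array([[kernel_y(a, b) for b in ys] for a in ys])
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n ** 2

rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))
rng = np.random.default_rng(0)
xs = rng.normal(size=200)
ys = xs ** 2 + 0.1 * rng.normal(size=200)   # strongly dependent on xs
zs = rng.normal(size=200)                   # independent of xs
print(hsic_biased(xs, ys, rbf, rbf))        # relatively large
print(hsic_biased(xs, zs, rbf, rbf))        # close to zero
```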

Finite sample embeddings

All of the embeddings defined above can readily be estimated using samples drawn from the distributions of interest. Let $\{x_1, \dots, x_n\}$ be an iid draw from $P$; the empirical kernel mean embedding is defined as

$$\widehat{\mu}_{X} := n^{-1}\sum_{i=1}^n k_{x_i}.$$

As for standard Monte Carlo estimators, the rate of convergence is $\mathcal O(1/\sqrt{n})$ (and hence does not depend upon the dimensionality of the underlying space). Similarly, for an iid draw of pairs $\{(x_1,y_1), \dots, (x_n, y_n)\}$, the empirical cross-covariance operator is defined as

$$\widehat{\mathcal C}_{XY} = n^{-1}\sum_{i=1}^n k_{x_i} \otimes k_{y_i} = n^{-1}\Upsilon\Phi^t,$$
where $\Upsilon := (k_{x_1}, \dots, k_{x_n})$ and $\Phi:=(k_{y_1}, \dots, k_{y_n})$ are the feature matrices.

To conclude, it is straightforward to obtain empirical estimators for the MMD and HSIC criteria by considering the kernel elements $k(x_i, x_j)$, $k(y_i, y_j)$ and $k(x_i, y_j)$. In the case of the squared MMD, for instance, one has (assuming, for simplicity, $n$ samples from each distribution):

$$\widehat{\mathrm{MMD}^2}(P, Q) = \frac{1}{n^{2}}\sum_{i,j=1}^n \left(k(x_{i},x_{j})+k(y_{i},y_{j})-2k(x_{i},y_{j})\right).$$
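In code, this estimator is a direct transcription of the formula above (a vectorised sketch with a Gaussian kernel; the function and variable names are illustrative):

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    """Gram matrix [exp(-sigma * ||a_i - b_j||^2)] for samples stored as rows."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sigma * sq_dists)

def mmd2_hat(X, Y, sigma=1.0):
    """(1/n^2) sum_{i,j} [k(x_i,x_j) + k(y_i,y_j) - 2 k(x_i,y_j)]; assumes equal sample sizes."""
    n = len(X)
    S = gaussian_gram(X, X, sigma) + gaussian_gram(Y, Y, sigma) - 2.0 * gaussian_gram(X, Y, sigma)
    return S.sum() / n ** 2

X  = np.random.randn(100, 2)          # samples from P
X2 = np.random.randn(100, 2)          # an independent draw from the same P
Y  = np.random.randn(100, 2) + 1.0    # samples from a shifted distribution Q
print(mmd2_hat(X, X2))                # close to zero
print(mmd2_hat(X, Y))                 # clearly positive
```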

Kernel embeddings of conditional distributions

Pointwise definition

In line with the definitions presented earlier, the kernel embedding of a conditional distribution $P(Y|X)$ is defined as

$$\mu_{Y|x} = \mathbb E_{Y|x}[k_{Y}],$$

and the conditional expectation of a function $g\in \mathcal H$ can be expressed as

$$\mathbb E_{Y|x}[g(Y)] = \left\langle g, \mu_{Y|x}\right\rangle_{\mathcal H}.$$

Note that we now have a family of points in the RKHS indexed by $x$, the value upon which we condition.

Conditional operator

We can also define an operator $\mathcal C_{Y|X}$ such that

$$\mu_{Y|x} = \mathcal C_{Y|X}\, k_x.$$

To do so, we must first introduce a result for which we will provide a partial proof (the full proof can be found in Fukumizu et al. (2004)).

The following identity holds (under mild technical conditions, viewing $x\mapsto \mathbb E_{Y|X=x}[g(Y)]$ as an element of $\mathcal H$):

$$\mathcal C_{XX}\, \mathbb E_{Y|X}[g(Y)] = \mathcal C_{XY}\, g.$$

To prove this, note that for $f \in \mathcal H$, using the definition of the joint embedding, we have

$$\begin{aligned} \left\langle f, \mathcal C_{XX}\, \mathbb E_{Y|X}[g(Y)]\right\rangle_{\mathcal H} &= \mathbb E_X\big[f(X)\,\mathbb E_{Y|X}[g(Y)]\big] \\ &= \mathbb E_X\big[\mathbb E_{Y|X}[f(X)g(Y)]\big] = \mathbb E_{XY}[f(X)g(Y)],\end{aligned}$$
where at the last equality we used the tower property of expectations. Comparing with the defining property of the cross-covariance operator $\mathcal C_{XY}$ given earlier, we obtain the identity.

With this, we can show another nice identity. Using the definition of $\mu_{Y|x}$, we have
$$\left\langle g, \mu_{Y|x}\right\rangle_{\mathcal H} = \mathbb E_{Y|x}[g(Y)] = \left\langle \mathbb E_{Y|X}[g(Y)], k_x\right\rangle_{\mathcal H}.$$
But, using the identity above, we have
$$\left\langle \mathbb E_{Y|X}[g(Y)], k_x\right\rangle_{\mathcal H} = \left\langle \mathcal C_{XX}^{-1}\mathcal C_{XY}\, g, k_x\right\rangle_{\mathcal H} = \left\langle g, \mathcal C_{YX}\,\mathcal C_{XX}^{-1} k_x\right\rangle_{\mathcal H},$$
where at the last equality we took the adjoint of $\mathcal C_{XX}^{-1}\mathcal C_{XY}$, which allows us to introduce the definition that follows.

The conditional embedding operator is defined as

$$\mathcal C_{Y|X} := \mathcal C_{YX}\,\mathcal C_{XX}^{-1}.$$

In practice, $\mathcal C_{XX}$ is a compact operator, which means that its eigenvalues go to zero and hence its inverse is not a bounded operator. So the definition of $\mathcal C_{Y|X}$ given above is a slight abuse of notation. The inversion of $\mathcal C_{XX}$ can be replaced by the regularised inverse $(\mathcal C_{XX}+\lambda I)^{-1}$, where $\lambda$ is a positive factor that can be determined by cross-validation.

Finite sample kernel estimator

If we consider an iid draw $\{(x_i, y_i)\}_{i=1}^n$ from a joint distribution $P$, we know that the empirical estimators for $\mathcal C_{YX}$ and $\mathcal C_{XX}$ can be written as

$$\widehat{\mathcal C}_{YX} = {1\over n}\Phi\Upsilon^t\quad\text{and}\quad \widehat{\mathcal C}_{XX} = {1\over n}\Upsilon\Upsilon^t,$$

where $\Phi$ and $\Upsilon$ are the feature matrices defined as before. Using a trick from linear algebra for the regularised inverse (a push-through identity that can be obtained from Woodbury's formula), we have:

$$\widehat{\mathcal C}_{Y|X} = n^{-1}\Phi\,\Upsilon^t\left(\lambda I + n^{-1}\Upsilon\Upsilon^t\right)^{-1} = \Phi\left(\Upsilon^t\Upsilon + n\lambda I\right)^{-1}\Upsilon^t,$$

which, absorbing the factor $n$ into the regularisation parameter $\lambda$, gives the following estimator for the conditional embedding operator.

A finite sample estimator for the conditional embedding operator is

$$\widehat{\mathcal C}_{Y|X} = \Phi[K+\lambda I]^{-1}\Upsilon^t,$$

where $K:=\Upsilon^t\Upsilon$ is the Gram matrix, with entries $K_{ij}=k(x_i,x_j)$.

The regularisation parameter $\lambda$ helps to control for overfitting. The resulting kernel embedding is then

$$\widehat\mu_{Y|x} = \Phi\,\beta(x),$$

where $\beta(x):= (K+\lambda I)^{-1}K_{:x}$ and $K_{:x}:=[k(x, x_i)]_{i=1}^n$. It is thus a weighted sum of the samples of $Y$ in the feature space, with weights depending on the conditioning value $x$.
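The following sketch (with illustrative names and a Gaussian kernel as an example) computes the weights $\beta(x)$ from the Gram matrix and uses them to estimate a conditional expectation via $\mathbb E[g(Y)\mid X=x] \approx \left\langle g, \widehat\mu_{Y|x}\right\rangle = \sum_i \beta_i(x)\, g(y_i)$:

```python
import numpy as np

def conditional_weights(x, xs, kernel, lam=1e-3):
    """beta(x) = (K + lam*I)^{-1} K_{:x}, where K is the Gram matrix of the x_i."""
    n = len(xs)
    K = np.array([[kernel(a, b) for b in xs] for a in xs])
    K_x = np.array([kernel(x, xi) for xi in xs])
    return np.linalg.solve(K + lam * np.eye(n), K_x)

def conditional_expectation(g, x, xs, ys, kernel, lam=1e-3):
    """Estimate E[g(Y) | X = x] as sum_i beta_i(x) g(y_i)."""
    beta = conditional_weights(x, xs, kernel, lam)
    return beta @ np.array([g(y) for y in ys])

rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))
rng = np.random.default_rng(0)
xs = rng.uniform(-2, 2, size=300)
ys = np.sin(xs) + 0.1 * rng.normal(size=300)        # Y = sin(X) + noise
print(conditional_expectation(lambda y: y, 1.0, xs, ys, rbf))   # roughly sin(1.0) ~ 0.84
```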

References

  1. Fukumizu, Bach and Jordan, Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces, 2004. A key paper in the RKHS literature.

  2. Gretton, Fukumizu, Hui Teo, Le Song, Schölkopf, Smola, A kernel statistical test of independence, 2007. Paper describing how to perform an independence test using the HSIC.

  3. Gretton, Borgwardt, Rasch, Schölkopf, Smola, A kernel two-sample test, 2012. Paper describing how to perform a two-sample test using the MMD.

  4. Gretton, Sriperumbudur, Sejdinovic, Strathmann, Balakrishnan, Pontil and Fukumizu, Optimal kernel choice for large-scale two-sample tests, 2012. Another paper describing how to perform a two-sample test using the MMD focusing on computational perspectives.

  5. Song, Fukumizu and Gretton, Kernel embeddings of conditional distributions, 2013. The paper these notes are mainly based on.