Introduction to RKHS
These notes were prepared with help from Dino Sejdinovic for a talk to a kernel methods reading group given in Oxford in July 2015. The notes are fairly dry; I intend to come back and add more comments on the intuition in the future (in particular for the second part).
Basic definitions
If we consider a RKHS $\mathcal{H}$ then, by Riesz's representation theorem, since the evaluation functional $\delta_x : f \mapsto f(x)$ is a continuous functional, it has a representer in $\mathcal{H}$ that we can denote $k_x$ such that

$$ f(x) = \langle f, k_x \rangle_{\mathcal{H}}, \qquad \forall f \in \mathcal{H}. $$
We can then define a (positive-definite) bilinear form $k : \mathcal{X}\times\mathcal{X}\to\mathbb{R}$ as $k(x, x') := \langle k_x, k_{x'}\rangle_{\mathcal{H}}$. This is known as the reproducing kernel of $\mathcal{H}$; we will also write $k_x = k(x, \cdot)$.
There's then an important theorem that links the two (and that we won't prove, though a detailed proof can be found here), reproduced in a simplified form below: to every positive-definite kernel $k$ on $\mathcal{X}\times\mathcal{X}$ corresponds a unique RKHS $\mathcal{H}$ of functions on $\mathcal{X}$ whose reproducing kernel is $k$ (the Moore-Aronszajn theorem).
When the kernel is clear from the context, we will simply write $\mathcal{H}$ for the RKHS and $k$ for its reproducing kernel.
In essence we consider spaces of functions that can be characterised by a kernel function. As will be seen, this kernel function encapsulates important properties of the type of functions that can be found in the space (e.g. how smooth they are).
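To give a feel for this, here is a minimal numerical sketch (assuming a Gaussian RBF kernel and NumPy; the setup and helper name are ours, not from the notes). Among RKHS functions interpolating given values $v$ at points $x_1,\dots,x_n$, the minimum-norm one has coefficients $a = K^{-1}v$ and squared RKHS norm $v^{\top}K^{-1}v$; rougher values lead to a much larger norm, reflecting the smoothness encoded by the kernel.

```python
import numpy as np

def rbf_kernel(X, Y, s=1.0):
    # Gaussian RBF kernel matrix: K[i, j] = exp(-||x_i - y_j||^2 / (2 s^2))
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-np.maximum(d2, 0) / (2 * s**2))

# Functions f = sum_i a_i k(x_i, .) lie in the RKHS and satisfy ||f||^2 = a^T K a.
# The minimum-norm interpolant of values v at the x_i has a = K^{-1} v, so ||f||^2 = v^T K^{-1} v.
x = np.linspace(0, 1, 20)[:, None]
K = rbf_kernel(x, x, s=0.1)

v_smooth = np.sin(2 * np.pi * x).ravel()            # slowly varying values
v_rough = np.sign(np.sin(20 * np.pi * x)).ravel()   # rapidly oscillating values

for v in (v_smooth, v_rough):
    a = np.linalg.solve(K + 1e-8 * np.eye(len(x)), v)   # tiny jitter for numerical stability
    print(v @ a)   # squared RKHS norm: much larger for the rough values
```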
Kernel embedding of a distribution
A classical way to try to represent points $x$ of a space $\mathcal{X}$ is to embed them in $\mathbb{R}^d$ using a $d$-dimensional feature map $\phi : \mathcal{X}\to\mathbb{R}^d$ with $x \mapsto \phi(x)$. Instead, we can now consider embedding points in a RKHS $\mathcal{H}$ with the infinite-dimensional feature map $\varphi : x \mapsto k_x$. In that case, we have an easily computable inner product between points, with

$$ \langle \varphi(x), \varphi(x')\rangle_{\mathcal{H}} = \langle k_x, k_{x'}\rangle_{\mathcal{H}} = k(x, x'). $$
Recall that an inner product is a measure of alignment, so the kernel automatically gives us a measure of similarity between points.
When the embedding is injective (i.e.: different objects are mapped to different points in the RKHS), the corresponding kernel is said to be characteristic (this is often the case for standard kernels).
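For concreteness, here is a minimal sketch (assuming NumPy and the Gaussian RBF kernel, which is characteristic; the helper name is ours) showing the kernel as an easily computed similarity between points, i.e. the inner product between their implicit feature maps:

```python
import numpy as np

def rbf_kernel(X, Y, s=1.0):
    # Gaussian RBF kernel matrix: K[i, j] = exp(-||x_i - y_j||^2 / (2 s^2))
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-np.maximum(d2, 0) / (2 * s**2))

# Two nearby points and one far-away point:
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
K = rbf_kernel(X, X, s=1.0)
print(K)   # K[0, 1] is close to 1 (similar points), K[0, 2] is close to 0 (dissimilar points)
```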
Mean embedding
Probability distributions are an example of objects we can embed in a RKHS. Each distribution $P$ is then considered as a point, which we can embed through its mean embedding.
Let $P$ denote a distribution among a set of distributions $\mathcal{P}$ over some space $\mathcal{X}$; the mean embedding of $P$ is defined as follows:

$$ \mu_P := \mathbb{E}_X[\varphi(X)] = \mathbb{E}_X[k_X], $$
where $\mathbb{E}_X$ denotes the expectation with respect to the random variable $X$ distributed as $P$. Naturally, $\mu_P \in \mathcal{H}$. When the kernel and distribution are clear from the context, we will simply write $\mu_X$ where, implicitly, $X \sim P$.
Observe that we now have this continuous embedding instead of a finite-dimensional embedding that we could have considered (such as a vector of moments of the distribution). Also, as before, we inherit a notion of similarity between points (here probability distributions) by considering the inner product on the RKHS: $\langle \mu_P, \mu_Q\rangle_{\mathcal{H}} = \mathbb{E}_{X, X'}[k(X, X')]$ with $X\sim P$ and $X'\sim Q$ independent, and this can easily be estimated if we have samples from both $P$ and $Q$.
Note finally that $\mu_P$ represents expectations with respect to $P$, i.e.: for any $f \in \mathcal{H}$,

$$ \mathbb{E}_X[f(X)] = \langle f, \mu_P \rangle_{\mathcal{H}}. $$
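As a quick numerical illustration (a sketch assuming a Gaussian RBF kernel; the data and helper are ours), the inner product between two empirical mean embeddings reduces to an average of kernel evaluations between the two samples, $\frac{1}{nm}\sum_{i,j} k(x_i, y_j)$:

```python
import numpy as np

def rbf_kernel(X, Y, s=1.0):
    # Gaussian RBF kernel matrix: K[i, j] = exp(-||x_i - y_j||^2 / (2 s^2))
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-np.maximum(d2, 0) / (2 * s**2))

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 1))   # X ~ P = N(0, 1)
Y = rng.normal(1.0, 1.0, size=(500, 1))   # Y ~ Q = N(1, 1)
Z = rng.normal(0.0, 1.0, size=(500, 1))   # Z ~ P (an independent draw)

# <mu_P, mu_Q> is estimated by averaging k(x_i, y_j) over all pairs of samples
inner_PQ = rbf_kernel(X, Y).mean()
inner_PP = rbf_kernel(X, Z).mean()
print(inner_PP, inner_PQ)   # the "P vs P" alignment is typically larger than "P vs Q"
```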
Joint embedding
The generalisation to joint distributions is straightforward using tensor product feature spaces. The joint embedding of the distribution $P_{XY}$ of a pair of random variables $(X, Y)$ is defined as

$$ \mathcal{C}_{XY} := \mathbb{E}_{XY}[\varphi(X)\otimes\varphi(Y)]. $$
The tensor product satisfies $\langle f\otimes g,\, h\otimes \ell\rangle_{\mathcal{H}\otimes\mathcal{H}} = \langle f, h\rangle_{\mathcal{H}}\,\langle g, \ell\rangle_{\mathcal{H}}$.
In the same way that $\mu_P$ represents the expectation operator, the joint embedding $\mathcal{C}_{XY}$ can be viewed as the uncentred cross-covariance operator: for any two functions $f, g \in \mathcal{H}$, their uncentred covariance is given by

$$ \mathbb{E}_{XY}[f(X)\,g(Y)] = \langle f\otimes g,\, \mathcal{C}_{XY}\rangle_{\mathcal{H}\otimes\mathcal{H}} = \langle f,\, \mathcal{C}_{XY}\, g\rangle_{\mathcal{H}}. $$
Following the same reasoning, we can define the auto-covariance operators $\mathcal{C}_{XX}$ and $\mathcal{C}_{YY}$. Note also that, just as $\mu_X$ represents expectations with respect to $P$ (the distribution of $X$), these operators represent cross-covariance/auto-covariance with respect to $P_{XY}$.
Remark: we have assumed that both random variables share the same kernel $k$ but this need not be the case: we can consider a second kernel $l$ with a RKHS $\mathcal{G}$ and feature map $\psi : y \mapsto l(y, \cdot)$; the cross-covariance operator $\mathcal{C}_{XY} = \mathbb{E}_{XY}[\varphi(X)\otimes\psi(Y)]$ then belongs to the product space $\mathcal{H}\otimes\mathcal{G}$ (which is also a RKHS).
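As a small sanity check (a sketch assuming a Gaussian RBF kernel, with $f = k_{x_0}$ and $g = k_{y_0}$ chosen so that both functions are guaranteed to lie in the RKHS), the uncentred covariance $\langle f, \mathcal{C}_{XY}\, g\rangle = \mathbb{E}_{XY}[f(X)g(Y)]$ can be estimated by a simple empirical average over paired samples:

```python
import numpy as np

def rbf_kernel(X, Y, s=1.0):
    # Gaussian RBF kernel matrix: K[i, j] = exp(-||x_i - y_j||^2 / (2 s^2))
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-np.maximum(d2, 0) / (2 * s**2))

rng = np.random.default_rng(5)
m = 2000
X = rng.normal(size=(m, 1))
Y = 0.8 * X + 0.6 * rng.normal(size=(m, 1))        # X and Y are correlated

x0, y0 = np.array([[0.5]]), np.array([[-0.5]])     # f = k(x0, .), g = k(y0, .) both lie in H
f_vals = rbf_kernel(x0, X).ravel()                 # f(x_i) = k(x0, x_i)
g_vals = rbf_kernel(y0, Y).ravel()                 # g(y_i) = k(y0, y_i)

print(np.mean(f_vals * g_vals))                    # estimate of <f, C_XY g> = E[f(X) g(Y)]
print(np.mean(f_vals) * np.mean(g_vals))           # what it would roughly equal under independence
```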
MMD and HSIC
We have now seen two ways of embedding a distribution into a RKHS; now let's see why that can be useful. When considering a characteristic kernel (such as the Gaussian RBF $k(x, x') = \exp(-\|x - x'\|^2/(2\sigma^2))$ with $\sigma > 0$), the RKHS embedding is injective. In that case, we can use the distance in the RKHS (induced by the inner product) as a proxy for similarity in the distribution space. This can be used in the two-sample test (to test whether two random variables are distributed according to the same distribution) or when testing for independence between random variables.
In the kernel two-sample test (Gretton et al. (2012a)), the test statistic is the squared distance between the embeddings of the two distributions, known as the (squared) maximum mean discrepancy (MMD):

$$ \mathrm{MMD}^2(P, Q) := \|\mu_P - \mu_Q\|^2_{\mathcal{H}}. $$
It is then possible to study the asymptotic properties of both the MMD and the MMD² and build hypothesis tests at a given level, independently of the distributions considered (see Gretton et al. (2012a), Corollaries 9 and 11).
When testing for independence, the test statistic can be the squared distance between the embedding of the joint distribution and that of the product of its marginals, known as the Hilbert-Schmidt independence criterion (HSIC), which leads to the kernel independence test (Gretton et al. (2007)):

$$ \mathrm{HSIC}(X, Y) := \|\mathcal{C}_{XY} - \mu_X \otimes \mu_Y\|^2_{\mathcal{H}\otimes\mathcal{H}}. $$
Again, it is then possible to study the asymptotic properties of the HSIC and build a hypothesis test at a given level, independently of the distributions considered (cf. Gretton et al. (2007)).
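For reference, a commonly used (biased) finite-sample estimator of the HSIC is $\frac{1}{m^2}\operatorname{tr}(KHLH)$, where $K$ and $L$ are the Gram matrices of the two samples and $H$ is the centring matrix. A minimal sketch (assuming a Gaussian RBF kernel for both variables; the toy data are ours):

```python
import numpy as np

def rbf_kernel(X, Y, s=1.0):
    # Gaussian RBF kernel matrix: K[i, j] = exp(-||x_i - y_j||^2 / (2 s^2))
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-np.maximum(d2, 0) / (2 * s**2))

def hsic_biased(X, Y, s=1.0):
    """Biased HSIC estimate (1/m^2) tr(K H L H) from paired samples X, Y (m x d arrays)."""
    m = X.shape[0]
    K, L = rbf_kernel(X, X, s), rbf_kernel(Y, Y, s)
    H = np.eye(m) - np.ones((m, m)) / m            # centring matrix
    return np.trace(K @ H @ L @ H) / m**2

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 1))
Y_dep = X + 0.1 * rng.normal(size=(300, 1))        # strongly dependent on X
Y_ind = rng.normal(size=(300, 1))                  # independent of X
print(hsic_biased(X, Y_dep), hsic_biased(X, Y_ind))   # the dependent pair gives a much larger value
```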
Finite sample embeddings
All of the embeddings defined above can readily be estimated using samples drawn from the distributions of interest. Let $\{x_1, \dots, x_n\}$ be an iid draw from $P$; the empirical kernel mean embedding is defined as

$$ \widehat{\mu}_P := \frac{1}{n}\sum_{i=1}^{n} k_{x_i}. $$
As for standard Monte Carlo estimators, the rate of convergence is $O(1/\sqrt{n})$ (and hence does not depend upon the dimensionality of the underlying space). Similarly, for an iid draw of pairs $\{(x_1, y_1), \dots, (x_m, y_m)\}$ from $P_{XY}$, the empirical covariance operator is defined as

$$ \widehat{\mathcal{C}}_{XY} := \frac{1}{m}\sum_{i=1}^{m} k_{x_i}\otimes k_{y_i} = \frac{1}{m}\,\Upsilon\Phi^{\top}, $$
where $\Upsilon := (\varphi(x_1), \dots, \varphi(x_m))$ and $\Phi := (\varphi(y_1), \dots, \varphi(y_m))$ are the feature matrices.
To conclude, it is straightforward to obtain empirical estimators for the MMD and HSIC criteria by considering kernel elements $k(x_i, x_j)$, $k(y_i, y_j)$ and $k(x_i, y_j)$. In the case of the MMD for instance, one has:

$$ \widehat{\mathrm{MMD}}^2(P, Q) = \frac{1}{n^2}\sum_{i,j=1}^{n} k(x_i, x_j) + \frac{1}{m^2}\sum_{i,j=1}^{m} k(y_i, y_j) - \frac{2}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} k(x_i, y_j), $$

where $\{x_i\}_{i=1}^{n}$ and $\{y_j\}_{j=1}^{m}$ are iid draws from $P$ and $Q$ respectively.
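This formula translates directly into code. A minimal sketch (assuming a Gaussian RBF kernel; note this is the plug-in/biased estimator rather than the unbiased U-statistic of Gretton et al. (2012a)):

```python
import numpy as np

def rbf_kernel(X, Y, s=1.0):
    # Gaussian RBF kernel matrix: K[i, j] = exp(-||x_i - y_j||^2 / (2 s^2))
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-np.maximum(d2, 0) / (2 * s**2))

def mmd2_biased(X, Y, s=1.0):
    """Plug-in (biased) estimate of MMD^2 = ||mu_P - mu_Q||^2 from samples X ~ P and Y ~ Q."""
    return rbf_kernel(X, X, s).mean() + rbf_kernel(Y, Y, s).mean() - 2 * rbf_kernel(X, Y, s).mean()

rng = np.random.default_rng(2)
X = rng.normal(0.0, 1.0, size=(500, 1))
Y = rng.normal(1.0, 1.0, size=(500, 1))
Z = rng.normal(0.0, 1.0, size=(500, 1))
print(mmd2_biased(X, Y))   # clearly positive: P != Q
print(mmd2_biased(X, Z))   # much closer to zero: same distribution
```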
Kernel embeddings of conditional distributions
Pointwise definition
In line with the definitions presented earlier, the kernel embedding of a conditional distribution $P(Y \mid X = x)$ is defined as

$$ \mu_{Y|x} := \mathbb{E}_{Y|x}[\varphi(Y)], $$
and the conditional expectation of a function $g \in \mathcal{H}$ can be expressed as

$$ \mathbb{E}_{Y|x}[g(Y)] = \langle g, \mu_{Y|x}\rangle_{\mathcal{H}}. $$
Note that we now have a family of points in the RKHS indexed by $x$, the value upon which we condition.
Conditional operator
We can also define an operator $\mathcal{C}_{Y|X}$ such that

$$ \mu_{Y|x} = \mathcal{C}_{Y|X}\, k_x, \qquad \forall x \in \mathcal{X}. $$
To do so, we must first introduce a result for which we will provide a partial proof (the full proof can be found in Fukumizu et al. (2004)): assuming that $\mathbb{E}_{Y|X}[g(Y)\mid X = \cdot\,] \in \mathcal{H}$ for all $g \in \mathcal{H}$, we have

$$ \mathcal{C}_{XX}\, \mathbb{E}_{Y|X}[g(Y)\mid X = \cdot\,] = \mathcal{C}_{XY}\, g. $$
To prove this, note that for $f \in \mathcal{H}$, using the definition of the joint embedding (here applied to the auto-covariance operator $\mathcal{C}_{XX}$), we have

$$ \langle f,\, \mathcal{C}_{XX}\,\mathbb{E}_{Y|X}[g(Y)\mid X = \cdot\,]\rangle_{\mathcal{H}} = \mathbb{E}_{X}\big[f(X)\,\mathbb{E}_{Y|X}[g(Y)\mid X]\big] = \mathbb{E}_{XY}[f(X)\,g(Y)], $$
where at the last equality, we used the tower property of expectations. Comparing with the defining property of the cross-covariance operator, $\langle f, \mathcal{C}_{XY}\, g\rangle_{\mathcal{H}} = \mathbb{E}_{XY}[f(X)\,g(Y)]$ for all $f$, we obtain the identity above.
With this, we can show another nice identity. Using the definition of $\mu_{Y|x}$, we have

$$ \langle g, \mu_{Y|x}\rangle_{\mathcal{H}} = \mathbb{E}_{Y|X}[g(Y)\mid X = x]. $$

But, using the result above (and assuming $\mathcal{C}_{XX}$ is invertible), we have

$$ \mathbb{E}_{Y|X}[g(Y)\mid X = x] = \big(\mathcal{C}_{XX}^{-1}\,\mathcal{C}_{XY}\, g\big)(x) = \langle \mathcal{C}_{XX}^{-1}\,\mathcal{C}_{XY}\, g,\, k_x\rangle_{\mathcal{H}} = \langle g,\, \mathcal{C}_{YX}\,\mathcal{C}_{XX}^{-1}\, k_x\rangle_{\mathcal{H}}, $$

where at the last equality we took the adjoint of $\mathcal{C}_{XX}^{-1}\,\mathcal{C}_{XY}$, which allows us to introduce the definition that follows.
The conditional embedding operator is defined as

$$ \mathcal{C}_{Y|X} := \mathcal{C}_{YX}\,\mathcal{C}_{XX}^{-1}, $$

so that $\mu_{Y|x} = \mathcal{C}_{Y|X}\, k_x$.
In practice, $\mathcal{C}_{XX}$ is a compact operator, which means that its eigenvalues go to zero and hence its inverse is not a bounded operator. So the definition of $\mathcal{C}_{Y|X}$ given above is a slight abuse of notation. The inversion of $\mathcal{C}_{XX}$ can be replaced by the regularised inverse $(\mathcal{C}_{XX} + \lambda I)^{-1}$, where $\lambda > 0$ is a positive factor that can be determined by cross-validation.
Finite sample kernel estimator
If we consider an iid draw $\{(x_1, y_1), \dots, (x_m, y_m)\}$ from a joint distribution $P_{XY}$, we know that the empirical estimators for $\mathcal{C}_{YX}$ and $\mathcal{C}_{XX}$ can be written as

$$ \widehat{\mathcal{C}}_{YX} = \frac{1}{m}\,\Phi\Upsilon^{\top}, \qquad \widehat{\mathcal{C}}_{XX} = \frac{1}{m}\,\Upsilon\Upsilon^{\top}, $$
where $\Upsilon$ and $\Phi$ are defined as before (see the empirical covariance operator above). Using a trick from linear algebra for the regularised inverse (coming from an application of Woodbury's formula, namely $\Upsilon^{\top}(\Upsilon\Upsilon^{\top} + \lambda m I)^{-1} = (\Upsilon^{\top}\Upsilon + \lambda m I)^{-1}\Upsilon^{\top}$), we have:

$$ \widehat{\mathcal{C}}_{YX}\,\big(\widehat{\mathcal{C}}_{XX} + \lambda I\big)^{-1} = \frac{1}{m}\,\Phi\Upsilon^{\top}\Big(\frac{1}{m}\,\Upsilon\Upsilon^{\top} + \lambda I\Big)^{-1} = \Phi\,\big(\Upsilon^{\top}\Upsilon + \lambda m\, I\big)^{-1}\Upsilon^{\top}, $$
which gives an estimator for the conditional embedding operator.
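As a quick sanity check, the push-through identity used above is easy to verify numerically in a finite-dimensional feature space (a sketch with a random matrix standing in for the feature matrix $\Upsilon$):

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, lam = 10, 50, 0.1            # feature dimension, sample size, regularisation
Ups = rng.normal(size=(d, m))      # columns play the role of the features phi(x_i)

lhs = Ups.T @ np.linalg.inv(Ups @ Ups.T + lam * m * np.eye(d))
rhs = np.linalg.inv(Ups.T @ Ups + lam * m * np.eye(m)) @ Ups.T
print(np.allclose(lhs, rhs))       # True
```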
A finite sample estimator for the conditional embedding operator is

$$ \widehat{\mathcal{C}}_{Y|X} = \Phi\,\big(K + \lambda m\, I\big)^{-1}\Upsilon^{\top}, $$
where $K := \Upsilon^{\top}\Upsilon$ is the Gram matrix, with entries $K_{ij} = k(x_i, x_j)$.
The regularisation parameter $\lambda$ helps to control for overfitting. The resulting kernel embedding is then

$$ \widehat{\mu}_{Y|x} = \widehat{\mathcal{C}}_{Y|X}\, k_x = \Phi\,\big(K + \lambda m\, I\big)^{-1} K_{:x} = \sum_{i=1}^{m} \beta_i(x)\, \varphi(y_i), $$
where $\beta(x) := (K + \lambda m\, I)^{-1} K_{:x}$ and $K_{:x} := (k(x_1, x), \dots, k(x_m, x))^{\top}$. It is thus a weighted sum of the samples of $Y$ in the feature space, with weights depending on the conditioning variable $x$.
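Putting the pieces together, here is a minimal sketch (assuming a Gaussian RBF kernel on the conditioning variable; the toy data, bandwidth and regularisation are ours) that computes the weights $\beta(x)$ and uses them to estimate a conditional expectation, taking $f(y) = y$ purely as an illustration:

```python
import numpy as np

def rbf_kernel(X, Y, s=1.0):
    # Gaussian RBF kernel matrix: K[i, j] = exp(-||x_i - y_j||^2 / (2 s^2))
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-np.maximum(d2, 0) / (2 * s**2))

rng = np.random.default_rng(4)
m, lam, s = 400, 1e-4, 0.5
X = rng.uniform(-3, 3, size=(m, 1))
Y = np.sin(X) + 0.1 * rng.normal(size=(m, 1))      # Y | X = x is centred around sin(x)

K = rbf_kernel(X, X, s)                            # Gram matrix on the conditioning variable
xs = np.array([[-2.0], [0.0], [2.0]])              # values x at which we condition
beta = np.linalg.solve(K + lam * m * np.eye(m), rbf_kernel(X, xs, s))   # m x 3 matrix of weights beta(x)

# Conditional expectation estimates <f, mu_hat_{Y|x}> = sum_i beta_i(x) f(y_i), here with f(y) = y
# (f(y) = y need not belong to the RKHS, but it illustrates how the weights are used).
print((beta.T @ Y).ravel())     # approximately sin(-2), sin(0), sin(2)
print(np.sin(xs).ravel())
```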
References
Fukumizu, Bach and Jordan, Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces, 2004. A key paper in the RKHS literature.
Gretton, Fukumizu, Hui Teo, Le Song, Schölkopf, Smola, A kernel statistical test of independence, 2007. Paper describing how to perform an independence test using the HSIC.
Gretton, Borgwardt, Rasch, Schölkopf, Smola, A kernel two-sample test, 2012. Paper describing how to perform a two-sample test using the MMD.
Gretton, Sriperumbudur, Sejdinovic, Strathmann, Balakrishnan, Pontil and Fukumizu, Optimal kernel choice for large-scale two-sample tests, 2012. Another paper describing how to perform a two-sample test using the MMD focusing on computational perspectives.
Song, Fukumizu and Gretton, Kernel embeddings of conditional distributions, 2013. The paper these notes are mainly based on.