These notes were prepared with help from Dino Sejdinovic for a talk to a kernel methods reading group given in Oxford in July 2015. The notes are fairly dry, I intend to come back and add more comments on the intuition in the future (in particular for the second part).
A Hilbert space H of functions f:X→R defined on a non-empty set X is said to be a reproducing kernel Hilbert space (RKHS) if the evaluation functionals δx:f↦f(x) are continuous for all x∈X.
If we consider an RKHS then, by the Riesz representation theorem, since δx is a continuous linear functional, it has a representer in H that we can denote kx such that
⟨f,kx⟩H=δx(f)=f(x).
We can then define a (symmetric, positive-definite) function k:X×X→R as k(x,x′):=⟨kx,kx′⟩H. This is known as the reproducing kernel of H; we will also write kx=k(⋅,x).
There is then an important theorem linking the two (which we won't prove, though a detailed proof can be found here), reproduced in simplified form below:
(Moore-Aronszajn) Every symmetric, positive-definite function k is the reproducing kernel of a unique RKHS Hk.
When the kernel is clear from the context, we will simply write H for the RKHS and k for its reproducing kernel.
In essence, we consider spaces of functions that can be characterised by a kernel function. As will be seen, this kernel encapsulates important properties of the functions in the space (e.g., how smooth they are).
A classical way to try to represent points in a space X is to embed them in Rs using an s-dimensional feature map Φ:X→Rs with x↦Φ(x)=(φ1(x),φ2(x),…,φs(x)). Instead, we can now consider embedding points in an RKHS with the infinite-dimensional feature map x↦kx. In that case, we have an easily computable inner product between points with
⟨kx,ky⟩H=⟨k(⋅,x),k(⋅,y)⟩H=k(x,y).
Recall that an inner product measures alignment, so this kernel automatically gives us a measure of similarity between points.
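To make this concrete, here is a minimal numpy sketch (the Gaussian RBF kernel, its bandwidth σ and the two sample points are arbitrary choices for the illustration) computing k(x,y)=⟨kx,ky⟩H as a similarity between embedded points.

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """Gaussian RBF kernel k(x, x') = exp(-sigma * ||x - x'||^2).

    X: (n, d) array, Y: (m, d) array; returns the (n, m) matrix of pairwise kernel values."""
    sq_dists = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sigma * sq_dists)

# Two points of X = R^3; their RKHS embeddings are k_x and k_y, and the
# inner product <k_x, k_y>_H is simply k(x, y) -- no explicit feature map needed.
x = np.array([[0.0, 1.0, 2.0]])
y = np.array([[0.5, 1.0, 1.5]])
print(rbf_kernel(x, y))  # similarity between the two embedded points
```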
When the embedding is injective (i.e., different objects are mapped to different points in the RKHS), the corresponding kernel is said to be characteristic (this is the case for many standard kernels).
Probability distributions are an example of objects we can embed in an RKHS. Each distribution is then considered as a point, which we embed through its mean embedding.
Let P denote a distribution in a set P(Ω) of distributions over some space Ω; the mean embedding is defined as follows:
P↦μ(P;k):=EX∼P[k(⋅,X)]=EX∼P[kX],
where EX∼P denotes the expectation with respect to the random variable X distributed as P. Naturally, μ(P;k)∈H. When the kernel and distribution are clear from the context, we will simply write μ(P;k)=EX[kX]=:μX where, implicitly X∼P.
Observe that we now have an infinite-dimensional embedding instead of the finite-dimensional embedding we could have considered, such as P↦(E[φ1(X)],…,E[φs(X)]). Also, as before, we inherit a notion of similarity between points (here probability distributions) through the inner product on the RKHS: ⟨μ(P;k),μ(Q;k)⟩H=EX,Y[k(X,Y)] with X∼P and Y∼Q independent, and this can easily be estimated if we have samples from both P and Q.
Note finally that μX represents expectations with respect to P, i.e., for any f∈H, ⟨f,μX⟩H=EX[⟨f,kX⟩H]=EX[f(X)].
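As an illustration of this last property, here is a small numpy sketch (the kernel, the samples and the test function f are illustrative choices, with f written as a finite combination f = ∑j αj k(⋅,zj) of kernel functions) checking that the inner product with the empirical mean embedding recovers the empirical average of f.

```python
import numpy as np

def rbf(X, Y, sigma=1.0):
    # Gaussian RBF kernel k(x, x') = exp(-sigma * ||x - x'||^2)
    d2 = (X[:, None, :] - Y[None, :, :]) ** 2
    return np.exp(-sigma * d2.sum(-1))

rng = np.random.default_rng(0)
x = rng.normal(size=(2000, 1))           # samples from P (here a standard normal)

# A function in the RKHS: f(.) = sum_j alpha_j k(., z_j), for arbitrary z_j and alpha_j
z = np.array([[-1.0], [0.0], [2.0]])
alpha = np.array([0.5, -1.0, 0.3])

# <f, mu_X>_H = sum_j alpha_j mu_X(z_j); with the empirical embedding,
# mu_X(z) is estimated by (1/n) sum_i k(z, x_i)
mu_at_z = rbf(z, x).mean(axis=1)
lhs = alpha @ mu_at_z

# E[f(X)] estimated by the empirical average of f over the same samples
rhs = (alpha @ rbf(z, x)).mean()

print(lhs, rhs)   # identical up to floating point: the empirical embedding
                  # represents expectations under the empirical distribution
```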
The generalisation to joint distributions is straightforward using tensor product feature spaces.
Let X,Y be jointly distributed according to some distribution P; the joint embedding of P is defined as P↦CXY(P):=EXY[kX⊗kY], assuming we use the same kernel for both variables.
The tensor product satisfies ⟨kx⊗ky,kx′⊗ky′⟩H⊗H=k(x,x′)k(y,y′).
In the same way that μX represents the expectation operator, the joint-embedding CXY can be viewed as the uncentered cross-covariance operator: for any two functions f,g∈H, their uncentered covariance is given by
EXY[f(X)g(Y)]=⟨f⊗g,CXY⟩H⊗H=⟨f,CXYg⟩H.
Following the same reasoning, we can define the auto-covariance operators CXX and CYY. Note also that, where μX represents expectations with respect to P (the distribution of X), these operators represent cross-covariance/auto-covariance with respect to P.
Remark: we have assumed that both random variables share the same kernel but this need not be the case: we can consider a second kernel k′ with an RKHS H′; the cross-covariance operator then belongs to the product space H⊗H′ (which is also a RKHS).
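To illustrate the identity EXY[f(X)g(Y)]=⟨f,CXYg⟩H numerically, here is a small sketch using an explicit, hypothetical finite-dimensional feature map (so that the cross-covariance operator is just a matrix); the data-generating process and the coefficient vectors defining f and g are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=n)
y = 0.8 * x + 0.2 * rng.normal(size=n)        # X and Y are dependent

# Hypothetical finite-dimensional feature map phi(t) = (1, t, t^2);
# with this map the uncentered cross-covariance operator is a 3 x 3 matrix.
def phi(t):
    return np.stack([np.ones_like(t), t, t ** 2], axis=1)   # shape (n, 3)

Phi_x, Phi_y = phi(x), phi(y)
C_xy = Phi_x.T @ Phi_y / n                     # empirical C_XY = (1/n) sum_i phi(x_i) (x) phi(y_i)

# f(t) = <a, phi(t)> and g(t) = <b, phi(t)> for arbitrary coefficient vectors a, b
a = np.array([0.1, -2.0, 0.5])
b = np.array([1.0, 0.3, -0.7])

lhs = np.mean((Phi_x @ a) * (Phi_y @ b))       # empirical E[f(X) g(Y)]
rhs = a @ C_xy @ b                             # <f, C_XY g>
print(lhs, rhs)                                # equal up to floating point
```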
We have now seen two ways of embedding a distribution into a RKHS, now let's see why that can be useful. When considering a characteristic kernel (such as, for example, the Gaussian RBF with k(x,x′)=exp(−σ∥x−x′∥²)), the RKHS embedding is injective. In that case, we can use the distance in the RKHS (induced by the inner-product) as a proxy for similarity in the distribution space. This can be used in the two-sample test (to test whether two random variables are distributed according to the same distribution) or when testing for independence between random variables.
In the kernel two-sample test (Gretton et al. (2012a)), the test statistic is the squared distance between the embeddings of the two distributions:
The kernel Maximum Mean Discrepancy (MMD) is defined for two distributions P and Q by MMD(P,Q):=∥μX−μY∥H, where X∼P and Y∼Q; the test statistic is its square, MMD²(P,Q)=∥μX−μY∥H².
It is then possible to study the asymptotic properties of both the MMD and the MMD² and build hypothesis tests at a given level, independently of the distributions considered (see Gretton et al. (2012a), corollaries 9 and 11).
When testing for independence, the test statistic can be the squared distance between the embeddings of the joint distribution and the product of its marginals, which leads to the kernel independence test (Gretton et al. (2007)):
The Hilbert-Schmidt Independence Criterion (HSIC) is defined for two jointly distributed random variables X and Y by HSIC(X,Y):=∥CXY−μX⊗μY∥H⊗H².
Again, it is then possible to study the asymptotic properties of the HSIC and build a hypothesis test at a given level, independently of the distributions considered (cf. Gretton et al. (2007)).
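Jumping ahead slightly to the empirical estimators discussed below, here is a minimal sketch of a commonly used (biased) empirical HSIC estimator, tr(KHLH)/(n−1)², where K and L are the Gram matrices of the two samples and H is the centering matrix; the kernel, its bandwidth and the toy data are illustrative choices.

```python
import numpy as np

def rbf(X, Y, sigma=1.0):
    # Gaussian RBF kernel k(x, x') = exp(-sigma * ||x - x'||^2)
    d2 = (X[:, None, :] - Y[None, :, :]) ** 2
    return np.exp(-sigma * d2.sum(-1))

def hsic_biased(x, y, sigma=1.0):
    """Biased empirical HSIC, estimating ||C_XY - mu_X (x) mu_Y||^2 as tr(KHLH)/(n-1)^2."""
    n = x.shape[0]
    K, L = rbf(x, x, sigma), rbf(y, y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=(n, 1))
y_dep = x + 0.3 * rng.normal(size=(n, 1))      # dependent on x
y_ind = rng.normal(size=(n, 1))                # independent of x
print(hsic_biased(x, y_dep), hsic_biased(x, y_ind))   # first value is much larger
```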
All of the embeddings defined above can readily be estimated using samples drawn from the distributions of interest. Let {x1,…,xn} be an i.i.d. draw from P; the empirical kernel mean embedding is defined as
μ̂X := n−1 ∑_{i=1}^{n} kxi.
As for standard Monte Carlo estimators, the rate of convergence is O(1/√n) (and hence does not depend upon the dimensionality of the underlying space). Similarly, for an i.i.d. draw of pairs {(x1,y1),…,(xn,yn)}, the empirical covariance operator is defined as
ĈXY := n−1 ∑_{i=1}^{n} kxi ⊗ kyi = n−1ΥΦt, where Υ := (kx1,…,kxn) and Φ := (ky1,…,kyn) are the feature matrices.
To conclude, it is straightforward to obtain empirical estimators for the MMD and HSIC criteria by considering the kernel elements k(xi,xj), k(yi,yj) and k(xi,yj). In the case of the MMD for instance, one has (for the biased estimator, with n samples from each distribution):
MMD²(P,Q) ≈ n−2 ∑_{i,j} k(xi,xj) + n−2 ∑_{i,j} k(yi,yj) − 2n−2 ∑_{i,j} k(xi,yj).
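A minimal sketch of this (biased) empirical MMD², written directly in terms of the kernel matrices above; the kernel, bandwidth and toy distributions are illustrative choices.

```python
import numpy as np

def rbf(X, Y, sigma=1.0):
    # Gaussian RBF kernel k(x, x') = exp(-sigma * ||x - x'||^2)
    d2 = (X[:, None, :] - Y[None, :, :]) ** 2
    return np.exp(-sigma * d2.sum(-1))

def mmd2_biased(x, y, sigma=1.0):
    """Biased empirical MMD^2: mean_ij k(x_i, x_j) + mean_ij k(y_i, y_j) - 2 mean_ij k(x_i, y_j)."""
    return rbf(x, x, sigma).mean() + rbf(y, y, sigma).mean() - 2 * rbf(x, y, sigma).mean()

rng = np.random.default_rng(3)
p = rng.normal(0.0, 1.0, size=(500, 1))        # samples from P
q = rng.normal(0.5, 1.0, size=(500, 1))        # samples from Q (shifted mean)
p2 = rng.normal(0.0, 1.0, size=(500, 1))       # a second, independent sample from P
print(mmd2_biased(p, q), mmd2_biased(p, p2))   # first value is noticeably larger
```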
We can also embed conditional distributions P(Y∣X=x). To do so, we must first introduce a result for which we will provide a partial proof (the full proof can be found in Fukumizu et al. (2004)).
The following identity holds (under mild technical conditions):
CXXEY∣X[g(Y)]=CXYg.
To prove this, note that for f∈H, using the definition of the joint embedding, we have
⟨f,CXXEY∣X[g(Y)]⟩H=EX[f(X)EY∣X[g(Y)]]=EX[EY∣X[f(X)g(Y)]]=EXY[f(X)g(Y)], where at the last equality we used the tower property of expectations. Comparing with the identity EXY[f(X)g(Y)]=⟨f,CXYg⟩H above, and noting that this holds for all f∈H, we obtain the stated result.
With this, we can show another nice identity. Writing μY∣x := EY∣X=x[kY] for the embedding of the conditional distribution of Y given X=x, we have ⟨g,μY∣x⟩H=EY∣X=x[g(Y)]=⟨EY∣X[g(Y)],kx⟩H. But, using the identity above, we have ⟨EY∣X[g(Y)],kx⟩H=⟨CXX−1CXYg,kx⟩H=⟨g,CYXCXX−1kx⟩H, where at the last equality we took the adjoint of CXX−1CXY. In other words, μY∣x=CYXCXX−1kx, which allows us to introduce the definition that follows.
The conditional embedding operator is defined as
CY∣X:=CYXCXX−1.
In practice, CXX is a compact operator, which means that its eigenvalues go to zero and hence its inverse is not a bounded operator, so the definition of CY∣X given above is a slight abuse of notation. The inversion of CXX can be replaced by the regularised inverse (CXX+λI)−1, where λ is a positive regularisation parameter that can be determined by cross-validation.
If we consider an i.i.d. draw {(xi,yi)}_{i=1}^{n} from a joint distribution P, we know that the empirical estimators for CYX and CXX can be written as
ĈYX = n−1ΦΥt and ĈXX = n−1ΥΥt,
where Φ and Υ are the feature matrices defined as before. Using a trick from linear algebra for the regularised inverse (coming from an application of Woodbury's formula), we have:
ĈY∣X = n−1ΦΥt(λI + n−1ΥΥt)−1 = Φ(ΥtΥ + nλI)−1Υt,
which gives an estimator for the conditional embedding operator.
A finite sample estimator for the conditional embedding operator is
ĈY∣X = Φ[K+λI]−1Υt
where K := ΥtΥ is the Gram matrix and the factor n has been absorbed into the regularisation parameter λ.
The regularisation parameter λ helps to control for overfitting. The resulting kernel embedding is then
μ̂Y∣x = Φβ(x)
where β(x) := (K+λI)−1K:x and K:x := [k(x,xi)]_{i=1}^{n}. It is thus a weighted sum of the samples of Y in the feature space, with weights depending on the conditioning variable x.
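Putting the pieces together, here is a small sketch of the empirical conditional embedding: the weights β(x) = (K+λI)−1K:x and the resulting estimate of a conditional expectation, ⟨g,μY∣x⟩ ≈ ∑i βi(x)g(yi). The kernel, bandwidth, regularisation λ, data-generating process and the choice g(y)=y (treated as if it were well approximated in the RKHS) are all illustrative assumptions.

```python
import numpy as np

def rbf(X, Y, sigma=1.0):
    # Gaussian RBF kernel k(x, x') = exp(-sigma * ||x - x'||^2)
    d2 = (X[:, None, :] - Y[None, :, :]) ** 2
    return np.exp(-sigma * d2.sum(-1))

rng = np.random.default_rng(4)
n, lam, sigma = 500, 1e-1, 1.0
x = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(x) + 0.1 * rng.normal(size=(n, 1))     # Y | X = x concentrated around sin(x)

K = rbf(x, x, sigma)                              # Gram matrix of the conditioning variable

def beta(x_new):
    """Conditional embedding weights beta(x) = (K + lambda I)^{-1} K_{:x}."""
    K_x = rbf(x, x_new.reshape(1, -1), sigma)     # column [k(x_new, x_i)]_i
    return np.linalg.solve(K + lam * np.eye(n), K_x).ravel()

# Estimate E[g(Y) | X = x_new] as sum_i beta_i(x_new) g(y_i), here with g(y) = y.
x_new = np.array([1.0])
print(beta(x_new) @ y.ravel(), np.sin(1.0))       # should be close for reasonable lam, sigma
```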
Gretton, Fukumizu, Teo, Song, Schölkopf and Smola, A kernel statistical test of independence, 2007. Paper describing how to perform an independence test using the HSIC.
Gretton, Borgwardt, Rasch, Schölkopf and Smola, A kernel two-sample test, 2012a. Paper describing how to perform a two-sample test using the MMD.
Gretton, Sriperumbudur, Sejdinovic, Strathmann, Balakrishnan, Pontil and Fukumizu, Optimal kernel choice for large-scale two-sample tests, 2012b. Another paper on two-sample testing with the MMD, with a focus on computational aspects.