CV Ridge (LOO)

Brief recap

Ridge regression is a simple penalised regression model corresponding to an L2 loss with an L2 penalty on the coefficients:

$$ \beta_{\text{ridge}} \;=\; \arg\min_\beta\; \|X\beta - y\|_2^2 + \lambda \|\beta\|_2^2, \tag{1} $$

where $X$ is $n\times p$ and $\lambda > 0$ is the penalty (hyper)parameter. We'll drop the subscript "ridge" as it's now obvious from the context.

The minimiser satisfies

$$ (X^tX + \lambda I)\,\beta \;=\; X^ty. \tag{2} $$

Complexity of solving Ridge

Tall case ($n > p$)

The computational cost of solving (2) directly is $O(np^2 + p^3)$: $O(np^2)$ to form $X^tX$ and $O(p^3)$ to solve the resulting $p\times p$ system.

In the case where $n > p$ (tall case), this is asymptotically dominated by the first term: $O(np^2)$.

Fat case ($p > n$)

In the case where $p > n$ (fat case), it is more efficient to solve a derived system obtained by pre-multiplying (2) by $X$:

$$ (XX^t + \lambda I)\,X\beta \;=\; XX^ty \tag{3} $$

indeed, in that case $XX^t$ is $n\times n$, which is smaller than $p\times p$. It is useful to show a way to solve this explicitly: consider the SVD $X = U\Sigma V^t$; then $XX^t = U\Sigma^2U^t$ and equation (3) becomes

$$ U(\Sigma^2 + \lambda I)\,\Sigma V^t\beta \;=\; U\Sigma^2U^ty \tag{4} $$

taking advantage of the fact that $U$ is orthogonal so that $U^tU = I$. Noting also that the ridge solution lies in the row space of $X$, so that $VV^t\beta = \beta$ (indeed, from (2), $\lambda\beta = X^t(y - X\beta)$), this can be further massaged into

$$
\begin{aligned}
\beta &= V\Sigma(\Sigma^2 + \lambda I)^{-1}U^ty \\
      &= X^tU(\Sigma^2 + \lambda I)^{-1}U^ty,
\end{aligned}\tag{5}
$$

using that $X = U\Sigma V^t$ so that $V = X^tU\Sigma^{-1}$. Overall, the complexity is asymptotically dominated by the construction of $XX^t$, which is $O(pn^2)$, followed by the computation of its SVD, which is $O(n^3)$.

Conclusion

In both cases, the cost of a single Ridge solve is therefore $O(\max(n,p)\,\min(n,p)^2)$: form whichever of $X^tX$ or $XX^t$ is smaller, then solve (or decompose) a system of that size.

Recycling computations when changing $\lambda$

When tuning the hyperparameter $\lambda$, we will potentially want to compute $\beta_\lambda$ for a number of different $\lambda$; in Ridge regression this can be done efficiently, so that trying out different $\lambda$ is cheap after the first solution has been computed.

We will again consider the tall and fat cases separately, and will assume that it is reasonable to form either $X^tX$ or $XX^t$; in each case we will work with its SVD.

Tall case

Let $X = U\Sigma V^t$, so that $X^tX = V\Sigma^2V^t$; then (2) can be written

$$ V(\Sigma^2 + \lambda I)V^t\beta \;=\; X^ty \tag{6} $$

and consequently $\beta = V(\Sigma^2 + \lambda I)^{-1}V^tX^ty$.

Once the SVD has been formed, we are left, for any $\lambda$, with computing $VD_\lambda w$, where $z = X^ty$ and $w = V^tz$ can be pre-computed once in $O(np)$ and $O(p^2)$ respectively, $D_\lambda = (\Sigma^2 + \lambda I)^{-1}$ is a diagonal matrix which is computed and applied in $O(p)$, and the application of $V$ is $O(p^2)$.

Overall:

| Operation | Cost |
| --- | --- |
| initial computation of $X^tX$ | $O(np^2)$ |
| initial SVD of $X^tX$ | $O(p^3)$ |
| pre-computation of $z = X^ty$ and $w = V^tz$ | $O(np)$ |
| computation of $VD_\lambda w$ | $O(p^2)$ per $\lambda$ |

in other words, computing the solution for $O(n)$ different $\lambda$ is asymptotically only twice as expensive as computing the solution for a single one.
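
We can sanity-check this recipe with a short sketch (the dimensions and the names `X_tall`, `ridge_from_svd` are just illustrative):

```julia
using LinearAlgebra

X_tall = randn(500, 50)
y_tall = randn(500)

# one-off work, independent of λ
G = X_tall'X_tall                  # p×p, O(np²)
z = X_tall'y_tall                  # z = Xᵗy, O(np)
F = svd(Symmetric(G))              # G = V Σ² Vᵗ
V, Σ² = F.U, F.S                   # U == V for a symmetric PSD matrix
w = V'z                            # w = Vᵗz, O(p²)

# per-λ work: a diagonal solve plus one multiplication by V, O(p²)
ridge_from_svd(λ) = V * (w ./ (Σ² .+ λ))

for λ in (0.1, 1.0, 10.0)
    @assert ridge_from_svd(λ) ≈ (G + λ*I) \ z   # matches the direct solve of (2)
end
```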

Fat case

In the fat case, we already have equation (5): $\beta = X^tU(\Sigma^2 + \lambda I)^{-1}U^ty$, and an analogous argument applies:

| Operation | Dominating cost |
| --- | --- |
| initial computation of $XX^t$ | $O(pn^2)$ |
| initial SVD of $XX^t$ | $O(n^3)$ |
| initial computation of $w = U^ty$ | $O(n^2)$ |
| computation of $X^tUD_\lambda w$ | $O(np)$ per $\lambda$ |

in other words, computing the solution for $O(n)$ different $\lambda$ is asymptotically only twice as expensive as computing the solution for a single one.


Simple code

We can easily check all this with code in Julia; for instance, let's consider the fat case:

```julia
using LinearAlgebra

basic_ridge_solve(X, y, λ=1) = (X'X + λ*I) \ X'y

X_fat = randn(50, 500)
y_fat = randn(50)

λ = 1

# slow solve requiring the solution of a 500x500 system
β_fat = basic_ridge_solve(X_fat, y_fat, λ);
```

Now let's check that we can recover it the cheaper way (via the SVD of $XX^t$):

```julia
# SVD of a 50x50 matrix
F = svd(Symmetric(X_fat * X_fat'))
U, S = F.U, F.S
w_fat = U'y_fat ./ (S .+ λ)
β_fat ≈ X_fat' * (U * w_fat)   # true
```

Now let's change $\lambda$:

```julia
λ′ = 3

# naive route
β_fat′ = basic_ridge_solve(X_fat, y_fat, λ′)

# efficient route
w_fat′ = U'y_fat ./ (S .+ λ′)
β_fat′ ≈ X_fat' * (U * w_fat′)   # true
```

LOOCV trick

In Leave-One-Out CV, for a given $\lambda$, we want to fit the model $n$ times, each time using only $n-1$ of the data points, and to report the prediction error on the point that was left out.

Let $X_{(i)}$ (resp. $y_{(i)}$) be the matrix (resp. vector) with the $i$th row (resp. entry) removed, and let $\beta_{(i)}$ be the Ridge coefficients in that case. We are interested in the prediction error on the dropped point:

$$ e_i \;=\; x_i^t\beta_{(i)} - y_i \tag{7} $$

We know that, for a fixed $i$, we can compute that error efficiently for many $\lambda$, but it does require one initial SVD. That's a bit annoying because we would have to do it for every $i$, meaning a lot of SVDs to compute; this seems silly, surely we can re-use some computations!

It's well known that we can indeed speed things up. See for instance Rifkin and Lippert (2007) whose formula is used in Sklearn's RidgeCV model, or Golub, Heath and Wahba (1978) for a statistical perspective. The presentation below is different but the gist in terms of the complexity gain is the same.

Tall case

Let us write $X_{(i)} = \Omega_{(i)}X$, where $\Omega_{(i)}$ is the identity matrix with the $i$th row removed. We then also have $y_{(i)} = \Omega_{(i)}y$. It is not hard to show that $\Omega_{(i)}^t\Omega_{(i)} = (I - D_{(i)})$, where $D_{(i)}$ is an all-zero matrix with a 1 on the $i$th diagonal element.
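
A quick numerical check of that identity, with arbitrary `n` and `i`:

```julia
n, i = 6, 3
Ω = Matrix{Float64}(I, n, n)[setdiff(1:n, i), :]   # identity with the i-th row removed
D = zeros(n, n); D[i, i] = 1                       # all zeros except a 1 at (i, i)

Ω'Ω ≈ I - D   # true
```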

The $i$th Ridge solution satisfies $(X_{(i)}^tX_{(i)} + \lambda I)\beta_{(i)} = X_{(i)}^ty_{(i)}$. Using the notation introduced in the previous paragraph, we can rewrite this as

$$ \big((X^tX + \lambda I) - X^tD_{(i)}X\big)\,\beta_{(i)} \;=\; X^t(I - D_{(i)})y \tag{8} $$

If we write $H = (X^tX + \lambda I)$, then we know that it can be inverted efficiently once the SVD of $X^tX$ has been computed. Note also that $X^tD_{(i)}X = x_ix_i^t$, where $x_i$ is the $i$th row of $X$ seen as a column vector.

On the left-hand side of (8) we have therefore a rank-1 perturbation of an invertible matrix. The inverse of such a matrix can be readily expressed using the Sherman-Morrison formula (see the page on matrix inversion lemmas for details):

$$ (H - x_ix_i^t)^{-1} \;=\; H^{-1} + \frac{H^{-1}x_ix_i^tH^{-1}}{1 - x_i^tH^{-1}x_i}. \tag{9} $$
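
As a quick aside, this rank-1 update formula is easy to verify numerically; here is a minimal sketch with an arbitrary symmetric positive-definite matrix standing in for $H$ (the formula requires $1 - x^tH^{-1}x \neq 0$, which holds in our setting since $\gamma = x_i^tH^{-1}x_i < 1$):

```julia
p = 4
A = randn(p, p)
H = A'A + 2I          # an arbitrary SPD matrix standing in for XᵗX + λI
x = randn(p)

Hinv = inv(H)
inv(H - x*x') ≈ Hinv + (Hinv * x * x' * Hinv) / (1 - dot(x, Hinv * x))   # true
```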

Plugging this into the left-hand side of (8), we can get an explicit expression for $\beta_{(i)}$. But remember that what we really want is $e_i$, so we are more interested in $x_i^t\beta_{(i)}$:

$$ x_i^t\beta_{(i)} \;=\; \left(x_i^tH^{-1} + \frac{x_i^tH^{-1}x_i\, x_i^tH^{-1}}{1 - x_i^tH^{-1}x_i}\right)X^t(I - D_{(i)})y. \tag{10} $$

As it turns out, this can be simplified a fair bit. Note that the second factor can be rewritten as $X^ty - X^tD_{(i)}y$, and that last term is simply $x_iy_i$; let also $z = X^ty$, which can be pre-computed, and $\gamma = x_i^tH^{-1}x_i$. The prefactor then collapses to $x_i^tH^{-1}\left(1 + \frac{\gamma}{1-\gamma}\right) = \frac{x_i^tH^{-1}}{1-\gamma}$ and, massaging a bit, we get

$$ x_i^t\beta_{(i)} \;=\; \frac{x_i^tH^{-1}z - \gamma y_i}{1 - \gamma}, \tag{11} $$

and, correspondingly, $e_i = (x_i^tH^{-1}z - y_i)/(1 - \gamma)$.

To conclude, with $H^{-1} = V(\Sigma^2 + \lambda I)^{-1}V^t$, $w = V^tz$, $g_i = V^tx_i$ and $\tilde{g}_i = (\Sigma^2 + \lambda I)^{-1}g_i$, we can rewrite the computation of $e_i$ as

$$ e_i \;=\; \frac{\tilde{g}_i^t w - y_i}{1 - \tilde{g}_i^t g_i} \tag{12} $$

and the cost of computing all the LOO errors is:

| Operation | Dominating complexity |
| --- | --- |
| form $X^tX$ and compute its SVD | $O(np^2)$ |
| pre-compute $z = X^ty$ and $w = V^tz$ | $O(pn)$ |
| ($\forall i$) compute $g_i = V^tx_i$ | $O(p^2)$ |
| ($\forall i,\lambda$) compute $\tilde{g}_i = (\Sigma^2 + \lambda I)^{-1}g_i$ | $O(p)$ |
| ($\forall i,\lambda$) compute $e_i = (\tilde{g}_i^tw - y_i)/(1 - \tilde{g}_i^tg_i)$ | $O(p)$ |

So overall, the initial setup cost is dominated by $O(np^2)$; the subsequent cost of computing all the $g_i$ is also $O(np^2)$; finally, there is a cost $O(np\kappa)$, where $\kappa$ is the number of $\lambda$ tested.

In other words, modulo constant factors, it costs the same to get all the LOO errors for $O(p)$ different $\lambda$ as to get a single Ridge solution!
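
To see (12) in action in the tall case we can extend the earlier sketch; this reuses `X_tall`, `y_tall`, `V`, `Σ²` and `w` from the recycling sketch above, as well as `basic_ridge_solve` from the code section, and the index and $\lambda$ are arbitrary:

```julia
using InvertedIndices

i, λ = 3, 2.0
xᵢ, yᵢ = X_tall[i, :], y_tall[i]

# reference: naive fit on the n-1 remaining points, error on the left-out one
βᵢ = basic_ridge_solve(X_tall[Not(i), :], y_tall[Not(i)], λ)
eᵢ = dot(xᵢ, βᵢ) - yᵢ

# LOO trick: only cheap per-(i, λ) work once the SVD quantities are in place
gᵢ = V'xᵢ
g̃ᵢ = gᵢ ./ (Σ² .+ λ)

eᵢ ≈ (dot(g̃ᵢ, w) - yᵢ) / (1 - dot(g̃ᵢ, gᵢ))   # true
```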

Fat case

When $p > n$, we already know that it is beneficial to consider the SVD of $XX^t = U\Sigma^2U^t$ instead of that of $X^tX$. Here it turns out that taking (12) and computing $g_i$, $\tilde{g}_i$ and $w$ in terms of $U$ and $X$ instead of $V$ is all we need to do; for this, recall that $X = U\Sigma V^t$ so that $V^t = \Sigma^{-1}U^tX$:

| Operation | Dominating complexity |
| --- | --- |
| form $K = XX^t$ and compute its SVD | $O(pn^2)$ |
| pre-compute $w = \Sigma^{-1}U^tKy$ | $O(n^2)$ |
| ($\forall i$) compute $g_i = \Sigma^{-1}U^tXx_i$ | $O(np)$ |
| ($\forall i,\lambda$) compute $\tilde{g}_i = (\Sigma^2 + \lambda I)^{-1}g_i$ | $O(n)$ |
| ($\forall i,\lambda$) compute $e_i = (\tilde{g}_i^tw - y_i)/(1 - \tilde{g}_i^tg_i)$ | $O(n)$ |

So overall, the initial setup cost is dominated by $O(pn^2)$; the subsequent cost of computing all the $g_i$ is also $O(pn^2)$; finally, there is a cost $O(n^2\kappa)$, where $\kappa$ is the number of $\lambda$ tested.

Again, modulo constant factors, it costs the same to get all the LOO errors for $O(p)$ different $\lambda$ as to get a single Ridge solution.

Simple code

We can easily check all this with code again; let's extend the previous example:

```julia
using InvertedIndices

i = 5   # an arbitrary index between 1 and n
λ = 2.5

X̃ᵢ = X_fat[Not(i), :]
ỹᵢ = y_fat[Not(i)]
β̃ᵢ = basic_ridge_solve(X̃ᵢ, ỹᵢ, λ)

xᵢ = X_fat[i, :]
yᵢ = y_fat[i]
rᵢ = dot(xᵢ, β̃ᵢ) - yᵢ;   # this is the i-th LOO residual
```

Now let's show we can recover this using the derivation above:

```julia
K = Symmetric(X_fat * X_fat')
F = svd(K)
U, Ut, S = F.V, F.Vt, F.S   # for a symmetric PSD matrix, U == V
Σ² = S
Σ  = sqrt.(S)

w  = (Ut * (K * y_fat)) ./ Σ     # w = Σ⁻¹Uᵗ K y
gᵢ = (Ut * (X_fat * xᵢ)) ./ Σ    # gᵢ = Σ⁻¹Uᵗ X xᵢ
g̃ᵢ = gᵢ ./ (Σ² .+ λ)

rᵢ ≈ (dot(g̃ᵢ, w) - yᵢ) / (1 - dot(g̃ᵢ, gᵢ))   # true
```

Now if we change $i$ and $\lambda$,

```julia
i = 7
λ = 5.5

X̃ᵢ = X_fat[Not(i), :]
ỹᵢ = y_fat[Not(i)]
β̃ᵢ = basic_ridge_solve(X̃ᵢ, ỹᵢ, λ)

xᵢ = X_fat[i, :]
yᵢ = y_fat[i]
rᵢ = dot(xᵢ, β̃ᵢ) - yᵢ;
```

we only have to recompute gᵢ and g̃ᵢ:

```julia
gᵢ = (Ut * (X_fat * xᵢ)) ./ Σ
g̃ᵢ = gᵢ ./ (Σ² .+ λ)

rᵢ ≈ (dot(g̃ᵢ, w) - yᵢ) / (1 - dot(g̃ᵢ, gᵢ))   # true
```

Conclusion

In short, provided that computing the SVD of either a $p\times p$ or an $n\times n$ matrix is manageable, computing a whole set of Ridge solutions, or all the LOO errors, for a number of different $\lambda$ is not much more expensive than computing a single solution.

This LOO trick can be generalised to an extent to all leave-some-out schemes; this is covered in the second part.

References

  1. Rifkin and Lippert, *Notes on Regularized Least Squares*, 2007. – A detailed development of a way to get the LOO error in (Kernel) Ridge efficiently; note that their derivation is not the same as the one presented here but the end result is comparable. Their formula is the one implemented in Sklearn's RidgeCV.

  2. Golub, Heath and Wahba, *Generalized Cross-Validation as a Method for Choosing a Good Ridge Parameter*, 1978. – Introduction of the GCV criterion, appropriate for selecting $\lambda$ in the tall case.