CV Ridge (LOO)
Brief recap
Ridge regression is a simple penalised regression model corresponding to an L2 loss with an L2 penalty on the coefficients:

$$ \beta_{\text{ridge}} \;=\; \arg\min_\beta\; \|X\beta - y\|_2^2 + \lambda \|\beta\|_2^2, \tag{1} $$

where $X$ is $n\times p$ and $\lambda > 0$ is the penalty (hyper)parameter. We'll drop the subscript "ridge" as it's now obvious from the context.
The minimum verifies

$$ (X^\top X + \lambda I)\,\beta \;=\; X^\top y. \tag{2} $$
Complexity of solving Ridge
Tall case ($n \gg p$)
The computational cost of solving (2) directly is $O(np^2 + p^3)$:
- building $X^\top X$ is $O(np^2)$,
- solving the resulting $p\times p$ system is $O(p^3)$.
In the case where $n \gg p$ (tall case), this is asymptotically dominated by the first term: $O(np^2)$.
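For concreteness, here is a minimal sketch of the direct route (the helper name and the augmented least-squares sanity check are ours, not from the text):

```julia
using LinearAlgebra

# direct route: form X'X in O(np²) and solve the p×p system in O(p³)
ridge_direct(X, y, λ) = (X' * X + λ*I) \ (X' * y)

X_tall = randn(500, 50)   # n ≫ p
y_tall = randn(500)
λ = 0.5
β = ridge_direct(X_tall, y_tall, λ)

# sanity check against the equivalent augmented least-squares problem
# min ‖[X; √λ I]β - [y; 0]‖²
p = size(X_tall, 2)
β ≈ [X_tall; Matrix(sqrt(λ) * I, p, p)] \ [y_tall; zeros(p)]   # true
```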
Fat case ()
In the case where (fat case), it is more efficient to solve a derived system obtained by pre-multiplying (2) by :
indeed, in that case is which is then smaller than . It is useful to show a way to solve this explicitly: consider the SVD , then and the equation (3) becomes
taking advantage from the fact that is orthogonal so that (same with ). This can be further massaged into
using that so that . Overall, the complexity is asymptotically dominated by the construction of which is followed by the complexity of computing its SVD which is .
Conclusion
- tall case ($n \gg p$): $O(np^2 + p^3)$, dominated by the construction of $X^\top X$,
- fat case ($p \gg n$): $O(n^2 p + n^3)$, dominated by the construction of $XX^\top$.
Recycling computations when changing $\lambda$
When tuning the hyperparameter $\lambda$, we will potentially want to compute $\beta_\lambda$ for a number of different $\lambda$; in Ridge regression we can ensure that this is done efficiently, so that trying out different values of $\lambda$ is cheap after having computed the first solution.
We will again consider the tall and fat cases separately, and will assume that it is reasonable to form either $X^\top X$ or $XX^\top$; in each case we will consider its SVD.
Tall case
Let $X^\top X = V\Sigma^2 V^\top$ (its SVD) and consequently $X^\top X + \lambda I = V(\Sigma^2 + \lambda I)V^\top$; then (2) can be written

$$ V(\Sigma^2 + \lambda I)V^\top \beta \;=\; X^\top y, \tag{6} $$

and consequently $\beta_\lambda = V(\Sigma^2 + \lambda I)^{-1} V^\top X^\top y$.
Once the SVD has been formed, we are left, for any $\lambda$, with computing $\beta_\lambda = V(\Sigma^2 + \lambda I)^{-1} b$ where $b = V^\top X^\top y$ can be computed once in $O(np + p^2)$, $(\Sigma^2 + \lambda I)^{-1}$ is a diagonal matrix which is computed and applied in $O(p)$, and the application of $V$ is $O(p^2)$.
Overall:
| Operation | Cost |
| --- | --- |
| initial computation of $X^\top X$ | $O(np^2)$ |
| initial SVD of $X^\top X$ | $O(p^3)$ |
| computation of $\beta_\lambda$ | $O(p^2)$ per $\lambda$ |

in other words, computing the solution for $k$ different $\lambda$ is only about twice as expensive asymptotically as computing the solution for a single one.
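A quick sketch of this recycling in the tall case (the variable names and the $\lambda$ grid below are ours, not from the text); the fat case is demonstrated in the code section further down:

```julia
using LinearAlgebra

X_tall = randn(500, 50)   # n ≫ p
y_tall = randn(500)

# one-off work, independent of λ
F = svd(Symmetric(X_tall' * X_tall))   # X'X = VΣ²Vᵀ, O(np² + p³)
V, S = F.U, F.S                        # S holds the eigenvalues Σ²
b = V' * (X_tall' * y_tall)            # b = VᵀXᵀy, O(np + p²)

# per-λ work: a diagonal scaling plus one O(p²) matrix-vector product
ridge_tall_svd(λ) = V * (b ./ (S .+ λ))

for λk in (0.1, 1.0, 10.0)
    β_direct = (X_tall' * X_tall + λk*I) \ (X_tall' * y_tall)
    @assert ridge_tall_svd(λk) ≈ β_direct
end
```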
Fat case
In the fat case, we already have the equation (5): $\beta_\lambda = X^\top U(\Sigma^2 + \lambda I)^{-1} U^\top y$, and we get an analogous argument:

| Operation | Dominating Cost |
| --- | --- |
| initial computation of $XX^\top$ | $O(n^2 p)$ |
| initial SVD of $XX^\top$ | $O(n^3)$ |
| initial computation of $U^\top y$ | $O(n^2)$ |
| computation of $\beta_\lambda$ | $O(np)$ per $\lambda$ |

in other words, computing the solution for $k$ different $\lambda$ is, again, only about twice as expensive asymptotically as computing the solution for a single one.
Notes:
- Observe that, for both the tall and the fat case, paying around twice the cost of the initial computation allows us to compute the solutions for many different $\lambda$.
- Thus far we have not considered the computation of the intercept, but it's not hard to do (a quick sketch follows) and it does not change the complexity analysis.
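For completeness, here is one common way to handle the intercept, as a sketch only; the centring approach and the helper name are our assumptions, not spelled out in the text. The idea is to fit the penalised coefficients on centred data and recover the (unpenalised) intercept from the means:

```julia
using LinearAlgebra, Statistics

# assumed approach: centre X and y, fit Ridge on the centred data,
# then recover the intercept from the column means
function ridge_with_intercept(X, y, λ)
    μx = mean(X, dims=1)
    μy = mean(y)
    Xc = X .- μx
    yc = y .- μy
    β  = (Xc' * Xc + λ*I) \ (Xc' * yc)
    β0 = μy - dot(vec(μx), β)    # intercept
    return β0, β
end
```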
Simple code
We can easily check all this with code in Julia; for instance, let's consider the fat case:
using LinearAlgebra
basic_ridge_solve(X, y, λ=1) = (X'X + λ*I) \ X'y
X_fat = randn(50, 500)
y_fat = randn(50)
λ = 1
# slow solve requiring the solution of a 500x500 system
β_fat = basic_ridge_solve(X_fat, y_fat, λ);
Now let's check that we can recover it the cheaper way (via the SVD of $XX^\top$):
# SVD of a 50x50 system
F = svd(Symmetric(X_fat * X_fat'))
U, S = F.U, F.S
w_fat = U'y_fat ./ (S .+ λ)
β_fat ≈ X_fat' * (U * w_fat)
true
Now let's change $\lambda$:
λ′ = 3
# naive route
β_fat′ = basic_ridge_solve(X_fat, y_fat, λ′)
# efficient route
w_fat′ = U'y_fat ./ (S .+ λ′)
β_fat′ ≈ X_fat' * (U * w_fat′)
true
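To make the recycling explicit, here is a small extension of the block above (the $\lambda$ grid is arbitrary); it reuses `U` and `S` to get the solution for several $\lambda$, checking each against the naive solve:

```julia
# reuse the one-off SVD for a whole grid of λ values; each solution
# then costs O(np) instead of a fresh O(n²p) solve
for λk in (0.1, 1.0, 10.0)
    βk = X_fat' * (U * (U'y_fat ./ (S .+ λk)))
    @assert βk ≈ basic_ridge_solve(X_fat, y_fat, λk)
end
```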
LOOCV trick
In Leave-One-Out CV, for a given $\lambda$ we want to train the model $n$ times, each time using only $n-1$ of the data points, and report the prediction error on the held-out point.
Let $\tilde X_i$ (resp. $\tilde y_i$) be the matrix (resp. vector) with the $i$th row removed, and let $\tilde\beta_i$ be the Ridge coefficients in that case. We are interested in the prediction error on the dropped point:

$$ r_i(\lambda) \;=\; x_i^\top \tilde\beta_i(\lambda) - y_i, \tag{7} $$

where $x_i^\top$ is the $i$th row of $X$ and $y_i$ the corresponding response.
We know that, for a fixed $i$, we can compute that error efficiently for many $\lambda$, but it does require one initial SVD. That's a bit annoying because we would have to do that for every $i$, meaning a lot of SVDs to compute; this seems silly, surely we can re-use some computations!
It's well known that we can indeed speed things up. See for instance Rifkin and Lippert (2007) whose formula is used in Sklearn's RidgeCV model, or Golub, Heath and Wahba (1978) for a statistical perspective. The presentation below is different but the gist in terms of the complexity gain is the same.
Tall case
Let us write $\tilde X_i = S_i X$ where $S_i$ is the $(n-1)\times n$ identity matrix with the $i$th row removed. We then also have $\tilde y_i = S_i y$. It is not hard to show that $S_i^\top S_i = I - E_{ii}$ where $E_{ii}$ is an all-zero $n\times n$ matrix with a 1 at the $(i,i)$ entry, so that $\tilde X_i^\top \tilde X_i = X^\top X - x_i x_i^\top$ and $\tilde X_i^\top \tilde y_i = X^\top y - x_i y_i$.
The $i$th Ridge solution verifies $(\tilde X_i^\top \tilde X_i + \lambda I)\tilde\beta_i = \tilde X_i^\top \tilde y_i$. Using the notations introduced in the previous paragraph, we can rewrite this as

$$ (X^\top X + \lambda I - x_i x_i^\top)\,\tilde\beta_i \;=\; X^\top y - x_i y_i. \tag{8} $$

If we write $H_\lambda = X^\top X + \lambda I$, then we know that it can be efficiently inverted assuming we have initially computed the SVD of $X^\top X$. Note also that $H_\lambda^{-1} X^\top y = \beta_\lambda$.
On the left-hand side of (8) we have therefore a rank-1 perturbation of an invertible matrix. The inverse of such a matrix can be readily expressed using the Sherman-Morrison formula (see the page on matrix inversion lemmas for details):

$$ (H_\lambda - x_i x_i^\top)^{-1} \;=\; H_\lambda^{-1} + \frac{H_\lambda^{-1} x_i x_i^\top H_\lambda^{-1}}{1 - x_i^\top H_\lambda^{-1} x_i}. \tag{9} $$

Plugging this in the lhs of (8), we can get an explicit expression for $\tilde\beta_i$. But remember that what we really want is $r_i = x_i^\top\tilde\beta_i - y_i$, so that we're more interested in $x_i^\top\tilde\beta_i$:

$$ x_i^\top\tilde\beta_i \;=\; x_i^\top\left(H_\lambda^{-1} + \frac{H_\lambda^{-1} x_i x_i^\top H_\lambda^{-1}}{1 - x_i^\top H_\lambda^{-1} x_i}\right)\left(X^\top y - x_i y_i\right). \tag{10} $$
As it turns out, this can be simplified a fair bit. Write $a_i = x_i^\top H_\lambda^{-1} X^\top y = x_i^\top\beta_\lambda$ and $h_i = x_i^\top H_\lambda^{-1} x_i$; expanding (10), the right-hand side is $(a_i - h_i y_i) + \frac{h_i}{1 - h_i}(a_i - h_i y_i)$. Note also that $H_\lambda^{-1} = V(\Sigma^2 + \lambda I)^{-1}V^\top$; let $w = V^\top X^\top y$, which can be pre-computed, and let $g_i = V^\top x_i$ and $\tilde g_i = (\Sigma^2 + \lambda I)^{-1} g_i$, so that $a_i = \langle \tilde g_i, w\rangle$ and $h_i = \langle \tilde g_i, g_i\rangle$. Then, massaging a bit, we get

$$ x_i^\top\tilde\beta_i \;=\; \frac{a_i - h_i y_i}{1 - h_i} \;=\; \frac{\langle \tilde g_i, w\rangle - y_i\,\langle \tilde g_i, g_i\rangle}{1 - \langle \tilde g_i, g_i\rangle}, \tag{11} $$

and, correspondingly, $r_i = x_i^\top\tilde\beta_i - y_i = \dfrac{\langle \tilde g_i, w\rangle - y_i}{1 - \langle \tilde g_i, g_i\rangle}$.
To conclude, with $X^\top X = V\Sigma^2 V^\top$, $w = V^\top X^\top y$, $g_i = V^\top x_i$ and $\tilde g_i = (\Sigma^2 + \lambda I)^{-1} g_i$, we can rewrite the computation of $r_i$ as

$$ r_i \;=\; \frac{\langle \tilde g_i, w\rangle - y_i}{1 - \langle \tilde g_i, g_i\rangle}, \tag{12} $$
and the cost of computing all the LOO errors is:
| Operation | Dominating complexity |
| --- | --- |
| form $X^\top X$ and compute its SVD | $O(np^2)$ |
| pre-compute $X^\top y$ and $w$ | $O(np + p^2)$ |
| ($\forall i$) compute $g_i$ | $O(p^2)$ |
| ($\forall i, \forall\lambda$) compute $\tilde g_i$ | $O(p)$ |
| ($\forall i, \forall\lambda$) compute $r_i$ | $O(p)$ |
So overall, the initial setup cost is dominated by $O(np^2)$ and the subsequent cost is that of computing each $g_i$: $O(np^2)$ over all $i$; finally there's a $O(knp)$ cost where $k$ is the number of $\lambda$ tested.
In other words, modulo constant factors, it costs the same to get all the LOO errors for $k$ different $\lambda$ as to get a single Ridge solution!
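We can check formula (12) numerically in the tall case (the variable names and sizes below are ours; the fat case is checked in the code section further down):

```julia
using LinearAlgebra, InvertedIndices

X_tall = randn(500, 50)
y_tall = randn(500)
λ, i = 2.5, 3

# brute force: refit the Ridge without row i and look at the residual there
X̃ = X_tall[Not(i), :]
ỹ = y_tall[Not(i)]
β̃ = (X̃' * X̃ + λ*I) \ (X̃' * ỹ)
r_brute = dot(X_tall[i, :], β̃) - y_tall[i]

# shortcut (12): one SVD of X'X, then only cheap operations per (i, λ)
F = svd(Symmetric(X_tall' * X_tall))   # X'X = VΣ²Vᵀ
V, S = F.U, F.S                        # S holds the eigenvalues Σ²
w  = V' * (X_tall' * y_tall)           # w  = VᵀXᵀy, computed once
g  = V' * X_tall[i, :]                 # gᵢ = Vᵀxᵢ, once per i
g̃  = g ./ (S .+ λ)                     # g̃ᵢ, once per (i, λ)
r_fast = (dot(g̃, w) - y_tall[i]) / (1 - dot(g̃, g))

r_brute ≈ r_fast   # true
```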
Fat case
When $p \gg n$, we already know that it's beneficial to consider the SVD of $XX^\top$ instead of that of $X^\top X$. Here it turns out that just taking (12) and computing $w$ and $g_i$ in terms of $U$ and $\Sigma$ instead of $V$ is all we need to do; for this, recall that $X = U\Sigma V^\top$ so that $V^\top = \Sigma^{-1}U^\top X$, giving $w = \Sigma^{-1}U^\top XX^\top y$ and $g_i = \Sigma^{-1}U^\top X x_i$:
| Operation | Dominating complexity |
| --- | --- |
| form $XX^\top$ and compute its SVD | $O(n^2 p)$ |
| pre-compute $w$ | $O(n^2)$ |
| ($\forall i$) compute $g_i$ | $O(np)$ |
| ($\forall i, \forall\lambda$) compute $\tilde g_i$ | $O(n)$ |
| ($\forall i, \forall\lambda$) compute $r_i$ | $O(n)$ |
So overall, the initial setup cost is dominated by $O(n^2 p)$ and the subsequent cost is that of computing each $g_i$: $O(n^2 p)$ over all $i$; finally there's a $O(kn^2)$ cost where $k$ is the number of $\lambda$ tested.
Again, modulo constant factors, it costs the same to get all the LOO errors for $k$ different $\lambda$ as to get a single Ridge solution.
Simple code
We can easily check all this with code again; let's extend the previous example:
using InvertedIndices
i = 5 # random index between 1 and n
λ = 2.5
X̃ᵢ = X_fat[Not(i), :]
ỹᵢ = y_fat[Not(i)]
β̃ᵢ = basic_ridge_solve(X̃ᵢ, ỹᵢ, λ)
xᵢ = X_fat[i, :]
yᵢ = y_fat[i]
rᵢ = dot(xᵢ, β̃ᵢ) - yᵢ; # this is the i-th LOO residual
Now let's show we can recover this using the previous point:
K = Symmetric(X_fat * X_fat')
F = svd(K)                          # K = XXᵀ = UΣ²Uᵀ (U = V since K is symmetric PSD)
U, Ut, S = F.V, F.Vt, F.S
Σ² = S
Σ = sqrt.(S)
w = (Ut * (K * y_fat)) ./ Σ         # w  = Σ⁻¹Uᵀ(XXᵀ)y = VᵀXᵀy
gᵢ = (Ut * (X_fat * xᵢ)) ./ Σ       # gᵢ = Σ⁻¹UᵀXxᵢ    = Vᵀxᵢ
g̃ᵢ = gᵢ ./ (Σ² .+ λ)                # g̃ᵢ = (Σ² + λI)⁻¹gᵢ
rᵢ ≈ (dot(g̃ᵢ, w) - yᵢ) / (1 - dot(g̃ᵢ, gᵢ))   # formula (12)
true
Now if we change $i$ and $\lambda$,
i = 7
λ = 5.5
X̃ᵢ = X_fat[Not(i),:]
ỹᵢ = y_fat[Not(i)]
β̃ᵢ = basic_ridge_solve(X̃ᵢ, ỹᵢ, λ)
xᵢ = X_fat[i, :]
yᵢ = y_fat[i]
rᵢ = dot(xᵢ, β̃ᵢ) - yᵢ;
we only have to recompute `gᵢ` and `g̃ᵢ`:
gᵢ = (Ut * (X_fat * xᵢ)) ./ Σ
g̃ᵢ = gᵢ ./ (Σ² .+ λ)
rᵢ ≈ (dot(g̃ᵢ, w) - yᵢ) / (1 - dot(g̃ᵢ, gᵢ))
true
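Tying it together, here is a sketch of a small helper (the function name and the $\lambda$ grid are ours, not from the text) that returns all $n$ LOO residuals for each $\lambda$ in a grid, in the fat case, following the cost table above:

```julia
# all n LOO residuals for each λ in `λs` (fat case); the setup is O(n²p),
# each (i, λ) pair afterwards only costs O(n)
function loo_residuals(X, y, λs)
    n = size(X, 1)
    K = Symmetric(X * X')
    F = svd(K)                     # K = UΣ²Uᵀ
    U, S = F.U, F.S                # S = Σ²
    Σ = sqrt.(S)
    w = (U' * (K * y)) ./ Σ        # w = VᵀXᵀy
    G = (U' * K) ./ Σ              # column i is gᵢ = Vᵀxᵢ
    R = zeros(n, length(λs))
    for (j, λ) in enumerate(λs), i in 1:n
        g̃ = G[:, i] ./ (S .+ λ)
        R[i, j] = (dot(g̃, w) - y[i]) / (1 - dot(g̃, G[:, i]))
    end
    return R
end

R = loo_residuals(X_fat, y_fat, [2.5, 5.5])
R[7, 2] ≈ rᵢ   # matches the brute-force residual above (i = 7, λ = 5.5)
```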
Conclusion
In short, provided that computing the SVD of either a $p\times p$ or an $n\times n$ matrix is manageable, computing a bunch of Ridge solutions, or all the LOO errors, for a number of different $\lambda$ is not much more expensive than computing a single Ridge solution.
This LOO trick can be generalised to an extent to all leave-some-out schemes; this is covered in the second part.
References
Rifkin and Lippert, Notes on Regularized Least Squares, 2007. – A detailed development of a way to get the LOO error in (Kernel) Ridge efficiently; note that their development is not the same as the one presented here, but the end result is comparable. Their formula is the one implemented in Sklearn's RidgeCV.
Golub, Heath and Wahba, Generalized Cross-Validation as a Method for Choosing a Good Ridge Parameter, 1978. – Introduction of the GCV criterion, appropriate for selecting $\lambda$ in the tall case.