CV Ridge (K-Folds)

In the first part, we showed that the LOOCV trick could be obtained by using the Sherman-Morrison formula for the inversion of a rank-1 perturbation of an invertible matrix. In the general case, we want to drop more than one point at a time; in other words, we would like to handle not just a single rank-1 perturbation but a sum of rank-1 perturbations.

In these notes we investigate using Sherman-Morrison recursively and discuss in which circumstances this makes sense.

Prelims

When considering a leave-some-out scheme (e.g. K-folds), we consider Ridge problems with data $X_S$ and $y_S$ where $S \subset \{1,\dots,n\}$ is the set of rows to drop. As in the first part, we can write $X_S = \Omega_S X$ and $y_S = \Omega_S y$ where $\Omega_S$ is the identity matrix with the rows in $S$ removed. Much like in the first part, $\Omega_S^t \Omega_S = (I - D_S)$ where $D_S$ is a zero matrix with 1 on the diagonal for indices in $S$. Observe also that:

$$ X^t D_S X \;=\; \sum_{i\in S} x_i x_i^t \tag{1} $$

and, similarly, $X^t D_S y = \sum_{i\in S} y_i x_i$, so that $X_S^t y_S = X^t(I - D_S)y = z - \sum_{i\in S} y_i x_i$ where $z = X^t y$.

The solution of the Ridge regression problem corresponding to the subset SS is therefore

$$ \beta_S \;=\; \left(H - \sum_{i\in S} x_i x_i^t\right)^{-1}\left(z - \sum_{i\in S} y_i x_i\right), \tag{2} $$

where $H = (X^t X + \lambda I) = V(\Sigma^2 + \lambda I)V^t$ if $X = U\Sigma V^t$.

As before, we care about the prediction error on the points that were removed: the vectors $e_S$ with entries

$$ (e_S)_i \;=\; x_j^t \beta_S - y_j, \quad\text{where}\quad j = S_i. \tag{3} $$
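To make these quantities concrete, here is a minimal sketch of the naive per-fold computation in Python/NumPy, building $\beta_S$ and $e_S$ directly from (2) and (3); the function and variable names are mine, for illustration only:

```python
import numpy as np

def fold_residuals(X, y, S, lam):
    """Naive reference: solve the Ridge problem with the rows in S dropped
    and return the prediction errors e_S on those held-out rows."""
    n, p = X.shape
    H = X.T @ X + lam * np.eye(p)        # H = X^t X + lambda I
    z = X.T @ y                          # z = X^t y
    X_drop, y_drop = X[S], y[S]          # the rows indexed by S
    H_S = H - X_drop.T @ X_drop          # H - sum_{i in S} x_i x_i^t
    z_S = z - X_drop.T @ y_drop          # z - sum_{i in S} y_i x_i
    beta_S = np.linalg.solve(H_S, z_S)   # beta_S as in (2)
    return X_drop @ beta_S - y_drop      # e_S, one entry per dropped row
```

This pays for one $p\times p$ solve per fold and per $\lambda$; the question below is whether the inversion can be done more cleverly.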

We now turn our attention to the efficient computation of the matrix inversion in (2).

Recursive Sherman-Morrison (RSM)

Let us compute, step-by-step, the inversion of $(H - x_1 x_1^t - x_2 x_2^t)$ and show how this can be generalised. Let $H_1 = H - x_1 x_1^t$ and $H_2 = H_1 - x_2 x_2^t$. Then, Sherman-Morrison gives

$$ \begin{aligned} H_1^{-1} &= H^{-1} + \frac{H^{-1} x_1 x_1^t H^{-1}}{1 - x_1^t H^{-1} x_1}, \\ H_2^{-1} &= H_1^{-1} + \frac{H_1^{-1} x_2 x_2^t H_1^{-1}}{1 - x_2^t H_1^{-1} x_2}. \end{aligned} \tag{4} $$

Let us now write $b_1 = H^{-1} x_1$, $\gamma_1 = x_1^t b_1$, $b_2 = H_1^{-1} x_2$ and $\gamma_2 = x_2^t b_2$; then:

$$ \begin{aligned} H_1^{-1} &= H^{-1} + \frac{b_1 b_1^t}{1 - \gamma_1}, \\ H_2^{-1} &= H^{-1} + \frac{b_1 b_1^t}{1 - \gamma_1} + \frac{b_2 b_2^t}{1 - \gamma_2}. \end{aligned} \tag{5} $$

It's straightforward to generalise this:

$$ \left(H - \sum_{i\in S} x_i x_i^t\right)^{-1} \;=\; H^{-1} + \sum_{i\in S} \frac{b_i b_i^t}{1 - \gamma_i}, \tag{6} $$

which can be computed recursively, interleaving the computation of each $b_i$ and $\gamma_i$ (with $b_i$ formed using the inverse accumulated so far, just as $b_2$ was formed with $H_1^{-1}$) and the computation of the next term in the sum; see the sketch below.
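As a concrete illustration, here is a minimal NumPy sketch of this recursive Sherman-Morrison update (the function name and interface are mine, not from any library):

```python
import numpy as np

def rsm_inverse(H_inv, X_S):
    """Given H^{-1} (p x p) and the dropped rows X_S (M x p), return
    (H - sum_{i in S} x_i x_i^t)^{-1} via M successive rank-1 updates."""
    A = H_inv.copy()                            # running inverse
    for x in X_S:
        b = A @ x                               # b_i, using the inverse so far
        gamma = x @ b                           # gamma_i = x_i^t b_i
        A = A + np.outer(b, b) / (1.0 - gamma)  # Sherman-Morrison step
    return A
```

Each step costs a matrix-vector product plus a rank-1 update, i.e. $O(p^2)$, which is where the complexity discussed next comes from.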

Complexity of the recursion

Let $M = |S|$ be the number of rows dropped per selection/fold. Then there are $M$ terms in the sum (6). Each $b_i$ is given by the application of a $p\times p$ matrix to a vector. So, inherently, the complexity of this recursive inverse is $O(Mp^2)$ and it shouldn't be used if $M$ is of order comparable to $p$.

In K-folds cross validation, $M = O(n/K)$ where $K$ might typically be around 5 or 10, so that $M$ may very well be of order $p$ or larger. We will come back to this in the next section.

Applying RSM in the tall case

One unfortunate aspect of RSM is that it is not possible (to the best of my knowledge) to easily get the inverse of the matrix for several $\lambda$ without re-running the recursion each time. This is because the computation of each $b_i$ after the first one involves non-diagonal matrices that depend on $\lambda$. In the LOO case, by contrast, we only had one diagonal matrix to update, which is $O(p)$ (i.e. negligible).

The observation above means that, for a single selection/fold, we have to pay $O(\kappa M p^2)$ if $\kappa$ is the number of $\lambda$ we want to test. Assuming we do $K$ folds/selections, the overall cost to form all of the $e_S$ vectors is therefore $O(\kappa K M p^2)$.

K-folds CV

In K-folds CV, $M$ is $O(n/K)$ and the total cost with RSM is thus $O(\kappa n p^2)$.

If, instead, we were to re-compute an SVD of $X_S^t X_S$ per fold and recycle the computations for the $\lambda$, the cost per fold is $O((n - M + \kappa)p^2)$, or $O(K n p^2)$ overall assuming that $\kappa$ is dominated by $n$.
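As a rough sketch of what recycling per fold can look like (here via an eigendecomposition of $X_S^t X_S$, which is equivalent to its SVD since the matrix is symmetric PSD; names and structure are illustrative, not taken from any library):

```python
import numpy as np

def fold_residuals_all_lambdas(X, y, S, lambdas):
    """One factorisation per fold, then each lambda only costs O(p^2)."""
    keep = np.ones(len(y), dtype=bool)
    keep[S] = False
    X_S, y_S = X[keep], y[keep]             # rows kept for this fold
    X_drop, y_drop = X[S], y[S]             # held-out rows
    evals, V = np.linalg.eigh(X_S.T @ X_S)  # X_S^t X_S = V diag(evals) V^t
    w = V.T @ (X_S.T @ y_S)                 # rotate X_S^t y_S once
    # each lambda now only needs a diagonal rescaling and a rotation back
    return [X_drop @ (V @ (w / (evals + lam))) - y_drop for lam in lambdas]
```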

Since, in general, we are much more likely to want to test many $\lambda$ (large $\kappa$) with a few folds ($K$ of order 5-10), doing the re-computations per fold makes more sense than applying RSM.

Note: it is still important to recycle computations, otherwise we pay $O(\kappa K n p^2)$ by doing a naive grid computation for every fold and every $\lambda$; surprisingly, this seems to be what is currently done in SkLearn as per these lines of code.

Fat case

Since the complexity of RSM is $O(Mp^2)$, with $p$ large in the fat case, it shouldn't be considered here either. Solving the $K$ problems with $n - M$ rows each and recycling the computations for the $\lambda$ is best.

Conclusion

So, disappointingly, the generalisation of the LOO trick to the more general case is not particularly useful compared to simply re-computing things per fold and recycling the computations for the $\lambda$. This could maybe have been expected: we have to deal with an expansion in $M$ terms, where $M$ scales like $n$ in $K$-folds with low $K$ (the usual case).