In the first part, we showed that the LOOCV trick could be obtained by using the Sherman-Morrison formula for the inversion of a rank-1 perturbation of an invertible matrix. In the general case, we want to drop more than one point at a time; in other words, we would like to consider not just a single rank-1 perturbation but a sum of rank-1 perturbations.
In these notes we investigate using Sherman-Morrison recursively and discuss in which circumstances this makes sense.
When considering a leave-some-out scheme (e.g. K-folds), we consider Ridge problems with data $X_{\neg S}$ and $y_{\neg S}$ where $S$ is a set of rows to drop. As in the first part, we can write $X_{\neg S} = E_{\neg S}X$ and $y_{\neg S} = E_{\neg S}y$ where $E_{\neg S}$ is the identity matrix with the rows in $S$ removed. Much like in the first part, $E_{\neg S}^T E_{\neg S} = I_n - D_S$ where $D_S$ is a zero matrix with 1 on the diagonal for indices in $S$. Observe also that:

$$ X_{\neg S}^T X_{\neg S} \;=\; X^T(I_n - D_S)X \;=\; X^TX - \sum_{i\in S} x_i x_i^T \tag{1} $$

and that $X_{\neg S}^T y_{\neg S} = X^Ty - \sum_{i\in S} x_i y_i$ where $x_i \in \mathbb{R}^p$ denotes the $i$-th row of $X$ (as a column vector).
The solution of the Ridge regression problem corresponding to the subset $S$ is therefore

$$ \hat\beta_{\neg S} \;=\; H_S^{-1}\left(X^Ty - \sum_{i\in S} x_i y_i\right) \tag{2} $$

where $H_S := H - \sum_{i\in S} x_i x_i^T$ if $H := X^TX + \lambda I_p$ denotes the regularised Gram matrix of the full problem.
As before, we care about the prediction error on the points that were removed: the vectors $e_S$ with entries

$$ (e_S)_i \;=\; y_i - x_i^T \hat\beta_{\neg S}, \qquad i \in S. \tag{3} $$
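To make this concrete, here is a minimal NumPy sketch (function and variable names are mine, purely illustrative) that forms $\hat\beta_{\neg S}$ via (2) without ever building $X_{\neg S}$, and checks it against an explicit refit on the remaining rows. The explicit `solve` is the naive route; the rest of these notes is about doing that inversion more cleverly.

```python
import numpy as np

def beta_drop_S(X, y, lam, S):
    """Ridge solution with the rows in S dropped, computed via eq. (2),
    i.e. without forming the reduced matrix X_{not S} explicitly."""
    p = X.shape[1]
    XS, yS = X[S], y[S]
    H_S = X.T @ X + lam * np.eye(p) - XS.T @ XS   # H - sum_{i in S} x_i x_i^T
    b_S = X.T @ y - XS.T @ yS                     # X^T y - sum_{i in S} x_i y_i
    return np.linalg.solve(H_S, b_S)

# sanity check against an explicit refit on the rows outside S
rng = np.random.default_rng(0)
n, p, lam = 50, 5, 0.1
X, y = rng.normal(size=(n, p)), rng.normal(size=n)
S = [3, 7, 11]
keep = np.setdiff1d(np.arange(n), S)
beta_ref = np.linalg.solve(X[keep].T @ X[keep] + lam * np.eye(p), X[keep].T @ y[keep])
assert np.allclose(beta_drop_S(X, y, lam, S), beta_ref)
errors = y[S] - X[S] @ beta_ref                   # entries of e_S as in (3)
```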
We now turn our attention to the efficient computation of the matrix inversion in (2).
Let us compute, step-by-step, the inversion of $H_S$ and show how this can be generalised. Let $S = \{i_1,\dots,i_k\}$ and $H_1 := H - x_{i_1}x_{i_1}^T$. Then, Sherman-Morrison gives

$$ H_1^{-1} \;=\; H^{-1} + \frac{H^{-1}x_{i_1}x_{i_1}^T H^{-1}}{1 - x_{i_1}^T H^{-1} x_{i_1}}. \tag{4} $$
Let us now write $H_2 := H_1 - x_{i_2}x_{i_2}^T$, $u_1 := H^{-1}x_{i_1}$, $u_2 := H_1^{-1}x_{i_2}$, and then:

$$ H_2^{-1} \;=\; H_1^{-1} + \frac{u_2 u_2^T}{1 - x_{i_2}^T u_2} \;=\; H^{-1} + \frac{u_1 u_1^T}{1 - x_{i_1}^T u_1} + \frac{u_2 u_2^T}{1 - x_{i_2}^T u_2}. \tag{5} $$

It's straightforward to generalise this:

$$ H_k^{-1} \;=\; H^{-1} + \sum_{j=1}^{k} \frac{u_j u_j^T}{1 - x_{i_j}^T u_j}, \qquad u_j := H_{j-1}^{-1} x_{i_j}, \quad H_0 := H, \tag{6} $$
which can be computed recursively, interlacing the computations for $u_j$ and $H_j^{-1}$ with the computation of the next term in the sum.
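As a sketch (again with illustrative names, and assuming $1 - x_{i_j}^T u_j \neq 0$ at every step, i.e. that each $H_j$ stays invertible), the recursion (4)–(6) looks like this:

```python
import numpy as np

def rsm_inverse(H_inv, X, S):
    """Recursive Sherman-Morrison: from H^{-1}, build H_S^{-1} where
    H_S = H - sum_{i in S} x_i x_i^T, following the recursion (4)-(6)."""
    A = H_inv.copy()                               # A holds H_j^{-1}
    for i in S:
        x = X[i]
        u = A @ x                                  # u_j = H_{j-1}^{-1} x_{i_j}, O(p^2)
        A = A + np.outer(u, u) / (1.0 - x @ u)     # add the next term of the sum (6)
    return A

# quick check on random data
rng = np.random.default_rng(1)
n, p, lam = 30, 4, 0.5
X = rng.normal(size=(n, p))
H = X.T @ X + lam * np.eye(p)
S = [0, 5, 9]
assert np.allclose(rsm_inverse(np.linalg.inv(H), X, S),
                   np.linalg.inv(H - X[S].T @ X[S]))
```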
Let $k$ be the number of rows dropped per selection/fold. Then there are $k$ terms in the sum (6), each given by the application of a $p\times p$ matrix to a vector. So, inherently, the complexity of this recursive inverse is $O(kp^2)$ and it shouldn't be used if $k$ is of order comparable to $p$ (at which point a direct $O(p^3)$ factorisation of $H_S$ costs about the same).
In K-folds cross validation, $k = n/K$ where $K$ might typically be around 5 or 10, so that $k$ may very well be of order $p$ or larger. We will come back to this in the next section.
One unfortunate element of the recursive Sherman-Morrison approach (RSM) is that it is not possible (to the best of my knowledge) to easily get the inverse of $H_S$ for several values of $\lambda$ without recomputing the recursion each time. This is because the computation of each $H_j^{-1}$ after the initial one involves non-diagonal matrices that depend on $\lambda$. Note that in the LOO case this was not a problem because we only had one diagonal matrix to update per $\lambda$, which is an $O(p)$ operation (i.e. negligible).
The observation above means that for a single selection/fold, we have to pay $O(n_\lambda k p^2)$ if $n_\lambda$ is the number of values of $\lambda$ we want to test. Assuming we do $K$ folds/selections, the overall cost is therefore $O(K n_\lambda k p^2)$ to form all of the vectors $\hat\beta_{\neg S}$.
In K-folds CV, $k$ is $n/K$ and the total cost with RSM is thus $O(n_\lambda n p^2)$.
If, instead, we were to re-compute an SVD of $X_{\neg S}$ per fold and recycle the computations for the $\lambda$, the cost per fold is $O((n-k)p^2 + n_\lambda p^2)$, or $O(Knp^2)$ overall assuming that $n_\lambda$ is dominated by $n-k$.
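For reference, a sketch of that per-fold approach (illustrative code, not a library API): the SVD of the fold's training block is computed once, and only a diagonal rescaling changes with $\lambda$.

```python
import numpy as np

def ridge_path_one_fold(X_train, y_train, lambdas):
    """One SVD per fold, recycled across all lambdas:
    beta(lam) = V diag(s / (s^2 + lam)) U^T y."""
    U, s, Vt = np.linalg.svd(X_train, full_matrices=False)   # O((n-k) p^2), once
    Uty = U.T @ y_train                                       # O((n-k) p), once
    # per lambda: O(p) rescaling plus an O(p^2) matrix-vector product
    return [Vt.T @ ((s / (s**2 + lam)) * Uty) for lam in lambdas]
```

The held-out errors $y_S - X_S\hat\beta_{\neg S}(\lambda)$ then only cost an extra $O(kp)$ per $\lambda$.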
Since, in general, we are much more likely to want to test many $\lambda$ (large $n_\lambda$) with a few folds (with $K$ of order 5-10), doing re-computations per fold makes more sense than applying RSM.
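To give a rough (purely illustrative) sense of scale: with $n = 10^4$, $p = 100$, $K = 5$ and $n_\lambda = 100$, RSM costs on the order of $n_\lambda n p^2 = 10^{10}$ operations, against $Knp^2 = 5\times 10^8$ for per-fold re-computation, i.e. a factor $n_\lambda/K = 20$ in favour of the latter.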
Note: it is still important to recycle computations, otherwise we pay $O(K n_\lambda n p^2)$ by doing naive grid computations for every fold and every $\lambda$; surprisingly this seems to be what is currently done in SkLearn as per these lines of code.
Since the complexity of RSM is $O(n_\lambda n p^2)$, it shouldn't be considered. Computing the problem with $n-k$ rows per fold and recycling the computations for the $\lambda$ is best.
So, disappointingly, the generalisation of the LOO trick to the more general case is not particularly useful compared to simply re-computing things per fold and recycling computations for the $\lambda$. This could maybe have been expected: we have to deal with an expansion in $k$ terms, where $k$ scales like $n/K$ in folds with low $K$ (the usual case).