This is an experimental set of notes; take it with appropriate care.
First order methods (FOM) broadly designate iterative methods for continuous and (sub)differentiable optimisation that mainly use information from the (sub)gradient of the function.
In these notes we consider again the constrained minimisation problem $\min_{x \in C} f(x)$ and, to simplify the presentation, we'll assume that $C$ is closed and convex and that $f$ is strictly convex and smooth on $C$.
What we're interested in, at a high level, is how to generate a minimising sequence for $f$, i.e., a sequence of points $x_k \in C$ such that $f(x_{k+1}) \le f(x_k)$ and $f(x_k) \to \min_{x \in C} f(x)$ as $k$ grows.
Remark: all norms on this page are 2-norms.
Local linearisation is basically what first order methods are about: form a linear approximation of the function around a current point and use it to move in a promising direction. Still, let's try to build this from scratch.
Consider the local linearisation of $f$ around a point $x$:

$$ f(z) \;=\; f(x) + \langle \nabla f(x),\, z - x\rangle + r_x(z), \tag{1} $$

where $r_x$ is the remainder function.
Note: here you may be thinking: Taylor expansion. But let's set that aside for now and just assume that you're building this decomposition and that you want to investigate the properties of the remainder function.
That function enjoys the following properties:
- $r_x(x) = 0$, and $r_x(z) > 0$ for all $z \neq x$ by strict convexity of $f$,
- $r_x$ is also strictly convex and smooth, for all $x$.
Let us introduce a more compact notation: $r_x(z) = f(z) - f(x) - \langle \nabla f(x),\, z - x\rangle$, which will be quite useful. It is easy to note that $r_x$ is smooth and strictly convex with $\nabla r_x(z) = \nabla f(z) - \nabla f(x)$ and, in fact, $r_x$ is globally minimised at $x$ (where both $r_x$ and its gradient vanish). By definition of strict convexity, we have

$$ r_x(y) \;>\; r_x(z) + \langle \nabla r_x(z),\, y - z\rangle \qquad \text{for all } y \neq z. \tag{2} $$
Taking $y = x$ and rearranging terms yields

$$ r_x(z) \;<\; \langle \nabla r_x(z),\, z - x\rangle \;\le\; \|\nabla r_x(z)\|\,\|z - x\|, \tag{3} $$
using Cauchy-Schwarz for the second inequality. Rearranging again gives $0 \le r_x(z)/\|z - x\| \le \|\nabla r_x(z)\|$ for all $z \neq x$. Since $r_x$ is smooth and since $\nabla r_x(x) = 0$ ($x$ is the minimiser), we can take the limit as $z \to x$ and get the following result:

$$ \lim_{z \to x}\; \frac{r_x(z)}{\|z - x\|} \;=\; 0, \quad \text{i.e.,} \quad r_x(z) = o(\|z - x\|). \tag{4} $$
This could have been obtained directly from the Taylor expansion of $f$ but I think it's nice to obtain it using only notions of convexity. Another note is that the reasoning essentially still holds if $f$ is only convex and sub-differentiable.
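To see (4) numerically, here is a quick check (a sketch of my own, not from the notes; the strictly convex test function is an arbitrary choice): the ratio $r_x(z)/\|z - x\|$ vanishes as $z$ approaches $x$ along a fixed direction.

```python
import numpy as np

# An arbitrary strictly convex, smooth test function: f(x) = ||x||^2 + exp(x_1).
f = lambda x: np.dot(x, x) + np.exp(x[0])

def grad_f(x):
    g = 2 * x
    g[0] += np.exp(x[0])
    return g

def remainder(x, z):
    """r_x(z) = f(z) - f(x) - <grad f(x), z - x>, the remainder in (1)."""
    return f(z) - f(x) - np.dot(grad_f(x), z - x)

rng = np.random.default_rng(0)
x = rng.normal(size=3)
d = rng.normal(size=3)
d /= np.linalg.norm(d)               # fixed unit direction

for t in [1.0, 0.1, 0.01, 0.001]:
    z = x + t * d
    print(t, remainder(x, z) / np.linalg.norm(z - x))
# The printed ratio shrinks roughly linearly with t: r_x(z) = o(||z - x||).
```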
Let's consider a point $x$ in $C$ and a step from $x$ to $x + d$ for some $d \neq 0$. We're interested in determining what "good" steps to take look like. Using the notations from the previous point, we have

$$ f(x + d) \;=\; f(x) + \langle \nabla f(x),\, d\rangle + r_x(x + d). \tag{5} $$
Such a step will be called an admissible descent step if $x + d \in C$ and if it decreases the function, i.e. if $f(x + d) < f(x)$ or, equivalently:

$$ \langle \nabla f(x),\, d\rangle + r_x(x + d) \;<\; 0. \tag{6} $$
Let $\mathcal{D}_x$ be the set of admissible descent steps from $x$; it is non-empty provided that $\nabla f(x) \neq 0$ (assume for simplicity that $x$ lies in the interior of $C$, so that small enough steps remain feasible). To show this, let $d = t(-a u + v)$ with $t, a > 0$ and

- $u = \nabla f(x) / \|\nabla f(x)\|$ (the unit vector in the direction of the gradient),
- $v$ orthogonal to $u$, i.e. such that $\langle u, v\rangle = 0$, and such that $\| -a u + v \| = 1$.
Then just by plugging things in we have

$$ f(x + d) - f(x) \;=\; -\,t\,a\,\|\nabla f(x)\| + r_x(x + d), \tag{7} $$

but recall that $r_x(x + d) = o(t)$ by (4), since $\|d\| = t$. And since $a$ and $\|\nabla f(x)\|$ are fixed and positive, the first term is negative and linear in $t$. As a result, for sufficiently small $t$, the right-hand-side is negative and the condition (6) holds for $d$.
Note that, by construction, these steps span a half-ball of small radius (the directions making a negative inner product with the gradient, intersected with a small ball), so that $\mathcal{D}_x$ is non-empty and also non-degenerate.
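Here is a quick numerical illustration of this construction (again a sketch of my own, on the same test function; seed and constants are arbitrary): random directions drawn from the half-ball all decrease $f$ for a small enough $t$.

```python
import numpy as np

# Same illustrative function as above: f(x) = ||x||^2 + exp(x_1).
f = lambda x: np.dot(x, x) + np.exp(x[0])

def grad_f(x):
    g = 2 * x
    g[0] += np.exp(x[0])
    return g

rng = np.random.default_rng(1)
x = rng.normal(size=3)
g = grad_f(x)
u = g / np.linalg.norm(g)           # unit vector along the gradient

for _ in range(5):
    v = rng.normal(size=3)
    v -= np.dot(v, u) * u           # make v orthogonal to u
    a = rng.uniform(0.1, 1.0)       # any positive weight on -u
    w = -a * u + v
    w /= np.linalg.norm(w)          # unit direction in the descent half-ball
    t = 1e-3                        # small step size
    print(f(x + t * w) < f(x))      # prints True: an admissible descent step
```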
Obviously, what we would like is to get the best possible step:

$$ \min_{z \in C}\;\; \langle \nabla f(x),\, z - x\rangle + r_x(z), \tag{8} $$

which, since the objective is just $f(z) - f(x)$, leads directly to the minimiser $x^\star$. Of course that's a bit silly since solving (8) is as hard as solving the original problem. However the expression (8) will help us generate descent algorithms.
What we would like is thus to consider a problem that is simpler than (8) and yet still generates an admissible descent direction (and iterate). A natural way to try to do just that is to replace $r_x$ by a proxy function $h(\cdot, x)$ enjoying the same properties of positive definiteness and strict convexity. The corresponding approximate problem is then

$$ \min_{z \in C}\;\; \langle \nabla f(x),\, z - x\rangle + \lambda\, h(z, x) \tag{9} $$

for some scaling $\lambda > 0$.
Let's now show that these problems can lead to admissible descent steps for the original problem. We can follow a similar reasoning to that which led us to show the non-degeneracy of $\mathcal{D}_x$. In particular, observe that as $\lambda \to \infty$, the minimiser $z_\lambda$ of (9) converges to $x$ (and, by optimality of $z_\lambda$ and positivity of $h$, $\langle \nabla f(x), z_\lambda - x\rangle < 0$, so the step points into the descent half-ball). Hence, there exists a large enough $\bar\lambda$ such that for any $\lambda \ge \bar\lambda$, the step $z_\lambda - x$ is small enough for it to be in $\mathcal{D}_x$.
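To see what (9) buys us in a standard special case (spelled out here for concreteness, it is not in the original text): take the Euclidean proxy $h(z, x) = \frac12\|z - x\|^2$. Completing the square gives

$$ \arg\min_{z \in C}\, \Big\{ \langle \nabla f(x),\, z - x\rangle + \frac{\lambda}{2}\|z - x\|^2 \Big\} \;=\; \arg\min_{z \in C}\, \frac{\lambda}{2}\Big\|z - \big(x - \tfrac{1}{\lambda}\nabla f(x)\big)\Big\|^2 \;=\; P_C\big(x - \tfrac{1}{\lambda}\nabla f(x)\big), $$

where $P_C$ is the Euclidean projection onto $C$: the subproblem is a projected gradient step with step size $1/\lambda$, and a large $\lambda$ indeed means a small step.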
Now that we know that (9) can lead to an admissible step, we can suggest iterating over the problem with a sequence of $\lambda_k$:

$$ x_{k+1} \;=\; \arg\min_{z \in C}\;\; \langle \nabla f(x_k),\, z - x_k\rangle + \lambda_k\, h(z, x_k). \tag{10} $$
However, basic manipulation of that expression shows that this is in fact the generalised projected gradient descent (GPGD) that we saw before.
The generalised projected gradient descent (GPGD) corresponds to the following iteration:

$$ x_{k+1} \;=\; \arg\min_{z \in C}\;\; \langle \nabla f(x_k),\, z - x_k\rangle + \lambda_k\, h(z, x_k) $$

for some $\lambda_k > 0$ and for any positive-definite function $h$ that is strictly convex in its first argument. It generates a minimising sequence for $f$ provided the steps are small enough (i.e., the $\lambda_k$ are large enough).
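Here is a minimal sketch of this iteration in code, assuming the Euclidean proxy $h(z, x) = \frac12\|z - x\|^2$ (so each subproblem is the projected gradient step derived above); the quadratic test problem, the box constraint, and the name `gpgd` are illustrative choices, not from the notes.

```python
import numpy as np

# Test problem (illustrative): minimise a strictly convex quadratic over a box.
A = np.array([[3.0, 1.0], [1.0, 2.0]])    # positive definite
b = np.array([1.0, -2.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad_f = lambda x: A @ x - b
project = lambda z: np.clip(z, 0.0, 1.0)  # projection onto the box C = [0,1]^2

def gpgd(x0, lam=10.0, iters=200):
    """GPGD with h(z, x) = 0.5 * ||z - x||^2: the subproblem in (10) has the
    closed-form solution x_{k+1} = P_C(x_k - grad_f(x_k) / lam)."""
    x = x0
    for _ in range(iters):
        x = project(x - grad_f(x) / lam)
    return x

x_star = gpgd(np.zeros(2))
print(x_star, f(x_star))   # converges to (1/3, 0) on this instance
```

Swapping `project` and the subproblem solver is all it takes to encode a different $C$ or a more informed $h$.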
This may seem like it's not telling us that much; in particular, it should be clear that you could pick the steps infinitesimally small, that it would indeed give you a minimising sequence, but also that it wouldn't bring you very far. So at this point there are two comments we can make:
- the ideal $\lambda_k$ encapsulate a tradeoff between steps that are too big and may not be admissible, and steps that are too small to provide useful improvement,
- a key element that should hopefully be obvious by now corresponds to how we can interpret $h$: if we know nothing about the function at hand, we can just use a default $h(z, x) = \frac12\|z - x\|^2$, but if we do know something useful about the function (and, in fact, about $r_x$), then that could be encoded in $h$.
The second point is very important: it should be clear to you that you'd want the local problems to be as informed as possible while at the same time you'd want the iterations to not be overly expensive to compute, two extreme cases (compared numerically after this list) being:
- $h(z, x) = \frac12\|z - x\|^2$ and small steps: the iterations are cheap to compute but potentially quite poor at decreasing the function, many steps are needed, minimal use of problem structure,
- $h(\cdot, x) = r_x$: the iteration is maximally expensive (the subproblem is (8) itself) but only a single step is needed; maximal use of problem structure.
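To make the tradeoff concrete, here is a small numerical comparison of the two regimes (my own illustration on an arbitrary unconstrained quadratic, where $r_x(z) = \frac12 (z - x)^\top A (z - x)$ is available in closed form, so the fully informed proxy $h = r_x$ can actually be used):

```python
import numpy as np

# Ill-conditioned quadratic: f(x) = 0.5 x^T A x - b^T x, unconstrained (C = R^2).
A = np.diag([100.0, 1.0])
b = np.array([1.0, 1.0])
grad_f = lambda x: A @ x - b
x_star = np.linalg.solve(A, b)            # true minimiser

# Cheap proxy h(z, x) = 0.5 ||z - x||^2: plain gradient steps of size 1/lam,
# where lam must dominate the largest curvature for the steps to be admissible.
x, lam, k = np.zeros(2), 100.0, 0
while np.linalg.norm(x - x_star) > 1e-6:
    x = x - grad_f(x) / lam
    k += 1
print("Euclidean proxy:", k, "iterations")           # on the order of 1000 steps

# Informed proxy h = r_x: the subproblem is problem (8) itself and one
# (Newton-like) step x - A^{-1} grad_f(x) lands exactly on the minimiser.
x = np.zeros(2)
x = x - np.linalg.solve(A, grad_f(x))
print("Proxy h = r_x:", np.linalg.norm(x - x_star))  # ~0 after a single step
```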
This key tradeoff can be exposed in most iterative methods using gradient information that you'll find in the literature.