Thoughts on first order descent methods

This is an experimental set of notes, take it with appropriate care.

First order methods (FOM) broadly designate iterative methods for continuous and (sub)differentiable optimisation that mainly use information from the (sub)gradient of the function.

In these notes we consider again the constrained minimisation problem $\min_{x\in C} f(x)$ and, to simplify the presentation, we'll assume that $C$ is closed and convex and that $f$ is strictly convex and smooth on $C$.

What we're interested in, at a high level, is how to generate a minimising sequence for $f$, i.e., a sequence $\{x_k\}$ with $x_k\in C$ such that $f(x_{k+1}) < f(x_k)$ and $f(x_k) \to f(x^\dagger)$ as $k$ grows, where $x^\dagger$ denotes the minimiser of $f$ on $C$.

Remark: all norms $\|\cdot\|$ on this page are 2-norms.

Local linearisation

Local linearisation is basically what first order methods are about: form a linear approximation of the function around the current point and use it to move in a promising direction. Still, let's try to look at this from scratch.

Consider the local linearisation of $f$ around a point $a\in C^\circ$:

$$ f(x) \quad\!\! =\quad\!\! f(a) + \left\langle x-a, \nabla f(a)\right\rangle + r(x, a) $$

where $r(x, a)$ is the remainder function.

Note: here you may be thinking: Taylor expansion. But let's actually put that on the side for now and just assume that you're building this $f(a)+\left\langle x-a,\nabla f(a)\right\rangle$ and that you want to investigate the properties of the remainder function.

That function enjoys the following properties:

  1. $r(a, a)=0$, and $r(x, a)>0$ for all $x\neq a$ by strict convexity of $f$,

  2. $r(\cdot, a)$ is also strictly convex and smooth for all $a\in C^\circ$.

Let us introduce a more compact notation: $r_a(\delta) := r(a+\delta, a)$ which will be quite useful. It is easy to note that $r_a$ is smooth and strictly convex with $r_a(0)=0$ and, in fact, is globally minimised at $0$. By definition of strict convexity, we have

$$ r_a(\delta') \quad\!\! >\quad\!\! r_a(\delta) + \left\langle \delta'-\delta, \nabla r_a(\delta)\right\rangle, \quad \forall\delta' \neq \delta. $$

Taking $\delta'=0$ and rearranging terms yields

$$ r_a(\delta) \quad\!\! <\quad\!\! \left\langle \delta, \nabla r_a(\delta)\right\rangle \quad\!\! \le\quad\!\! \|\delta\|\|\nabla r_a(\delta)\|, $$

using Cauchy-Schwarz for the second inequality. Rearranging again gives $r_a(\delta)/\|\delta\| < \|\nabla r_a(\delta)\|$ for all $\delta\neq 0$. Since $r_a$ is smooth with $\nabla r_a(0)=0$ (as $0$ is its global minimiser), the right-hand side tends to $0$ as $\|\delta\|\to 0$, and we get the following result.

The function $r_a$ is $o(\|\delta\|)$ meaning

$$ \lim_{\|\delta\|\to 0} {r_a(\delta)\over \|\delta\|} \quad\!\! =\quad\!\! 0. $$

This could have been obtained directly from the Taylor expansion of $f$ but I think it's nice to obtain it using only notions of convexity. Another note is that the reasoning essentially still holds if $f$ is only convex and sub-differentiable.
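As a quick numerical sanity check (a sketch, not part of the argument), the snippet below estimates $r_a(\delta)/\|\delta\|$ for shrinking steps; the test function $f(x)=\sum_i e^{x_i}$, the point $a$ and the direction are assumptions made purely for illustration.

```python
import numpy as np

# Toy strictly convex, smooth function (an assumption for this sketch):
# f(x) = sum_i exp(x_i), with gradient exp(x).
f = lambda x: np.sum(np.exp(x))
grad_f = lambda x: np.exp(x)

a = np.array([0.3, -0.7])

def remainder(delta):
    """r_a(delta) = f(a + delta) - f(a) - <delta, grad f(a)>."""
    return f(a + delta) - f(a) - delta @ grad_f(a)

direction = np.array([1.0, 2.0])
direction /= np.linalg.norm(direction)

for eps in [1e-1, 1e-2, 1e-3, 1e-4]:
    delta = eps * direction
    print(f"{eps:8.0e}  r_a(delta)/||delta|| = {remainder(delta) / np.linalg.norm(delta):.2e}")
# the ratio shrinks roughly linearly with eps, consistent with r_a = o(||delta||)
```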

Admissible descent steps

Let's consider a point $x$ in $C$ and a step from $x$ to $x+\delta$ for some $\delta\neq 0$. We're interested in determining which steps $\delta$ are "good" to take. Using the notation from the previous point, we have

$$ f(x+\delta) \quad\!\! =\quad\!\! f(x)+\left\langle \delta, \nabla f(x)\right\rangle + r_x(\delta). $$

Such a step will be called an admissible descent step if $x+\delta\in C$ and if it decreases the function, i.e. if $f(x+\delta) < f(x)$ or:

$$ \left\langle \delta, \nabla f(x)\right\rangle + r_x(\delta) \quad\!\! <\quad\!\! 0. $$

Let $\mathcal D_x$ be the set of admissible descent steps from $x$; for $x\in C^\circ$ it is non-empty provided that $0<\|\nabla f(x)\|<\infty$. To show this, let $\delta_\epsilon := -\epsilon(g+v)$ with $\epsilon > 0$, where $g := \nabla f(x)/\|\nabla f(x)\|^2$ and $v$ is any vector orthogonal to $\nabla f(x)$.

Then just by plugging things in we have

$$ \left\langle \delta_\epsilon, \nabla f(x)\right\rangle + r_x(\delta_\epsilon) \quad\!\! =\quad\!\! -\epsilon + r_x(-\epsilon(g+v)) $$

but recall that $r_x(-\epsilon(g+v)) = o(\epsilon\|g+v\|)$ by the little-o result above. And since $g$ and $v$ are fixed, $r_x(-\epsilon(g+v)) = o(\epsilon)$. As a result, for sufficiently small $\epsilon$, the right-hand side is negative and the descent condition above holds for $\delta_\epsilon$.

For sufficiently small $\epsilon$, $\delta_\epsilon=-\epsilon(g+v)$ is an admissible descent step.

Note that, by construction (letting $\epsilon$ and $v$ vary), these $\delta_\epsilon$ cover a small half-ball around the origin, so that $\mathcal D_x$ is non-empty and also non-degenerate. More generally:

Let $w$ be such that $0<\|w\|=:\eta$ and $\left\langle w, \nabla f(x)\right\rangle<0$, then, provided $\eta$ is small enough, $w\in\mathcal D_x$.
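Here is a minimal sketch of that statement: take any direction making a negative inner product with the gradient and shrink it until it becomes an admissible descent step. The test function, the point and the direction are again assumptions, and $C=\mathbb R^2$ so feasibility is automatic.

```python
import numpy as np

# Same toy strictly convex function as before (an assumption for this sketch),
# with C = R^2 so that feasibility of x + w is automatic.
f = lambda x: np.sum(np.exp(x))
grad_f = lambda x: np.exp(x)

x = np.array([0.5, 1.0])
g = grad_f(x)

w = np.array([-1.0, 0.2])        # any direction with <w, grad f(x)> < 0 ...
assert w @ g < 0

eta = 1.0                        # ... is a descent step once scaled down enough
while f(x + eta * w) >= f(x):
    eta /= 2
print("admissible descent step of norm", eta * np.linalg.norm(w),
      "decrease:", f(x) - f(x + eta * w))
```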

Obviously, what we would like is to get the best possible step:

$$ \delta^\dagger \quad\!\! \in\quad\!\! \arg\min_{\delta \mid x+\delta \in C} \,\, \left[\left\langle \delta, \nabla f(x)\right\rangle+r_x(\delta)\right] $$

which leads directly to the minimiser $x^\dagger = x+\delta^\dagger$. Of course that's a bit silly since solving this problem is just as hard as solving the original one. However this expression will help us generate descent algorithms.

Local update schemes

What we would like is thus to consider a problem that is simpler than the ideal one above and yet still generates an admissible descent direction (and iterate). A natural way to try to do just that is to replace $r_x(\delta)$ by a proxy function $d_x(\delta)$ enjoying the same properties of positive definiteness and strict convexity. The corresponding approximate problem is then

$$ \tilde\delta_\beta \quad\!\! \in\quad\!\! \arg\min_{\delta \mid x+\delta \in C} \left[\left\langle \delta, \nabla f(x)\right\rangle + \beta d_x(\delta)\right]. $$

Let's now show that these problems can lead to admissible descent steps for the original problem. We can follow a reasoning similar to the one that showed the non-degeneracy of $\mathcal D_x$. In particular, observe that as $\beta\to\infty$ the penalty term dominates the linear term, so that $\|\tilde\delta_{\beta}\|\to 0$. Hence, there exists a $\beta^\bullet$ large enough such that for any $\nu \ge \beta^\bullet$, $\|\tilde\delta_\nu\|$ is small enough for $\tilde\delta_\nu$ to be in $\mathcal D_x$.
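To make this concrete, here is a hedged sketch of a single local problem for one specific proxy, $d_x(\delta) = \tfrac12\|\delta\|^2$, and a box constraint; with that choice the approximate problem has the closed-form solution $x+\tilde\delta_\beta = \Pi_C(x - \nabla f(x)/\beta)$. The quadratic objective and the box $C$ are assumptions made for this example.

```python
import numpy as np

# Toy setup (assumptions for this sketch): strictly convex quadratic on a box.
A = np.array([[3.0, 0.5], [0.5, 1.0]])            # positive definite
b = np.array([1.0, -2.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad_f = lambda x: A @ x - b
proj_C = lambda x: np.clip(x, -1.0, 1.0)          # C = [-1, 1]^2

def local_step(x, beta):
    """Solve min_{delta : x+delta in C} <delta, grad f(x)> + (beta/2) ||delta||^2.
    With this Euclidean proxy, the solution is x + delta = proj_C(x - grad_f(x)/beta)."""
    return proj_C(x - grad_f(x) / beta) - x

x = np.array([0.9, 0.9])
for beta in [1.0, 5.0, 25.0]:
    delta = local_step(x, beta)
    print(f"beta = {beta:5.1f}  ||delta|| = {np.linalg.norm(delta):.3f}  "
          f"descent: {f(x + delta) < f(x)}")
# larger beta gives smaller steps; for beta large enough the step is guaranteed
# to be an admissible descent step, as argued above.
```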

GPGD is back

Now that we know that this approximate problem can lead to an admissible step, we can iterate it with a sequence $\{\beta_k\}$:

$$ \tilde\delta_{\beta_k} \quad\!\! \in\quad\!\! \arg\min_{\delta\mid x_k+\delta \in C} \,\, \left[\left\langle \delta, \nabla f(x_k)\right\rangle + \beta_k d_{x_k}(\delta)\right]. $$

However, basic manipulation of that expression (set $x = x_k+\delta$ and $\alpha_k = 1/\beta_k$) shows that this is in fact the generalised projected gradient descent (GPGD) that we saw before.

The generalised projected gradient descent (GPGD) corresponds to the following iteration:

$$ x_{k+1} \quad\!\! \in\quad\!\! \arg\min_{x\in C} \left\{\left\langle x, \nabla f(x_k)\right\rangle + {1\over \alpha_k}d(x, x_k)\right\} $$

for some $\alpha_k>0$ and for any positive-definite function $d$ that is strictly convex in its first argument. It generates a minimising sequence for $f$ provided the $\alpha_k$ are small enough.
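Below is a hedged sketch of the GPGD iteration for the default choice $d(x, x_k) = \tfrac12\|x - x_k\|^2$, in which case the update reduces to the projected gradient step $x_{k+1} = \Pi_C(x_k - \alpha_k \nabla f(x_k))$; the quadratic objective, the box $C$ and the constant step size are assumptions made for the example.

```python
import numpy as np

A = np.array([[3.0, 0.5], [0.5, 1.0]])            # same toy problem as above
b = np.array([1.0, -2.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad_f = lambda x: A @ x - b
proj_C = lambda x: np.clip(x, -1.0, 1.0)          # C = [-1, 1]^2

def gpgd(x0, alpha=0.2, iters=50):
    """GPGD with d(x, x_k) = 0.5 ||x - x_k||^2 and a constant step size alpha."""
    x, values = x0, [f(x0)]
    for _ in range(iters):
        x = proj_C(x - alpha * grad_f(x))         # the Euclidean GPGD update
        values.append(f(x))
    return x, values

x_last, values = gpgd(np.array([0.9, 0.9]))
print("final iterate:", x_last)
print("minimising sequence:", all(v2 <= v1 for v1, v2 in zip(values, values[1:])))
```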

This may seem like it's not telling us that much: in particular, you could pick the $\alpha_k$ vanishingly small, which would indeed give a minimising sequence, but one that makes negligible progress per iteration. So at this point there are two comments we can make:

  1. ideal $\alpha_k$ encapsulate a tradeoff between steps that are too big, and may not be admissible, and steps that are too small to provide useful improvement (a standard way to navigate this tradeoff is a backtracking rule, sketched after this list),

  2. a key element that should hopefully be obvious by now corresponds to how we can interpret $d$: if we know nothing about the function at hand, we can just use a default $\|\cdot\|_2^2$ but if we do know something useful about the function (and, in fact, about $C$), then that could be encoded in $d$.
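Regarding the first point, a standard way of picking $\alpha_k$ (not discussed in these notes, so take it as a hedged sketch) is a backtracking rule: start from a candidate step size and shrink it until the resulting step is an admissible descent step with a sufficient decrease. The Armijo-style test and all the constants below are assumptions made for the example.

```python
import numpy as np

def backtracking_step(f, grad_f, proj_C, x, alpha0=1.0, shrink=0.5, c=1e-4):
    """One Euclidean GPGD step with a backtracking choice of alpha_k.

    Halve the candidate step size until the step yields at least a small
    fraction c of the decrease predicted by the local linearisation
    (an Armijo-style sufficient-decrease test -- an assumption of this sketch).
    """
    g = grad_f(x)
    alpha = alpha0
    while True:
        x_new = proj_C(x - alpha * g)
        delta = x_new - x
        if f(x_new) <= f(x) + c * (delta @ g):    # <delta, g> < 0 for a descent step
            return x_new, alpha
        alpha *= shrink

# usage on the same toy problem as before (assumptions for this sketch)
A = np.array([[3.0, 0.5], [0.5, 1.0]]); b = np.array([1.0, -2.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad_f = lambda x: A @ x - b
proj_C = lambda x: np.clip(x, -1.0, 1.0)

x_new, alpha = backtracking_step(f, grad_f, proj_C, np.array([0.9, 0.9]))
print("accepted alpha:", alpha, " new value:", f(x_new))
```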

Choice of distance and problem structure

The second point is very important: it should be clear to you that you'd want the local problems to be as informed as possible while at the same time you'd want the iterations to not be overly expensive to compute, two extreme cases being:

  1. taking $d$ to be the default squared Euclidean distance, which makes each iteration cheap but ignores everything we may know about $f$ and $C$,

  2. taking $d(x, x_k) = r_{x_k}(x-x_k)$, i.e. the exact remainder, which makes each local problem as informative, and as expensive, as the original problem.

This key tradeoff shows up in most gradient-based iterative methods you'll find in the literature.
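As a hedged illustration of the tradeoff (the functions and both choices of $d$ are assumptions made for the sketch, with $C=\mathbb R^2$ so no projection is needed): on a badly scaled quadratic, an uninformed Euclidean $d$ is compared with a $d$ that encodes the curvature of $f$ through a diagonal scaling.

```python
import numpy as np

H = np.diag([100.0, 1.0])                   # badly scaled curvature (assumption)
f = lambda x: 0.5 * x @ H @ x
grad_f = lambda x: H @ x

def descend(scaling, alpha, iters=100):
    """GPGD with d(x, x_k) = 0.5 (x - x_k)^T diag(scaling) (x - x_k) and C = R^2,
    i.e. the update x_{k+1} = x_k - alpha * grad_f(x_k) / scaling (elementwise)."""
    x = np.array([1.0, 1.0])
    for _ in range(iters):
        x = x - alpha * grad_f(x) / scaling
    return f(x)

print("uninformed d:", descend(scaling=np.ones(2), alpha=1e-2))
print("informed d:  ", descend(scaling=np.diag(H), alpha=1.0))
# the informed choice gets much closer to the minimum in the same number of
# iterations, at the price of knowing (or estimating) the scaling of f.
```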
