Thoughts on first order descent methods

This is an experimental set of notes, take it with appropriate care.

First order methods (FOM) broadly designate iterative methods for continuous and (sub)differentiable optimisation that mainly use information from the (sub)gradient of the function.

In these notes we consider again the constrained minimisation problem $\min_{x\in C} f(x)$ and, to simplify the presentation, we'll assume that $C$ is closed and convex and that $f$ is strictly convex and smooth on $C$.

What we're interested in, at a high level, is how to generate a minimising sequence for $f$, i.e., a sequence $\{x_k\}$ with $x_k\in C$ such that $f(x_{k+1}) < f(x_k)$ and $f(x_k) \to f(x^\dagger)$ as $k$ grows, where $x^\dagger$ denotes the minimiser of $f$ on $C$.

Remark: all norms $\|\cdot\|$ on this page are 2-norms.

Local linearisation

Local linearisation is basically what first order methods are about: form a linear approximation of the function around the current point and use it to move in a promising direction. Still, let's try to look at this from scratch.

Consider the local linearisation of $f$ around a point $a\in C^\circ$:

$$ f(x) \quad\!\! =\quad\!\! f(a) + \left\langle x-a, \nabla f(a)\right\rangle + r(x, a) $$

where $r(x, a)$ is the remainder function.

Note: here you may be thinking: Taylor expansion. But let's actually put that on the side for now and just assume that you're building this $f(a)+\left\langle x-a,\nabla f(a)\right\rangle$ and that you want to investigate the properties of the remainder function.

That function enjoys the following properties:

  1. $r(a, a)=0$, and $r(x, a)>0$ for all $x\neq a$ by strict convexity of $f$,

  2. $r(\cdot, a)$ is also strictly convex and smooth for all $a\in C^\circ$.

Let us introduce a more compact notation: $r_a(\delta) := r(a+\delta, a)$ which will be quite useful. It is easy to note that $r_a$ is smooth and strictly convex with $r_a(0)=0$ and, in fact, is globally minimised at $0$. By definition of strict convexity, we have

$$ r_a(\delta') \quad\!\! >\quad\!\! r_a(\delta) + \left\langle \delta'-\delta, \nabla r_a(\delta)\right\rangle, \quad \forall\delta' \neq \delta. $$

Taking $\delta'=0$ and rearranging terms yields

$$ r_a(\delta) \quad\!\! <\quad\!\! \left\langle \delta, \nabla r_a(\delta)\right\rangle \quad\!\! \le\quad\!\! \|\delta\|\|\nabla r_a(\delta)\|, $$

using Cauchy-Schwarz for the second inequality. Rearranging again gives $r_a(\delta)/\|\delta\| < \|\nabla r_a(\delta)\|$ for all $\delta\neq 0$. Since $r_a$ is smooth with $\nabla r_a(0)=0$ (as $0$ is its global minimiser), the right-hand side tends to $0$ as $\|\delta\|\to 0$, and we get the following result.

The function $r_a$ is $o(\|\delta\|)$ meaning

$$ \lim_{\|\delta\|\to 0} {r_a(\delta)\over \|\delta\|} \quad\!\! =\quad\!\! 0. $$

This could have been obtained directly from the Taylor expansion of $f$ but I think it's nice to obtain it using only notions of convexity. Another note is that the reasoning essentially still holds if $f$ is only convex and sub-differentiable.
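As a quick numerical sanity check (a sketch, not part of the argument), the snippet below estimates $r_a(\delta)/\|\delta\|$ for shrinking steps; the test function $f(x)=\sum_i e^{x_i}$, the point $a$ and the direction are assumptions made purely for illustration.

```python
import numpy as np

# Toy strictly convex, smooth function (an assumption for this sketch):
# f(x) = sum_i exp(x_i), with gradient exp(x).
f = lambda x: np.sum(np.exp(x))
grad_f = lambda x: np.exp(x)

a = np.array([0.3, -0.7])

def remainder(delta):
    """r_a(delta) = f(a + delta) - f(a) - <delta, grad f(a)>."""
    return f(a + delta) - f(a) - delta @ grad_f(a)

direction = np.array([1.0, 2.0])
direction /= np.linalg.norm(direction)

for eps in [1e-1, 1e-2, 1e-3, 1e-4]:
    delta = eps * direction
    print(f"{eps:8.0e}  r_a(delta)/||delta|| = {remainder(delta) / np.linalg.norm(delta):.2e}")
# the ratio shrinks roughly linearly with eps, consistent with r_a = o(||delta||)
```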

Admissible descent steps

Let's consider a point $x$ in $C$ and a step from $x$ to $x+\delta$ for some $\delta\neq 0$. We're interested in determining which steps $\delta$ are "good" to take. Using the notation from the previous point, we have

$$ f(x+\delta) \quad\!\! =\quad\!\! f(x)+\left\langle \delta, \nabla f(x)\right\rangle + r_x(\delta). $$

Such a step will be called an admissible descent step if $x+\delta\in C$ and if it decreases the function, i.e. if $f(x+\delta) < f(x)$ or:

$$ \left\langle \delta, \nabla f(x)\right\rangle + r_x(\delta) \quad\!\! <\quad\!\! 0. $$

Let $\mathcal D_x$ be the set of admissible descent steps from $x$; for $x\in C^\circ$ it is non-empty provided that $0<\|\nabla f(x)\|<\infty$. To show this, let $\delta_\epsilon := -\epsilon(g+v)$ with $\epsilon > 0$, where $g := \nabla f(x)/\|\nabla f(x)\|^2$ and $v$ is any vector orthogonal to $\nabla f(x)$.

Then just by plugging things in we have

$$ \left\langle \delta_\epsilon, \nabla f(x)\right\rangle + r_x(\delta_\epsilon) \quad\!\! =\quad\!\! -\epsilon + r_x(-\epsilon(g+v)) $$

but recall that $r_x(-\epsilon(g+v)) = o(\epsilon\|g+v\|)$ by the little-o result above. And since $g$ and $v$ are fixed, $r_x(-\epsilon(g+v)) = o(\epsilon)$. As a result, for sufficiently small $\epsilon$, the right-hand side is negative and the descent condition above holds for $\delta_\epsilon$.

For sufficiently small $\epsilon$, $\delta_\epsilon=-\epsilon(g+v)$ is an admissible descent step.

Note that, by construction (letting $\epsilon$ and $v$ vary), these $\delta_\epsilon$ cover a small half-ball around the origin, so that $\mathcal D_x$ is non-empty and also non-degenerate. More generally:

Let $w$ be such that $0<\|w\|=:\eta$ and $\left\langle w, \nabla f(x)\right\rangle<0$, then, provided $\eta$ is small enough, $w\in\mathcal D_x$.
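Here is a minimal sketch of that statement: take any direction making a negative inner product with the gradient and shrink it until it becomes an admissible descent step. The test function, the point and the direction are again assumptions, and $C=\mathbb R^2$ so feasibility is automatic.

```python
import numpy as np

# Same toy strictly convex function as before (an assumption for this sketch),
# with C = R^2 so that feasibility of x + w is automatic.
f = lambda x: np.sum(np.exp(x))
grad_f = lambda x: np.exp(x)

x = np.array([0.5, 1.0])
g = grad_f(x)

w = np.array([-1.0, 0.2])        # any direction with <w, grad f(x)> < 0 ...
assert w @ g < 0

eta = 1.0                        # ... is a descent step once scaled down enough
while f(x + eta * w) >= f(x):
    eta /= 2
print("admissible descent step of norm", eta * np.linalg.norm(w),
      "decrease:", f(x) - f(x + eta * w))
```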

Obviously, what we would like is to get the best possible step:

$$ \delta^\dagger \quad\!\! \in\quad\!\! \arg\min_{\delta \mid x+\delta \in C} \,\, \left[\left\langle \delta, \nabla f(x)\right\rangle+r_x(\delta)\right] $$

which leads directly to the minimiser $x^\dagger = x+\delta^\dagger$. Of course that's a bit silly since solving this problem is just as hard as solving the original one. However this expression will help us generate descent algorithms.

Local update schemes

What we would like is thus to consider a problem that is simpler than the ideal one above and yet still generates an admissible descent direction (and iterate). A natural way to try to do just that is to replace $r_x(\delta)$ by a proxy function $d_x(\delta)$ enjoying the same properties of positive definiteness and strict convexity. The corresponding approximate problem is then

$$ \tilde\delta_\beta \quad\!\! \in\quad\!\! \arg\min_{\delta \mid x+\delta \in C} \left[\left\langle \delta, \nabla f(x)\right\rangle + \beta d_x(\delta)\right]. $$

Let's now show that these problems can lead to admissible descent steps for the original problem. We can follow a reasoning similar to the one that showed the non-degeneracy of $\mathcal D_x$. In particular, observe that as $\beta\to\infty$ the penalty term dominates the linear term, so that $\|\tilde\delta_{\beta}\|\to 0$. Hence, there exists a $\beta^\bullet$ large enough such that for any $\nu \ge \beta^\bullet$, $\|\tilde\delta_\nu\|$ is small enough for $\tilde\delta_\nu$ to be in $\mathcal D_x$.
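To make this concrete, here is a hedged sketch of a single local problem for one specific proxy, $d_x(\delta) = \tfrac12\|\delta\|^2$, and a box constraint; with that choice the approximate problem has the closed-form solution $x+\tilde\delta_\beta = \Pi_C(x - \nabla f(x)/\beta)$. The quadratic objective and the box $C$ are assumptions made for this example.

```python
import numpy as np

# Toy setup (assumptions for this sketch): strictly convex quadratic on a box.
A = np.array([[3.0, 0.5], [0.5, 1.0]])            # positive definite
b = np.array([1.0, -2.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad_f = lambda x: A @ x - b
proj_C = lambda x: np.clip(x, -1.0, 1.0)          # C = [-1, 1]^2

def local_step(x, beta):
    """Solve min_{delta : x+delta in C} <delta, grad f(x)> + (beta/2) ||delta||^2.
    With this Euclidean proxy, the solution is x + delta = proj_C(x - grad_f(x)/beta)."""
    return proj_C(x - grad_f(x) / beta) - x

x = np.array([0.9, 0.9])
for beta in [1.0, 5.0, 25.0]:
    delta = local_step(x, beta)
    print(f"beta = {beta:5.1f}  ||delta|| = {np.linalg.norm(delta):.3f}  "
          f"descent: {f(x + delta) < f(x)}")
# larger beta gives smaller steps; for beta large enough the step is guaranteed
# to be an admissible descent step, as argued above.
```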

GPGD is back

Now that we know that this approximate problem can lead to an admissible step, we can iterate it with a sequence $\{\beta_k\}$:

$$ \tilde\delta_{\beta_k} \quad\!\! \in\quad\!\! \arg\min_{\delta\mid x_k+\delta \in C} \,\, \left[\left\langle \delta, \nabla f(x_k)\right\rangle + \beta_k d_{x_k}(\delta)\right]. $$

However, basic manipulation of that expression (set $x = x_k+\delta$ and $\alpha_k = 1/\beta_k$) shows that this is in fact the generalised projected gradient descent (GPGD) that we saw before.

The generalised projected gradient descent (GPGD) corresponds to the following iteration:

$$ x_{k+1} \quad\!\! \in\quad\!\! \arg\min_{x\in C} \left\{\left\langle x, \nabla f(x_k)\right\rangle + {1\over \alpha_k}d(x, x_k)\right\} $$

for some $\alpha_k>0$ and for any positive-definite function $d$ that is strictly convex in its first argument. It generates a minimising sequence for $f$ provided the $\alpha_k$ are small enough.
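Below is a hedged sketch of the GPGD iteration for the default choice $d(x, x_k) = \tfrac12\|x - x_k\|^2$, in which case the update reduces to the projected gradient step $x_{k+1} = \Pi_C(x_k - \alpha_k \nabla f(x_k))$; the quadratic objective, the box $C$ and the constant step size are assumptions made for the example.

```python
import numpy as np

A = np.array([[3.0, 0.5], [0.5, 1.0]])            # same toy problem as above
b = np.array([1.0, -2.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad_f = lambda x: A @ x - b
proj_C = lambda x: np.clip(x, -1.0, 1.0)          # C = [-1, 1]^2

def gpgd(x0, alpha=0.2, iters=50):
    """GPGD with d(x, x_k) = 0.5 ||x - x_k||^2 and a constant step size alpha."""
    x, values = x0, [f(x0)]
    for _ in range(iters):
        x = proj_C(x - alpha * grad_f(x))         # the Euclidean GPGD update
        values.append(f(x))
    return x, values

x_last, values = gpgd(np.array([0.9, 0.9]))
print("final iterate:", x_last)
print("minimising sequence:", all(v2 <= v1 for v1, v2 in zip(values, values[1:])))
```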

This may seem like it's not telling us that much: in particular, you could pick the $\alpha_k$ vanishingly small, which would indeed give a minimising sequence, but one that makes negligible progress per iteration. So at this point there are two comments we can make:

  1. ideal $\alpha_k$ encapsulate a tradeoff between steps that are too big, and may not be admissible, and steps that are too small to provide useful improvement (a standard way to navigate this tradeoff is a backtracking rule, sketched after this list),

  2. a key element that should hopefully be obvious by now corresponds to how we can interpret $d$: if we know nothing about the function at hand, we can just use a default $\|\cdot\|_2^2$ but if we do know something useful about the function (and, in fact, about $C$), then that could be encoded in $d$.
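Regarding the first point, a standard way of picking $\alpha_k$ (not discussed in these notes, so take it as a hedged sketch) is a backtracking rule: start from a candidate step size and shrink it until the resulting step is an admissible descent step with a sufficient decrease. The Armijo-style test and all the constants below are assumptions made for the example.

```python
import numpy as np

def backtracking_step(f, grad_f, proj_C, x, alpha0=1.0, shrink=0.5, c=1e-4):
    """One Euclidean GPGD step with a backtracking choice of alpha_k.

    Halve the candidate step size until the step yields at least a small
    fraction c of the decrease predicted by the local linearisation
    (an Armijo-style sufficient-decrease test -- an assumption of this sketch).
    """
    g = grad_f(x)
    alpha = alpha0
    while True:
        x_new = proj_C(x - alpha * g)
        delta = x_new - x
        if f(x_new) <= f(x) + c * (delta @ g):    # <delta, g> < 0 for a descent step
            return x_new, alpha
        alpha *= shrink

# usage on the same toy problem as before (assumptions for this sketch)
A = np.array([[3.0, 0.5], [0.5, 1.0]]); b = np.array([1.0, -2.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad_f = lambda x: A @ x - b
proj_C = lambda x: np.clip(x, -1.0, 1.0)

x_new, alpha = backtracking_step(f, grad_f, proj_C, np.array([0.9, 0.9]))
print("accepted alpha:", alpha, " new value:", f(x_new))
```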

Choice of distance and problem structure

The second point is very important: it should be clear to you that you'd want the local problems to be as informed as possible while at the same time you'd want the iterations to not be overly expensive to compute, two extreme cases being:

  1. taking $d$ to be the default squared Euclidean distance, which makes each iteration cheap but ignores everything we may know about $f$ and $C$,

  2. taking $d(x, x_k) = r_{x_k}(x-x_k)$, i.e. the exact remainder, which makes each local problem as informative, and as expensive, as the original problem.

This key tradeoff shows up in most gradient-based iterative methods you'll find in the literature.
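As a hedged illustration of the tradeoff (the functions and both choices of $d$ are assumptions made for the sketch, with $C=\mathbb R^2$ so no projection is needed): on a badly scaled quadratic, an uninformed Euclidean $d$ is compared with a $d$ that encodes the curvature of $f$ through a diagonal scaling.

```python
import numpy as np

H = np.diag([100.0, 1.0])                   # badly scaled curvature (assumption)
f = lambda x: 0.5 * x @ H @ x
grad_f = lambda x: H @ x

def descend(scaling, alpha, iters=100):
    """GPGD with d(x, x_k) = 0.5 (x - x_k)^T diag(scaling) (x - x_k) and C = R^2,
    i.e. the update x_{k+1} = x_k - alpha * grad_f(x_k) / scaling (elementwise)."""
    x = np.array([1.0, 1.0])
    for _ in range(iters):
        x = x - alpha * grad_f(x) / scaling
    return f(x)

print("uninformed d:", descend(scaling=np.ones(2), alpha=1e-2))
print("informed d:  ", descend(scaling=np.diag(H), alpha=1.0))
# the informed choice gets much closer to the minimum in the same number of
# iterations, at the price of knowing (or estimating) the scaling of f.
```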
