Projected gradient descent

Here we will show a general method to approach a constrained minimisation problem of a convex, differentiable function $f$ over a closed convex set $C \subset \mathbb{R}^n$. Such problems can be written in an unconstrained form, as we discussed in the introduction:

$$ \min_x\;\; f(x) + i_C(x), $$

where $i_C$ is the indicator function associated with the set.
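To make the reformulation concrete, here is a minimal sketch of the indicator function in code. The membership test `in_C` and the unit-ball example are illustrative assumptions, not part of the text; any closed convex set with a membership oracle would do.

```python
import numpy as np

def indicator(x, in_C):
    """Indicator function i_C: 0 if x lies in C, +inf otherwise.

    `in_C` is a membership test for the set C (an illustrative
    assumption; any closed convex set with such an oracle works).
    """
    return 0.0 if in_C(x) else np.inf

# Example: C = closed unit Euclidean ball.
in_ball = lambda x: np.linalg.norm(x) <= 1.0

print(indicator(np.array([0.5, 0.5]), in_ball))   # 0.0 (inside C)
print(indicator(np.array([2.0, 0.0]), in_ball))   # inf (outside C)
```

Adding $i_C$ to the objective makes any point outside $C$ infinitely costly, which is exactly how the constraint is absorbed into an unconstrained problem.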

To get to the method, we will need three main parts:

  1. discuss the first order condition (FOC) of the problem which will bring us to the notion of normal cone,

  2. discuss how the normal cone relates to the Euclidean projection,

  3. introduce the projected gradient descent algorithm.

First order condition

The FOC indicates that $x^\dagger \in C$ is a minimiser if and only if $0 \in \partial(f + i_C)(x^\dagger)$. However, remember that the subdifferential of a sum contains the sum of subdifferentials (see convex analysis pt. 2), so that if $0 \in \nabla f(x^\dagger) + \partial i_C(x^\dagger)$ then $x^\dagger$ is a minimiser.

If $y$ is a subgradient of $i_C$ at a point $x \in C$ then, by definition,

$$ i_C(z) \;\ge\; i_C(x) + \langle z - x, y\rangle, \quad \forall z. $$

For any $z \notin C$, this inequality is trivially verified since the left-hand side is infinite. Letting $z \in C$, and since $x \in C$ so that $i_C(x) = i_C(z) = 0$, we are left with

$$ 0 \;\ge\; \langle z - x, y\rangle, \quad \forall z \in C. $$

This inequality defines the subdifferential $\partial i_C$, but it also happens to be the definition of the normal cone to the convex set $C$ at $x$, denoted $N_C(x)$. In short:

$$ \partial i_C(x) \;=\; \{\, y \in \mathbb{R}^n \;\mid\; \langle z - x, y\rangle \le 0,\ \forall z \in C \,\} \;=\; N_C(x). $$

Bringing the pieces together, we have
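As a quick concrete example (not worked in the original text), take $C$ to be the non-negative orthant; the defining inequality then pins down the normal cone explicitly:

```latex
% Example: C = R^n_+ (non-negative orthant), x in C.
% y in N_C(x) requires <z - x, y> <= 0 for all z >= 0; testing
% z = x + e_i (forces y_i <= 0) and z_i = 0 (forces y_i x_i >= 0) gives
N_{\mathbb{R}^n_+}(x) \;=\; \{\, y \in \mathbb{R}^n \;\mid\; y \le 0,\ \ y_i = 0 \ \text{whenever } x_i > 0 \,\}.
```

So at an interior point the normal cone is just $\{0\}$, while on the boundary it opens up along the active constraints.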

$$ 0 \;\in\; \nabla f(x^\dagger) + N_C(x^\dagger) \quad \Longrightarrow \quad x^\dagger \;\in\; \arg\min_{x \in C}\, f(x). $$

Of course this doesn't really help much yet: at this point we don't know how to find an $x^\dagger$ such that the left-hand side holds. This is what the next section will explore.

Normal cone and Euclidean projection

A useful (and obvious) property of the normal cone which we shall use in the sequel is that it is invariant under non-negative scalar multiplication, i.e.:

$$ \text{if } u \in N_C(x) \text{ then } \alpha u \in N_C(x), \quad \forall \alpha \ge 0. $$

We will now show the connection between the normal cone and the Euclidean projection onto $C$. Remember that the Euclidean projection onto $C$ (denoted $\pi_C$) is defined as follows for any $z \in \mathbb{R}^n$:

$$ \pi_C(z) \;=\; \arg\min_{x \in C}\; \frac12 \|x - z\|_2^2. $$

Note that in the unconstrained problem, $C = \mathbb{R}^n$ and hence $\pi_C = \mathbb{I}$, the identity operator. Note also that the factor $1/2$ is here for aesthetics when computing the gradient, which is then just $(x - z)$.
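For a few common sets the projection has a simple closed form. The two below (a box and a Euclidean ball) are standard results, added here as an illustrative sketch rather than anything from the text above.

```python
import numpy as np

def proj_box(z, lo, hi):
    """Project z onto the box {x : lo <= x <= hi}: componentwise clipping."""
    return np.clip(z, lo, hi)

def proj_ball(z, radius=1.0):
    """Project z onto the Euclidean ball of given radius centred at 0:
    leave z alone if inside, otherwise rescale it onto the sphere."""
    nz = np.linalg.norm(z)
    return z if nz <= radius else (radius / nz) * z

print(proj_box(np.array([2.0, -3.0]), -1.0, 1.0))   # [ 1. -1.]
print(proj_ball(np.array([3.0, 4.0])))              # [0.6 0.8]
```

Both run in $O(n)$ time, which is what makes projection-based methods attractive whenever $C$ has such structure.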

Applying the first-order condition derived above to this projection problem, we have that if

$$ 0 \;\in\; (x^\dagger - z) + N_C(x^\dagger) $$

then $x^\dagger$ is a minimiser and thus $x^\dagger = \pi_C(z)$. This is equivalent to $z \in x^\dagger + N_C(x^\dagger)$, or $z \in (\mathbb{I} + N_C)(x^\dagger)$. Replacing $x^\dagger$ by $\pi_C(z)$ and rearranging terms gives

$$ \pi_C(z) \;=\; (\mathbb{I} + N_C)^{-1}(z). $$

This is a classical and important relationship.

Let $C$ denote a convex subset of $\mathbb{R}^n$; then

$$ \pi_C \;\equiv\; (\mathbb{I} + N_C)^{-1}. $$
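The identity can be sanity-checked numerically: it says $z - \pi_C(z) \in N_C(\pi_C(z))$, i.e. $\langle x - \pi_C(z),\, z - \pi_C(z)\rangle \le 0$ for every $x \in C$. A small check for the unit ball (an illustrative choice, not from the text):

```python
import numpy as np

def proj_ball(z):
    """Projection onto the closed unit Euclidean ball."""
    nz = np.linalg.norm(z)
    return z if nz <= 1.0 else z / nz

rng = np.random.default_rng(0)
z = np.array([2.0, 1.0])       # a point outside the ball
p = proj_ball(z)               # its projection

# The normal-cone inequality <x - p, z - p> <= 0 must hold for all x in C;
# we test it on random samples drawn inside the unit ball.
for _ in range(1000):
    x = rng.normal(size=2)
    if np.linalg.norm(x) <= 1.0:
        assert np.dot(x - p, z - p) <= 1e-12
```

This is exactly the variational characterisation of the projection: the residual $z - \pi_C(z)$ points "outward" from $C$ at $\pi_C(z)$.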

Projected gradient descent

We indicated above that for any $\alpha \ge 0$ and $u \in N_C(x)$, we have $\alpha u \in N_C(x)$. We can rewrite this as:

$$ (x + \alpha u) - x \;\in\; N_C(x), $$

which may seem pointless but will lead us to a fixed-point algorithm (remember from the introduction that many algorithms for convex optimisation can be expressed in terms of fixed point algorithms).

Let $z := x + \alpha u$, so that the inclusion above can be rearranged to $z \in (\mathbb{I} + N_C)(x)$. By the identity $\pi_C \equiv (\mathbb{I} + N_C)^{-1}$, this is equivalent to $x = \pi_C(z)$, or

$$ x \;=\; \pi_C(x + \alpha u), \quad (\alpha \ge 0,\ u \in N_C(x)). $$

To finish up, let's go back once more to the FOC, which indicates that if $x^\dagger$ is such that $-\nabla f(x^\dagger) \in N_C(x^\dagger)$ then it is a minimiser. Combined with the fixed-point characterisation above, we get the following useful fixed-point form for the minimiser of the constrained problem:

$$ x^\dagger \;=\; \pi_C(x^\dagger - \alpha \nabla f(x^\dagger)). $$
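We can verify this fixed-point property on a problem whose constrained minimiser is known. For $f(x) = \frac12\|x - c\|_2^2$ over the unit ball with $c$ outside the ball, the minimiser is the radial projection $c/\|c\|$ (the specific $c$ below is an illustrative choice):

```python
import numpy as np

def proj_ball(z):
    """Projection onto the closed unit Euclidean ball."""
    nz = np.linalg.norm(z)
    return z if nz <= 1.0 else z / nz

c = np.array([3.0, 4.0])            # minimise 0.5*||x - c||^2 over the unit ball
grad = lambda x: x - c              # gradient of the objective
x_star = c / np.linalg.norm(c)      # known minimiser: c scaled onto the sphere

# x_star should satisfy x = proj(x - alpha * grad(x)) for any alpha >= 0.
for alpha in (0.1, 0.5, 1.0):
    fp = proj_ball(x_star - alpha * grad(x_star))
    assert np.allclose(fp, x_star)
```

Note that the fixed point holds for every step size $\alpha \ge 0$, which is why the characterisation is useful: the iteration below can use any sensible step schedule.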

Note: had we not assumed that $f$ was differentiable on $C$, we would still have that there exists a subgradient $f'(x^\dagger) \in \partial f(x^\dagger)$ with $-f'(x^\dagger) \in N_C(x^\dagger)$, and consequently the fixed point $x^\dagger = \pi_C(x^\dagger - \alpha f'(x^\dagger))$, leading to the projected subgradient method. In practice, however, one tries to cast a convex objective function as the sum of a smooth function and a non-smooth one, following a gradient descent on the smooth part and using a projection for the non-smooth part (this is the case here, with $i_C$ being the non-smooth part).

The algorithm

Well, there's not much left to do. Applying a fixed-point iteration to the fixed-point form above from a starting point $x_0$ generates the sequence $\{x_0, x_1, \dots\}$ via

$$ x_{k+1} \;=\; \pi_C(x_k - \alpha_k \nabla f(x_k)), $$

where $\alpha_k > 0$ is the step size (ignoring the trivial case $\alpha_k = 0$). This is the projected gradient descent method.
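The iteration translates almost line for line into code. Below is a minimal sketch with a constant step size, applied to a small least-squares problem over a box (the problem data and step size are illustrative choices, not from the text):

```python
import numpy as np

def projected_gradient_descent(grad, proj, x0, alpha=0.1, n_iter=500):
    """Iterate x_{k+1} = proj(x_k - alpha * grad(x_k)); constant-step sketch."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = proj(x - alpha * grad(x))
    return x

# Example: minimise 0.5*||A x - b||^2 over the box C = [0, 1]^2.
A = np.array([[2.0, 0.0],
              [0.0, 1.0]])
b = np.array([3.0, -1.0])
grad = lambda x: A.T @ (A @ x - b)       # gradient of the least-squares objective
proj = lambda x: np.clip(x, 0.0, 1.0)    # projection onto the box

x = projected_gradient_descent(grad, proj, np.zeros(2), alpha=0.1)
# The unconstrained minimiser is (1.5, -1); the constrained one sits on the
# boundary of the box at (1, 0), which is where the iteration settles.
```

For a smooth objective, any constant $\alpha < 2/L$ (with $L$ the Lipschitz constant of the gradient) keeps the iteration stable; fancier step schedules only refine this.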

Assuming that the $\alpha_k$ are picked sensibly and basic regularity conditions on the problem are met, the method enjoys a convergence rate $(f(x_k) - f(x^\dagger)) = \mathcal{O}(k^{-1})$ (see references for more).

Note: this method is pretty easy to apply provided you can compute the projection. Of course, there may be situations for which computing the projection is as hard as the initial problem! But there are many special cases where an efficient projection can be applied (e.g. if $C$ is a polyhedron, i.e. corresponds to the set of $x$ such that $Ax \le b$ for some matrix $A$ and vector $b$). See for example Liu and Ye (2009).
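One classic special case is the probability simplex. The sort-based routine below is a standard $O(n \log n)$ algorithm in the spirit of the efficient projections referenced above (a sketch of the well-known method, not the exact algorithm of the cited paper):

```python
import numpy as np

def proj_simplex(z):
    """Project z onto the probability simplex {x : x >= 0, sum(x) = 1}.

    Classic sort-based routine: find the shift theta such that
    max(z - theta, 0) sums to one, then apply it componentwise.
    """
    u = np.sort(z)[::-1]                   # sort entries in decreasing order
    css = np.cumsum(u)
    k = np.arange(1, len(z) + 1)
    rho = np.nonzero(u - (css - 1.0) / k > 0)[0][-1]   # last positive index
    theta = (css[rho] - 1.0) / (rho + 1)               # optimal shift
    return np.maximum(z - theta, 0.0)

print(proj_simplex(np.array([2.0, 0.0])))   # [1. 0.]
```

A point already on the simplex is left unchanged, and a point far outside collapses onto the nearest face, which is exactly the behaviour projected gradient descent needs when $C$ is the simplex.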

Additional references

  1. Burke, The Gradient Projection Algorithm, 2014. Lecture notes at the University of Washington covering the topic in a bit more depth.

  2. Saunders, Notes on First-Order Methods for Minimizing Smooth Functions, 2017. Lecture notes at Stanford covering the topic (among others) and proving the linear convergence rate under regularity conditions.

  3. Liu, Ye, Efficient Euclidean Projections in Linear Time, 2009. A paper describing the projection problem and how it can be done efficiently in some cases.

See also the general references mentioned in the introduction.