Projected gradient descent

Here we will show a general method to approach a constrained minimisation problem of a convex, differentiable function $f$ over a closed convex set $C \subset \mathbb{R}^n$. Such problems can be written in an unconstrained form, as we discussed in the introduction:

$$ \min_x\;\; f(x) + i_C(x), $$

where $i_C$ is the indicator function associated with the set.
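To make the reformulation concrete, here is a minimal sketch of the indicator function in code. The membership test `in_C` and the unit-ball example are illustrative assumptions, not part of the text; any closed convex set with a membership oracle would do.

```python
import numpy as np

def indicator(x, in_C):
    """Indicator function i_C: 0 if x lies in C, +inf otherwise.

    `in_C` is a membership test for the set C (an illustrative
    assumption; any closed convex set with such an oracle works).
    """
    return 0.0 if in_C(x) else np.inf

# Example: C = closed unit Euclidean ball.
in_ball = lambda x: np.linalg.norm(x) <= 1.0

print(indicator(np.array([0.5, 0.5]), in_ball))   # 0.0 (inside C)
print(indicator(np.array([2.0, 0.0]), in_ball))   # inf (outside C)
```

Adding $i_C$ to the objective makes any point outside $C$ infinitely costly, which is exactly how the constraint is absorbed into an unconstrained problem.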

To get to the method, we will need three main parts:

  1. discuss the first order condition (FOC) of the problem which will bring us to the notion of normal cone,

  2. discuss how the normal cone relates to the Euclidean projection,

  3. introduce the projected gradient descent algorithm.

First order condition

The FOC indicates that $x^\dagger \in C$ is a minimiser if and only if $0 \in \partial(f + i_C)(x^\dagger)$. However, remember that the subdifferential of a sum contains the sum of subdifferentials (see convex analysis pt. 2), so that if $0 \in \nabla f(x^\dagger) + \partial i_C(x^\dagger)$ then $x^\dagger$ is a minimiser.

If $y$ is a subgradient of $i_C$ at a point $x \in C$ then, by definition,

$$ i_C(z) \;\ge\; i_C(x) + \langle z - x, y\rangle, \quad \forall z. $$

For any $z \notin C$, this inequality is trivially verified since the left-hand side is infinite. Letting $z \in C$, and since $x \in C$ so that $i_C(x) = i_C(z) = 0$, we are left with

$$ 0 \;\ge\; \langle z - x, y\rangle, \quad \forall z \in C. $$

This inequality defines the subdifferential $\partial i_C$, but it also happens to be the definition of the normal cone to the convex set $C$ at $x$, denoted $N_C(x)$. In short:

$$ \partial i_C(x) \;=\; \{\, y \in \mathbb{R}^n \;\mid\; \langle z - x, y\rangle \le 0,\ \forall z \in C \,\} \;=\; N_C(x). $$

Bringing the pieces together, we have
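As a quick concrete example (not worked in the original text), take $C$ to be the non-negative orthant; the defining inequality then pins down the normal cone explicitly:

```latex
% Example: C = R^n_+ (non-negative orthant), x in C.
% y in N_C(x) requires <z - x, y> <= 0 for all z >= 0; testing
% z = x + e_i (forces y_i <= 0) and z_i = 0 (forces y_i x_i >= 0) gives
N_{\mathbb{R}^n_+}(x) \;=\; \{\, y \in \mathbb{R}^n \;\mid\; y \le 0,\ \ y_i = 0 \ \text{whenever } x_i > 0 \,\}.
```

So at an interior point the normal cone is just $\{0\}$, while on the boundary it opens up along the active constraints.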

$$ 0 \;\in\; \nabla f(x^\dagger) + N_C(x^\dagger) \quad \Longrightarrow \quad x^\dagger \;\in\; \arg\min_{x \in C}\, f(x). $$

Of course this doesn't really help much yet: at this point we don't know how to find an $x^\dagger$ such that the left-hand side holds. This is what the next section will explore.

Normal cone and Euclidean projection

A useful (and obvious) property of the normal cone which we shall use in the sequel is that it is invariant under non-negative scalar multiplication, i.e.:

$$ \text{if } u \in N_C(x) \text{ then } \alpha u \in N_C(x), \quad \forall \alpha \ge 0. $$

We will now show the connection between the normal cone and the Euclidean projection onto $C$. Remember that the Euclidean projection onto $C$ (denoted $\pi_C$) is defined as follows for any $z \in \mathbb{R}^n$:

$$ \pi_C(z) \;=\; \arg\min_{x \in C}\; \frac12 \|x - z\|_2^2. $$

Note that in the unconstrained problem, $C = \mathbb{R}^n$ and hence $\pi_C = \mathbb{I}$, the identity operator. Note also that the factor $1/2$ is here for aesthetics when computing the gradient, which is then just $(x - z)$.
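For a few common sets the projection has a simple closed form. The two below (a box and a Euclidean ball) are standard results, added here as an illustrative sketch rather than anything from the text above.

```python
import numpy as np

def proj_box(z, lo, hi):
    """Project z onto the box {x : lo <= x <= hi}: componentwise clipping."""
    return np.clip(z, lo, hi)

def proj_ball(z, radius=1.0):
    """Project z onto the Euclidean ball of given radius centred at 0:
    leave z alone if inside, otherwise rescale it onto the sphere."""
    nz = np.linalg.norm(z)
    return z if nz <= radius else (radius / nz) * z

print(proj_box(np.array([2.0, -3.0]), -1.0, 1.0))   # [ 1. -1.]
print(proj_ball(np.array([3.0, 4.0])))              # [0.6 0.8]
```

Both run in $O(n)$ time, which is what makes projection-based methods attractive whenever $C$ has such structure.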

Applying the first-order condition derived above to this projection problem, we have that if

$$ 0 \;\in\; (x^\dagger - z) + N_C(x^\dagger) $$

then $x^\dagger$ is a minimiser and thus $x^\dagger = \pi_C(z)$. This is equivalent to $z \in x^\dagger + N_C(x^\dagger)$, or $z \in (\mathbb{I} + N_C)(x^\dagger)$. Replacing $x^\dagger$ by $\pi_C(z)$ and rearranging terms gives

$$ \pi_C(z) \;=\; (\mathbb{I} + N_C)^{-1}(z). $$

This is a classical and important relationship.

Let $C$ denote a convex subset of $\mathbb{R}^n$; then

$$ \pi_C \;\equiv\; (\mathbb{I} + N_C)^{-1}. $$
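The identity can be sanity-checked numerically: it says $z - \pi_C(z) \in N_C(\pi_C(z))$, i.e. $\langle x - \pi_C(z),\, z - \pi_C(z)\rangle \le 0$ for every $x \in C$. A small check for the unit ball (an illustrative choice, not from the text):

```python
import numpy as np

def proj_ball(z):
    """Projection onto the closed unit Euclidean ball."""
    nz = np.linalg.norm(z)
    return z if nz <= 1.0 else z / nz

rng = np.random.default_rng(0)
z = np.array([2.0, 1.0])       # a point outside the ball
p = proj_ball(z)               # its projection

# The normal-cone inequality <x - p, z - p> <= 0 must hold for all x in C;
# we test it on random samples drawn inside the unit ball.
for _ in range(1000):
    x = rng.normal(size=2)
    if np.linalg.norm(x) <= 1.0:
        assert np.dot(x - p, z - p) <= 1e-12
```

This is exactly the variational characterisation of the projection: the residual $z - \pi_C(z)$ points "outward" from $C$ at $\pi_C(z)$.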

Projected gradient descent

We indicated above that for any $\alpha \ge 0$ and $u \in N_C(x)$, we have $\alpha u \in N_C(x)$. We can rewrite this as:

$$ (x + \alpha u) - x \;\in\; N_C(x), $$

which may seem pointless but will lead us to a fixed-point algorithm (remember from the introduction that many algorithms for convex optimisation can be expressed in terms of fixed point algorithms).

Let $z := x + \alpha u$, so that the inclusion above can be rearranged to $z \in (\mathbb{I} + N_C)(x)$. By the identity $\pi_C \equiv (\mathbb{I} + N_C)^{-1}$, this is equivalent to $x = \pi_C(z)$, or

$$ x \;=\; \pi_C(x + \alpha u), \quad (\alpha \ge 0,\ u \in N_C(x)). $$

To finish up, let's go back once more to the FOC, which indicates that if $x^\dagger$ is such that $-\nabla f(x^\dagger) \in N_C(x^\dagger)$ then it is a minimiser. Combined with the fixed-point characterisation above, we get the following useful fixed-point form for the minimiser of the constrained problem:

$$ x^\dagger \;=\; \pi_C(x^\dagger - \alpha \nabla f(x^\dagger)). $$
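We can verify this fixed-point property on a problem whose constrained minimiser is known. For $f(x) = \frac12\|x - c\|_2^2$ over the unit ball with $c$ outside the ball, the minimiser is the radial projection $c/\|c\|$ (the specific $c$ below is an illustrative choice):

```python
import numpy as np

def proj_ball(z):
    """Projection onto the closed unit Euclidean ball."""
    nz = np.linalg.norm(z)
    return z if nz <= 1.0 else z / nz

c = np.array([3.0, 4.0])            # minimise 0.5*||x - c||^2 over the unit ball
grad = lambda x: x - c              # gradient of the objective
x_star = c / np.linalg.norm(c)      # known minimiser: c scaled onto the sphere

# x_star should satisfy x = proj(x - alpha * grad(x)) for any alpha >= 0.
for alpha in (0.1, 0.5, 1.0):
    fp = proj_ball(x_star - alpha * grad(x_star))
    assert np.allclose(fp, x_star)
```

Note that the fixed point holds for every step size $\alpha \ge 0$, which is why the characterisation is useful: the iteration below can use any sensible step schedule.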

Note: had we not assumed that $f$ was differentiable on $C$, we would still have that there exists a subgradient $f'(x^\dagger) \in \partial f(x^\dagger)$ with $-f'(x^\dagger) \in N_C(x^\dagger)$, and consequently the fixed point $x^\dagger = \pi_C(x^\dagger - \alpha f'(x^\dagger))$, leading to the projected subgradient method. In practice, however, one tries to cast a convex objective function as the sum of a smooth function and a non-smooth one, following a gradient descent on the smooth part and using a projection for the non-smooth part (this is the case here, with $i_C$ being the non-smooth part).

The algorithm

Well, there's not much left to do. Applying a fixed-point iteration to the fixed-point form above from a starting point $x_0$ generates the sequence $\{x_0, x_1, \dots\}$ via

$$ x_{k+1} \;=\; \pi_C(x_k - \alpha_k \nabla f(x_k)), $$

where $\alpha_k > 0$ is the step size (ignoring the trivial case $\alpha_k = 0$). This is the projected gradient descent method.
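The iteration translates almost line for line into code. Below is a minimal sketch with a constant step size, applied to a small least-squares problem over a box (the problem data and step size are illustrative choices, not from the text):

```python
import numpy as np

def projected_gradient_descent(grad, proj, x0, alpha=0.1, n_iter=500):
    """Iterate x_{k+1} = proj(x_k - alpha * grad(x_k)); constant-step sketch."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = proj(x - alpha * grad(x))
    return x

# Example: minimise 0.5*||A x - b||^2 over the box C = [0, 1]^2.
A = np.array([[2.0, 0.0],
              [0.0, 1.0]])
b = np.array([3.0, -1.0])
grad = lambda x: A.T @ (A @ x - b)       # gradient of the least-squares objective
proj = lambda x: np.clip(x, 0.0, 1.0)    # projection onto the box

x = projected_gradient_descent(grad, proj, np.zeros(2), alpha=0.1)
# The unconstrained minimiser is (1.5, -1); the constrained one sits on the
# boundary of the box at (1, 0), which is where the iteration settles.
```

For a smooth objective, any constant $\alpha < 2/L$ (with $L$ the Lipschitz constant of the gradient) keeps the iteration stable; fancier step schedules only refine this.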

Assuming that the $\alpha_k$ are picked sensibly and basic regularity conditions on the problem are met, the method enjoys a convergence rate $(f(x_k) - f(x^\dagger)) = \mathcal{O}(k^{-1})$ (see references for more).

Note: this method is pretty easy to apply provided you can compute the projection. Of course, there may be situations for which computing the projection is as hard as the initial problem! But there are many special cases where an efficient projection can be applied (e.g. if $C$ is a polyhedron, i.e. corresponds to the set of $x$ such that $Ax \le b$ for some matrix $A$ and vector $b$). See for example Liu and Ye (2009).
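One classic special case is the probability simplex. The sort-based routine below is a standard $O(n \log n)$ algorithm in the spirit of the efficient projections referenced above (a sketch of the well-known method, not the exact algorithm of the cited paper):

```python
import numpy as np

def proj_simplex(z):
    """Project z onto the probability simplex {x : x >= 0, sum(x) = 1}.

    Classic sort-based routine: find the shift theta such that
    max(z - theta, 0) sums to one, then apply it componentwise.
    """
    u = np.sort(z)[::-1]                   # sort entries in decreasing order
    css = np.cumsum(u)
    k = np.arange(1, len(z) + 1)
    rho = np.nonzero(u - (css - 1.0) / k > 0)[0][-1]   # last positive index
    theta = (css[rho] - 1.0) / (rho + 1)               # optimal shift
    return np.maximum(z - theta, 0.0)

print(proj_simplex(np.array([2.0, 0.0])))   # [1. 0.]
```

A point already on the simplex is left unchanged, and a point far outside collapses onto the nearest face, which is exactly the behaviour projected gradient descent needs when $C$ is the simplex.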

Additional references

  1. Burke, The Gradient Projection Algorithm, 2014. Lecture notes at the University of Washington covering the topic in a bit more depth.

  2. Saunders, Notes on First-Order Methods for Minimizing Smooth Functions, 2017. Lecture notes at Stanford covering the topic (among others) and proving the linear convergence rate under regularity conditions.

  3. Liu, Ye, Efficient Euclidean Projections in Linear Time, 2009. A paper describing the projection problem and how it can be done efficiently in some cases.

See also the general references mentioned in the introduction.