Mirror descent algorithm

From Euclidean to arbitrary projections

In the notes on the projected gradient descent (PGD) we showed that one way to solve the constrained minimisation problem $\min_{x\in C} f(x)$ with a differentiable $f$ is to follow the iteration

$$ x_{k+1} = \pi_C(x_k - \alpha_k \nabla f(x_k)), $$

where $\pi_C$ denotes the Euclidean projection onto $C$.
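For concreteness, here is a minimal sketch of this iteration in Python; it is an illustration only, with $C$ taken to be the unit $\ell^2$ ball (whose projection has a simple closed form) and all names and parameters chosen for the example:

```python
import numpy as np

def project_unit_ball(x):
    # Euclidean projection onto the unit l2 ball: rescale if x lies outside
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

def pgd(grad_f, x0, alpha=0.1, iters=100):
    # projected gradient descent: gradient step, then project back onto C
    x = x0
    for _ in range(iters):
        x = project_unit_ball(x - alpha * grad_f(x))
    return x

# example: minimise f(x) = ||x - c||^2 / 2 over the unit ball for a c outside
# it; the minimiser is the projection of c, here c/||c|| = [1, 0]
c = np.array([2.0, 0.0])
print(pgd(lambda x: x - c, np.zeros(2)))
```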

By definition of the Euclidean projection, the iteration can also be written as

$$\begin{aligned} x_{k+1} &= \arg\min_{x\in C}\, \|(x_k - \alpha_k\nabla f(x_k)) - x\|_2^2 \\ &= \arg\min_{x\in C}\, \|(x - x_k) + \alpha_k\nabla f(x_k)\|_2^2.\end{aligned}$$

Now, since the $\ell^2$-norm is induced by the inner product (i.e. $\langle x, x\rangle = \|x\|_2^2$), we can use the decomposition

$$ \|(x - x_k) + \alpha_k\nabla f(x_k)\|_2^2 = \|x - x_k\|_2^2 + 2\alpha_k\langle x, \nabla f(x_k)\rangle + M_k, $$

where $M_k$ does not depend on $x$ (and can thus be ignored). Rearranging terms, we're then left with the equivalent iteration:

$$ x_{k+1} = \arg\min_{x\in C}\, \left\{\langle x, \nabla f(x_k)\rangle + \frac{1}{\alpha_k}\frac{\|x - x_k\|_2^2}{2}\right\}. $$

This new way of writing the PGD invites two important comments:

  1. the objective shows how the PGD corresponds to a tradeoff between following the direction of the negative gradient (first term) and not moving too much from the current point (second term) while staying in $C$,

  2. the second term is an isotropic measure of distance from the current point $x_k$, but what if we used another measure of distance?

The first point might be obvious to you if you've already seen gradient descent and related methods, but it's actually a bit deeper than it may seem at a cursory glance; I discuss this further in thoughts on first order methods.

The second point gives the generalised projected gradient descent.

The generalised projected gradient descent (GPGD) is defined for any positive-definite function $d:\mathbb R^n\times\mathbb R^n\to\mathbb R^+$ by the following iteration:

$$ x_{k+1} = \arg\min_{x\in C}\, \left\{\langle x, \nabla f(x_k)\rangle + \frac{1}{\alpha_k}\, d(x, x_k)\right\}. $$

Such positive-definite functions are sometimes called divergences; they satisfy $d(u, v) > 0$ for $u\neq v$ and $d(u, u) = 0$.

One reason to consider another distance-like function to penalise how much we move is that it can better reflect what we know about the geometry of $C$, which can make steps easier to compute or accelerate convergence to a minimiser.

For a given divergence, it remains to compute the corresponding iteration steps, which may be intractable depending on $d$. But there happens to be a popular class of divergences for which it is tractable, and which we have already discussed: the Bregman divergences.
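To make this concrete, here is a minimal sketch (with illustrative names, and $C$ described by scipy-style constraints) of a single GPGD step solved numerically by a general-purpose solver; for a generic $d$ this inner problem can be as hard as the original one, which is exactly why tractable special cases matter. As a sanity check, plugging in $d(u, v) = \|u - v\|_2^2/2$ recovers the PGD step.

```python
import numpy as np
from scipy.optimize import minimize

def gpgd_step(x_k, grad_k, alpha, d, constraints):
    # one GPGD step: minimise <x, grad f(x_k)> + d(x, x_k) / alpha over C,
    # with C given as scipy-style constraints (solved here numerically)
    objective = lambda x: x @ grad_k + d(x, x_k) / alpha
    return minimize(objective, x_k, constraints=constraints).x

# sanity check: d(u, v) = ||u - v||^2 / 2 recovers the Euclidean PGD step
d_euclid = lambda u, v: 0.5 * np.sum((u - v) ** 2)
ball = [{"type": "ineq", "fun": lambda x: 1.0 - x @ x}]  # unit l2 ball
x_k, grad_k = np.array([0.5, 0.5]), np.array([-2.0, 0.0])
print(gpgd_step(x_k, grad_k, alpha=1.0, d=d_euclid, constraints=ball))
# matches pi_C(x_k - grad_k) = [2.5, 0.5] / ||[2.5, 0.5]|| ≈ [0.98, 0.20]
```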

Bregman divergences and the mirror descent algorithm

Recall that for a strictly convex and differentiable function $\psi$, we can define the Bregman divergence associated with $\psi$ as the positive-definite function $B_\psi$ with

$$ B_\psi(x, y) := \psi(x) - \psi(y) - \langle x - y, \nabla \psi(y)\rangle. $$
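To make the definition concrete, here is a small sketch (names are mine, illustrative only) that evaluates $B_\psi$ for two classic choices: $\psi(x) = \tfrac12\|x\|_2^2$ gives back half the squared Euclidean distance, while $\psi(x) = \sum_i x_i\log x_i$ (negative entropy, on positive vectors) gives the Kullback-Leibler divergence on the simplex.

```python
import numpy as np

def bregman(psi, grad_psi, x, y):
    # B_psi(x, y) = psi(x) - psi(y) - <x - y, grad psi(y)>
    return psi(x) - psi(y) - (x - y) @ grad_psi(y)

# psi = ||x||^2 / 2  ->  B_psi(x, y) = ||x - y||^2 / 2
sq, grad_sq = lambda x: 0.5 * x @ x, lambda x: x

# psi = negative entropy  ->  B_psi(x, y) = KL(x || y) on the simplex
negent, grad_negent = lambda x: x @ np.log(x), lambda x: np.log(x) + 1.0

x, y = np.array([0.2, 0.8]), np.array([0.5, 0.5])
print(bregman(sq, grad_sq, x, y))          # 0.5 * ||x - y||^2 = 0.09
print(bregman(negent, grad_negent, x, y))  # sum_i x_i log(x_i / y_i)
```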

Let us now consider a $\mu$-strongly convex and differentiable function $\varphi$ and the associated Bregman divergence $B_\varphi$. Using it as the divergence in the GPGD iteration above gives

$$ x_{k+1} \in \arg\min_{x\in C}\, \left\{\langle x, \nabla f(x_k)\rangle + \frac{1}{\alpha_k} B_\varphi(x, x_k)\right\} $$

for $\alpha_k > 0$. It's straightforward to take the subgradient of the objective and write the first-order optimality condition (FOC) for one step:

$$ 0 \in \alpha_k\nabla f(x_k) + \nabla \varphi(x_{k+1}) - \nabla \varphi(x_k) + N_C(x_{k+1}), $$

where $N_C$ denotes the normal cone to $C$; after rearranging terms, this reads

$$ x_{k+1} \in (\nabla\varphi + N_C)^{-1}(\nabla \varphi(x_k) - \alpha_k \nabla f(x_k)). $$

To complete this, observe that $(\nabla \varphi + N_C) \equiv \partial(\varphi + i_C)$, where $i_C$ is the indicator function of $C$. Letting $\phi \equiv \varphi + i_C$ and using that $(\partial h)^{-1} \equiv \partial h^\star$ for $h\in \Gamma_0$ (cf. convex analysis part 2), we end up with the mirror descent algorithm.

The mirror descent algorithm (MDA) is the iteration

$$ x_{k+1} = \nabla\phi^\star(\nabla \varphi(x_k) - \alpha_k\nabla f(x_k)) $$

where $\phi^\star(y) = \sup_{z\in C}\,[\langle z, y\rangle - \varphi(z)]$.
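As a sanity check, taking $\varphi = \tfrac12\|\cdot\|_2^2$ (so that $\nabla\varphi = \mathrm{id}$) recovers the PGD: in that case

$$ \nabla\phi^\star(y) = \arg\max_{z\in C}\,\left[\langle z, y\rangle - \tfrac12\|z\|_2^2\right] = \arg\min_{z\in C}\, \tfrac12\|z - y\|_2^2 = \pi_C(y), $$

so the MDA update reduces to $x_{k+1} = \pi_C(x_k - \alpha_k\nabla f(x_k))$.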

Note that $\phi$ is also strongly convex on $C$, so that its conjugate $\phi^\star$ is differentiable, which is why we can write $\nabla\phi^\star$ (its gradient is even Lipschitz, as we showed in convex analysis part 3).

A final note: as for the PGD, the differentiability of $f$ can be relaxed to subdifferentiability without much change; the iteration is then $x_{k+1} = \nabla\phi^\star(\nabla\varphi(x_k) - \alpha_k f'(x_k))$ with $f'(x_k)\in \partial f(x_k)$.
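As a final illustration, here is a minimal sketch of the classic entropic instance of the MDA on the probability simplex (the exponentiated-gradient setup; all names and parameters here are illustrative). With $\varphi(x) = \sum_i x_i\log x_i$, the update has the closed form $x_{k+1} \propto x_k \odot e^{-\alpha_k\nabla f(x_k)}$, the normalisation playing the role of $\nabla\phi^\star$:

```python
import numpy as np

def mirror_descent_simplex(grad_f, x0, alpha=0.1, iters=200):
    # entropic mirror descent: phi = negative entropy, C = probability simplex;
    # grad phi*(y) is a softmax, so the update is a multiplicative one
    x = x0
    for _ in range(iters):
        x = x * np.exp(-alpha * grad_f(x))
        x /= x.sum()  # normalisation back onto the simplex
    return x

# example: minimise f(x) = <c, x> over the simplex; the minimiser puts all
# the mass on the smallest entry of c
c = np.array([3.0, 1.0, 2.0])
print(mirror_descent_simplex(lambda x: c, np.ones(3) / 3))  # ≈ [0, 1, 0]
```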

References

  1. Beck and Teboulle, Mirror descent and nonlinear projected subgradient methods for convex optimization, 2003. – The paper behind the MDA; it also presents a convergence analysis and gives an example of application.

  2. Nemirovski, Tutorial: mirror descent algorithms for large-scale deterministic and stochastic convex optimization, 2012. – A deck of slides from a 2012 presentation, covering the MDA as well as applications.