In the notes on the projected gradient descent (PGD) we showed that a way to solve the constrained minimisation problem $\min_{x\in C} f(x)$ with a differentiable $f$ is to follow the iteration

$$ x_{k+1} \;=\; P_C\left(x_k - \alpha_k \nabla f(x_k)\right), $$
where $P_C$ denotes the Euclidean projection onto $C$. By definition of the Euclidean projection, this can also be written as

$$ x_{k+1} \;=\; \arg\min_{x\in C}\, \left\|x - \left(x_k - \alpha_k \nabla f(x_k)\right)\right\|_2^2. $$
Now, since the $2$-norm is induced by the inner product (i.e. $\|x\|_2^2 = \langle x, x\rangle$), we can use the decomposition

$$ \left\|x - \left(x_k - \alpha_k \nabla f(x_k)\right)\right\|_2^2 \;=\; \|x - x_k\|_2^2 + 2\alpha_k\langle \nabla f(x_k), x - x_k\rangle + c_k, $$

where $c_k = \alpha_k^2\|\nabla f(x_k)\|_2^2$ does not depend on $x$ (and can thus be ignored). Rearranging terms, we're then left with the equivalent iteration:

$$ x_{k+1} \;=\; \arg\min_{x\in C}\left[\langle \nabla f(x_k), x - x_k\rangle + \frac{1}{2\alpha_k}\|x - x_k\|_2^2\right]. $$
This new way of writing the PGD calls for two important comments:
1. the objective shows how the PGD corresponds to a tradeoff between following the direction of the negative gradient (first term) and not moving too far from the current point (second term) while staying in $C$;
2. the second term is an *isotropic* measure of distance from the current point $x_k$; but what if we used another measure of distance?
The first point might be obvious to you if you've already seen the gradient descent and related methods, but it's actually a bit deeper than it may seem at a cursory glance; I discuss this more in the thoughts on first order methods.
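To check the equivalence of the two forms numerically, here is a minimal sketch (not from the original notes: the choice of $C$ as the unit ball, the quadratic objective and all names are mine, purely for illustration). It runs one step both as a projected gradient step and as the argmin above, the latter approximated by brute force on a grid:

```python
import numpy as np

def proj_ball(x):
    """Euclidean projection onto the unit ball C = {x : ||x||_2 <= 1}."""
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

# Toy objective f(x) = 0.5 * ||x - b||^2, so that grad f(x) = x - b.
b = np.array([2.0, -1.0])
grad_f = lambda x: x - b

x_k, alpha = np.array([0.5, 0.5]), 0.9

# Form 1: one projected gradient step P_C(x_k - alpha * grad f(x_k)).
x_proj = proj_ball(x_k - alpha * grad_f(x_k))

# Form 2: argmin over C of <grad f(x_k), x - x_k> + ||x - x_k||^2 / (2 alpha),
# approximated by exhaustive search over a fine grid covering the unit ball.
ts = np.linspace(-1.0, 1.0, 601)
grid = np.array([[u, v] for u in ts for v in ts if u * u + v * v <= 1.0])
obj = (grid - x_k) @ grad_f(x_k) + np.sum((grid - x_k) ** 2, axis=1) / (2 * alpha)
x_min = grid[np.argmin(obj)]

print(x_proj, x_min)  # the two points agree up to the grid resolution
```

The brute-force form is of course only there to check the algebra; in practice one would always use the projection directly.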
The second point gives the generalised projected gradient descent.
The generalised projected gradient descent (GPGD) is defined for any positive-definite function $d$ by the following iteration:

$$ x_{k+1} \;=\; \arg\min_{x\in C}\left[\langle \nabla f(x_k), x - x_k\rangle + \frac{1}{\alpha_k}\, d(x, x_k)\right]. $$
Those positive-definite functions are sometimes called *divergences* and verify $d(x, y) > 0$ for $x \neq y$ and $d(x, x) = 0$.
One reason why one might want to consider another distance-like function to penalise how much we move in a particular direction is that doing so can better reflect what we may know about the geometry of $C$, which can make steps easier to compute or accelerate convergence to a minimiser.
For a given divergence, it then remains to compute the corresponding iteration steps, which may be intractable depending on $d$. But there happens to be a popular class of divergences for which it is tractable and which we have already discussed: the Bregman divergences.
Recall that for a strictly convex and differentiable function $\psi$, we can define the Bregman divergence associated with $\psi$ as the positive-definite function

$$ B_\psi(x, y) \;=\; \psi(x) - \psi(y) - \langle \nabla\psi(y), x - y\rangle. $$
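To fix ideas, here are two classical examples (standard facts, not derived in these notes). Taking $\psi(x) = \frac{1}{2}\|x\|_2^2$ gives

$$ B_\psi(x, y) \;=\; \frac{1}{2}\|x - y\|_2^2, $$

so that the GPGD with this divergence is exactly the PGD; taking the negative entropy $\psi(x) = \sum_i x_i\log x_i$ restricted to the probability simplex gives the Kullback-Leibler divergence

$$ B_\psi(x, y) \;=\; \sum_i x_i\log\frac{x_i}{y_i}. $$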
Let us now consider a $\mu$-strongly convex and differentiable function $\psi$ and the associated Bregman divergence $B_\psi$. If we use this as divergence in the GPGD iteration above, we get

$$ x_{k+1} \;=\; \arg\min_{x\in C}\left[\langle \nabla f(x_k), x - x_k\rangle + \frac{1}{\alpha_k}\, B_\psi(x, x_k)\right] $$
for step sizes $\alpha_k > 0$. It's straightforward to take the subgradient of the objective (handling the constraint via the indicator function $i_C$ of $C$) and write the FOC for one step:

$$ 0 \;\in\; \nabla f(x_k) + \frac{1}{\alpha_k}\left(\nabla\psi(x_{k+1}) - \nabla\psi(x_k)\right) + \partial i_C(x_{k+1}), $$

which, after rearranging terms (and using that $\partial i_C(x_{k+1})$ is a cone, so that multiplying it by $\alpha_k > 0$ leaves it unchanged), reads

$$ \nabla\psi(x_{k+1}) + \partial i_C(x_{k+1}) \;\ni\; \nabla\psi(x_k) - \alpha_k \nabla f(x_k). $$
To complete this, observe that $\nabla\psi + \partial i_C = \partial(\psi + i_C)$ so that, letting $\bar\psi := \psi + i_C$, and using that $y \in \partial\bar\psi(x) \iff x \in \partial\bar\psi^*(y)$ for the convex conjugate $\bar\psi^*$ (cf. convex analysis part 2), we end up with the mirror descent algorithm.
The mirror descent algorithm (MDA) is the iteration

$$ x_{k+1} \;=\; \nabla\bar\psi^*\left(\nabla\psi(x_k) - \alpha_k \nabla f(x_k)\right), $$

where $\bar\psi = \psi + i_C$.
Note that $\bar\psi$ is also strongly convex on $C$ so that $\bar\psi^*$ is differentiable, which is why we can write $\nabla\bar\psi^*$ (its gradient is even Lipschitz continuous as we showed in convex analysis part 3).
A final note is that, as for the PGD, the differentiability of $f$ can be relaxed to sub-differentiability without much change; the iteration is then

$$ x_{k+1} \;=\; \nabla\bar\psi^*\left(\nabla\psi(x_k) - \alpha_k g_k\right) \quad\text{with}\quad g_k \in \partial f(x_k). $$
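To make the iteration concrete, here is a minimal sketch (again not from the notes; the set-up and names are mine) of the MDA on the probability simplex with $\psi$ the negative entropy, for which $\nabla\psi(x) = 1 + \log x$ and the step has the well-known closed form $x_{k+1, i} \propto x_{k, i}\exp(-\alpha_k g_i)$, the exponentiated-gradient update:

```python
import numpy as np

def mda_simplex(grad_f, x0, alphas):
    """Mirror descent on the simplex with negative-entropy psi (exponentiated gradient)."""
    x = x0.copy()
    for a in alphas:
        g = grad_f(x)
        x = x * np.exp(-a * g)  # dual step: exp(grad psi(x) - a * g), up to a constant
        x /= x.sum()            # nabla psi_bar^*: renormalise back onto the simplex
    return x

# Toy objective f(x) = <c, x>; over the simplex the minimiser is the vertex
# corresponding to the smallest entry of c.
c = np.array([0.3, 0.1, 0.7])
x = mda_simplex(lambda x: c, np.ones(3) / 3, alphas=[0.5] * 200)
print(x)  # essentially all mass on the second coordinate, the smallest entry of c
```

Note how the constraint is handled "for free" by the mirror map: no Euclidean projection onto the simplex is ever computed, which is precisely the kind of geometric fit mentioned above.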
- Beck and Teboulle, *Mirror descent and nonlinear projected subgradient methods for convex optimization*, 2003. – The paper behind the MDA; it also presents a convergence analysis and gives an example of application.
- Nemirovski, *Tutorial: mirror descent algorithms for large-scale deterministic and stochastic convex optimization*, 2012. – A deck of slides from a 2012 presentation, covering the MDA as well as applications.