# Mirror descent algorithm

## From Euclidean to arbitrary projections

In the notes on projected gradient descent (PGD), we showed that one way to solve the constrained minimisation problem $\min_{x\in C} f(x)$ with a differentiable $f$ is to follow the iteration

$x_{k+1} \quad\!\! =\quad\!\! \pi_C(x_k - \alpha_k \nabla f(x_k)),$

where $\pi_C$ denotes the Euclidean projection onto $C$. By definition of the Euclidean projection, this can also be written as

\begin{aligned} x_{k+1} \quad\!\!&=\quad\!\! \arg\min_{x\in C}\,\, \|(x_k-\alpha_k\nabla f(x_k))-x\|_2^2 \\ &=\quad\!\! \arg\min_{x\in C} \,\,\| (x-x_k) + \alpha_k\nabla f(x_k)\|_2^2.\end{aligned}

Now, since the $\ell^2$-norm is induced by the inner product (i.e. $\left\langle x, x\right\rangle = \|x\|_2^2$), we can use the decomposition

$\|(x-x_k) + \alpha_k\nabla f(x_k)\|_2^2 \quad\!\! =\quad\!\! \|x-x_k\|_2^2 + 2\alpha_k\left\langle x, \nabla f(x_k)\right\rangle + M_k,$

where $M_k = \alpha_k^2\|\nabla f(x_k)\|_2^2 - 2\alpha_k\left\langle x_k, \nabla f(x_k)\right\rangle$ does not depend on $x$ (and can thus be ignored). Dropping $M_k$ and dividing through by $2\alpha_k$, we're left with the equivalent iteration:

$x_{k+1} \quad\!\! =\quad\!\! \arg\min_{x\in C}\,\, \left\{\left\langle x, \nabla f(x_k)\right\rangle + {1\over \alpha_k} {\|x-x_k\|_2^2\over 2}\right\}.$
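To make this concrete, here is a minimal numerical sketch of the PGD (the toy problem, step size and function names are illustrative, not from the notes): minimising a simple quadratic over the unit Euclidean ball, where the projection has a closed form.

```python
import numpy as np

def project_ball(x):
    """Euclidean projection onto the unit ball {x : ||x||_2 <= 1}."""
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

def pgd(grad_f, project, x0, alpha=0.1, iters=500):
    """Projected gradient descent: x <- pi_C(x - alpha * grad f(x))."""
    x = x0
    for _ in range(iters):
        x = project(x - alpha * grad_f(x))
    return x

# Toy problem: f(x) = ||x - b||^2 / 2 with b outside the ball; the
# constrained minimiser is the projection of b onto the ball, b/||b||.
b = np.array([2.0, 0.0])
x_star = pgd(lambda x: x - b, project_ball, np.zeros(2))
# x_star is close to (1, 0).
```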

This new way of writing the PGD invites two important observations:

1. the objective shows how the PGD corresponds to a tradeoff between following the direction of the negative gradient (first term) and not moving too much from the current point (second term) while staying in $C$,

2. the second term is an isotropic measure of distance from the current point $x_k$, but what if we used another measure of distance?

The first point might be obvious to you if you've already seen gradient descent and related methods, but it's actually a bit deeper than it may seem at a cursory glance; I discuss this further in the thoughts on first order methods.

The second point gives the generalised projected gradient descent.

The generalised projected gradient descent (GPGD) is defined for any positive-definite function $d:\mathbb R^n\times\mathbb R^n\to\mathbb R^+$ by the following iteration:

$x_{k+1} \quad\!\! =\quad\!\! \arg\min_{x\in C}\,\, \left\{\left\langle x, \nabla f(x_k)\right\rangle + {1\over \alpha_k} d(x, x_k)\right\}.$

Such positive-definite functions are sometimes called divergences; they satisfy $d(u, v)>0$ for $u\neq v$ and $d(u, u)=0$.

One reason why one might want to consider another distance-like function to penalise how much we move in a particular direction is that doing so can better reflect what we may know about the geometry of $C$ which can make steps easier to compute or accelerate convergence to a minimiser.

For a given divergence, it remains to compute the corresponding iteration steps, which may be intractable depending on $d$. But there happens to be a popular class of divergences for which it is tractable and which we have already discussed: the Bregman divergences.

## Bregman divergences and the mirror descent algorithm

Recall that for a strictly convex and differentiable function $\psi$, we can define the Bregman divergence associated with $\psi$ as a positive-definite function $B_\psi$ with

$B_\psi(x, y) \quad\!\! :=\quad\!\! \psi(x)-\psi(y)-\left\langle x - y, \nabla \psi(y)\right\rangle.$
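As a quick numerical sanity check (illustrative, not from the text), the definition above recovers two familiar divergences: half the squared Euclidean distance for $\psi(x)=\|x\|_2^2/2$, and the KL divergence for the negative entropy $\psi(x)=\sum_i x_i\log x_i$ when $x, y$ lie on the probability simplex.

```python
import numpy as np

def bregman(psi, grad_psi, x, y):
    """B_psi(x, y) = psi(x) - psi(y) - <x - y, grad psi(y)>."""
    return psi(x) - psi(y) - np.dot(x - y, grad_psi(y))

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])

# psi = ||.||^2 / 2  ->  B_psi(x, y) = ||x - y||^2 / 2
b_euc = bregman(lambda z: 0.5 * z @ z, lambda z: z, x, y)

# psi = negative entropy  ->  B_psi(x, y) = KL(x || y) on the simplex
# (grad psi(z) = 1 + log z; the constant cancels since sum(x - y) = 0)
b_ent = bregman(lambda z: np.sum(z * np.log(z)),
                lambda z: 1.0 + np.log(z), x, y)
kl = np.sum(x * np.log(x / y))
```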

Let us now consider a $\mu$-strongly convex and differentiable function $\varphi$ and the associated Bregman divergence $B_\varphi$. If we use this as the divergence in the GPGD, the iteration is

$x_{k+1} \quad\!\! \in\quad\!\! \arg \min_{x\in C} \,\,\left\{ \left\langle x, \nabla f(x_k)\right\rangle + {1\over \alpha_k}B_\varphi(x, x_k)\right\}$

for $\alpha_k>0$. Taking the subgradient of the objective, the first-order optimality condition (FOC) for one step reads

$0 \quad\!\! \in\quad\!\! \alpha_k\nabla f(x_k) + \nabla \varphi(x_{k+1}) - \nabla \varphi (x_k) + N_C(x_{k+1}),$

$x_{k+1} \quad\!\! \in\quad\!\! (\nabla\varphi + N_C)^{-1} (\nabla \varphi(x_k) - \alpha_k \nabla f(x_k)).$

To complete this, observe that $(\nabla \varphi + N_C) \equiv \partial (\varphi + i_C)$ so that letting $\phi \equiv \varphi + i_C$, and using that for $h\in \Gamma_0$, $(\partial h)^{-1}\equiv\partial h^\star$ (cf. convex analysis part 2), we end up with the mirror descent algorithm.

The mirror descent algorithm (MDA) is the iteration

$x_{k+1} \quad\!\! =\quad\!\! \nabla\phi^\star (\nabla \varphi(x_k) - \alpha_k\nabla f(x_k))$

where $\phi^\star(y) = \sup_{z\in C}[\left\langle z,y\right\rangle-\varphi(z)]$.
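As a classical worked example, take $\varphi(x)=\sum_i x_i\log x_i$ (the negative entropy, which is $1$-strongly convex with respect to the $\ell^1$-norm) and $C$ the probability simplex. Then $\nabla\varphi(x) = \mathbf 1 + \log x$ and $\nabla\phi^\star$ amounts to exponentiating and renormalising, so the MDA update has the closed form

$x_{k+1, i} \quad\!\! =\quad\!\! {x_{k,i}\exp(-\alpha_k [\nabla f(x_k)]_i) \over \sum_j x_{k,j}\exp(-\alpha_k [\nabla f(x_k)]_j)},$

which is the well-known exponentiated gradient update.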

Note that $\phi$ is also strongly convex on $C$, so that its conjugate $\phi^\star$ is differentiable, which is why we can write $\nabla \phi^\star$ (its gradient is even Lipschitz, as we showed in convex analysis part 3).

A final note: as for the PGD, the differentiability of $f$ can be relaxed to sub-differentiability without much change; the iteration is then $x_{k+1}=\nabla\phi^\star(\nabla \varphi(x_k)-\alpha_k f'(x_k))$ with $f'(x_k)\in \partial f(x_k)$.
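As an illustrative sketch (the toy problem and names are mine, not from the notes), here is the MDA with $\varphi$ the negative entropy on the probability simplex, for which the update reduces to the exponentiated-gradient form $x_{k+1}\propto x_k\odot e^{-\alpha_k\nabla f(x_k)}$:

```python
import numpy as np

def mirror_descent_simplex(grad_f, x0, alpha=0.1, iters=1000):
    """Entropic mirror descent on the probability simplex."""
    x = x0
    for _ in range(iters):
        # nabla phi(x_k) - alpha_k nabla f(x_k), then nabla phi* maps back
        # onto the simplex, which amounts to exponentiating + normalising.
        w = x * np.exp(-alpha * grad_f(x))
        x = w / w.sum()
    return x

# Toy problem: minimise f(x) = <c, x> over the simplex; the minimiser puts
# all mass on the smallest coordinate of c.
c = np.array([3.0, 1.0, 2.0])
x_star = mirror_descent_simplex(lambda x: c, np.ones(3) / 3)
# x_star concentrates on coordinate 1 (the argmin of c).
```

Note that each step only costs an elementwise exponential and a normalisation, whereas the Euclidean projection onto the simplex would require a (slightly more involved) sorting-based routine; this is the "steps easier to compute" advantage mentioned above.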

## References

1. Beck and Teboulle, Mirror descent and nonlinear projected subgradient methods for convex optimization, 2003. – The paper behind the MDA; it also presents a convergence analysis and gives an example of application.

2. Nemirovski, Tutorial: mirror descent algorithms for large-scale deterministic and stochastic convex optimization, 2012. – A deck of slides from a 2012 presentation, covering the MDA as well as applications.