# Mirror descent algorithm

## From Euclidean to arbitrary projections

In the notes on projected gradient descent (PGD), we showed that one way to solve a constrained minimisation problem with a differentiable $f$ is to follow the iteration

$x_{k+1} \quad\!\! =\quad\!\! \pi_C(x_k - \alpha_k \nabla f(x_k)),$

where $\pi_C$ denotes the Euclidean projection onto $C$. By definition of the Euclidean projection, this can also be written as

$\begin{aligned} x_{k+1} \quad\!\!&=\quad\!\! \arg\min_{x\in C}\,\, \|(x_k-\alpha_k\nabla f(x_k))-x\|_2^2 \\ &=\quad\!\! \arg\min_{x\in C} \,\,\| (x-x_k) + \alpha_k\nabla f(x_k)\|_2^2.\end{aligned}$

Now, since the $\ell^2$-norm is induced by the inner product (i.e. $\left\langle x, x\right\rangle = \|x\|_2^2$), we can use the decomposition

$\|(x-x_k) + \alpha_k\nabla f(x_k)\|_2^2 \quad\!\! =\quad\!\! \|x-x_k\|_2^2 + 2\alpha_k\left\langle x, \nabla f(x_k)\right\rangle + M_k,$

where $M_k$ does not depend on $x$ (and can thus be ignored). Rearranging terms, we're then left with the equivalent iteration:

$x_{k+1} \quad\!\! =\quad\!\! \arg\min_{x\in C}\,\, \left\{\left\langle x, \nabla f(x_k)\right\rangle + {1\over \alpha_k} {\|x-x_k\|_2^2\over 2}\right\}.$

This new way of writing the PGD suggests two important observations:

- the objective shows how the PGD corresponds to a tradeoff between following the direction of the negative gradient (first term) and not moving too far from the current point (second term), while staying in $C$;

- the second term is an isotropic measure of distance from the current point $x_k$, but what if we used another measure of distance?
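Before moving on, here is a minimal numerical sketch of the PGD iteration in the arg-min form above. The quadratic objective and the unit-$\ell_2$-ball constraint are illustrative assumptions (not from these notes), chosen because both the projection and the true constrained minimiser have closed forms we can check against:

```python
import numpy as np

# Illustrative setup: f(x) = ||x - b||^2 / 2 over the unit l2 ball C.
# The Euclidean projection onto C has a closed form: pi_C(y) = y / max(1, ||y||).

def proj_ball(y):
    """Euclidean projection onto the unit l2 ball."""
    return y / max(1.0, np.linalg.norm(y))

def pgd_step(x, grad, alpha):
    """One PGD iteration: project the plain gradient step back onto C."""
    return proj_ball(x - alpha * grad)

rng = np.random.default_rng(0)
b = rng.normal(size=5) * 3          # target likely outside the ball, so the constraint binds
x = np.zeros(5)
for _ in range(200):
    x = pgd_step(x, x - b, alpha=0.5)   # grad f(x) = x - b

# For this f, the constrained minimiser is simply proj_ball(b).
print(np.allclose(x, proj_ball(b), atol=1e-6))
```

Since projection is 1-Lipschitz and the inner gradient map here is a contraction, the iterates converge geometrically to the constrained minimiser.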

The first point might be obvious if you've already seen gradient descent and related methods, but it is actually a bit deeper than it may seem at a cursory glance; I discuss this more in the thoughts on first order methods.

The second point gives the *generalised projected gradient descent*.

The *generalised projected gradient descent* (GPGD) is defined for any positive-definite function $d:\mathbb R^n\times\mathbb R^n\to\mathbb R^+$ by the following iteration:

$x_{k+1} \quad\!\! =\quad\!\! \arg\min_{x\in C}\,\, \left\{\left\langle x, \nabla f(x_k)\right\rangle + {1\over \alpha_k}\, d(x, x_k)\right\}.$

Such positive-definite functions are sometimes called *divergences*; they satisfy $d(u, v)>0$ for $u\neq v$ and $d(u, u)=0$.

One reason to consider another distance-like function to penalise how much we move in a particular direction is that it can better reflect what we know about the geometry of $C$, which can make steps easier to compute or accelerate convergence to a minimiser.

For a given divergence, it remains to compute the corresponding iteration steps, which may be intractable depending on $d$. But there happens to be a popular class of divergences for which it *is* tractable, and which we have already discussed: the Bregman divergences.

## Bregman divergences and the mirror descent algorithm

Recall that for a *strictly convex* and differentiable function $\psi$, we can define the *Bregman divergence* associated with $\psi$ as the positive-definite function $B_\psi$ with

$B_\psi(x, y) \quad\!\! =\quad\!\! \psi(x) - \psi(y) - \left\langle \nabla\psi(y),\, x - y\right\rangle.$
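A quick numerical sketch (the specific choices of $\psi$ are standard examples, not taken from these notes): computing $B_\psi(x,y)=\psi(x)-\psi(y)-\langle\nabla\psi(y),x-y\rangle$ for $\psi = \tfrac12\|\cdot\|_2^2$ recovers half the squared Euclidean distance, while the negative entropy recovers the KL divergence on the simplex:

```python
import numpy as np

def bregman(psi, grad_psi, x, y):
    """B_psi(x, y) = psi(x) - psi(y) - <grad psi(y), x - y>."""
    return psi(x) - psi(y) - np.dot(grad_psi(y), x - y)

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])

# psi = squared l2 norm / 2  ->  B_psi is half the squared Euclidean distance.
sq = bregman(lambda z: 0.5 * np.dot(z, z), lambda z: z, x, y)
print(np.isclose(sq, 0.5 * np.linalg.norm(x - y) ** 2))

# psi = negative entropy  ->  on the simplex, B_psi is the KL divergence.
neg_ent = lambda z: np.dot(z, np.log(z))
grad_neg_ent = lambda z: np.log(z) + 1.0
kl = bregman(neg_ent, grad_neg_ent, x, y)
print(np.isclose(kl, np.dot(x, np.log(x / y))))
```

Note that $B_\psi$ is in general not symmetric in its arguments, which is why it is a divergence and not a distance.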

Let us now consider a $\mu$-*strongly convex* and differentiable function $\varphi$ and the associated Bregman divergence $B_\varphi$. If we use this as the divergence in the GPGD, the iteration is

$x_{k+1} \quad\!\! =\quad\!\! \arg\min_{x\in C}\,\, \left\{\left\langle x, \nabla f(x_k)\right\rangle + {1\over \alpha_k}\, B_\varphi(x, x_k)\right\}$

for $\alpha_k>0$. It's straightforward to take the subgradient of the objective and write the first-order condition (FOC) for one step:

$0 \quad\!\! \in\quad\!\! \alpha_k\nabla f(x_k) + \nabla \varphi(x_{k+1}) - \nabla \varphi (x_k) + N_C(x_{k+1}),$

which, after rearranging terms, reads

$x_{k+1} \quad\!\! \in\quad\!\! (\nabla\varphi + N_C)^{-1} (\nabla \varphi(x_k) - \alpha_k \nabla f(x_k)).$

To complete this, observe that $(\nabla \varphi + N_C) \equiv \partial (\varphi + i_C)$, so that letting $\phi \equiv \varphi + i_C$ and using that for $h\in \Gamma_0$, $(\partial h)^{-1}\equiv\partial h^\star$ (cf. convex analysis part 2), we end up with the *mirror descent algorithm*.

The *mirror descent algorithm* (**MDA**) is the iteration

$x_{k+1} \quad\!\! =\quad\!\! \nabla\phi^\star\big(\nabla \varphi(x_k) - \alpha_k \nabla f(x_k)\big),$

where $\phi^\star(y) = \sup_{z\in C}[\left\langle z,y\right\rangle-\varphi(z)]$.
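As a concrete worked example (the entropic setup here is a standard illustration, not taken from these notes): take $C$ to be the probability simplex and $\varphi(x)=\sum_i x_i\log x_i$ the negative entropy, which is $1$-strongly convex on the simplex with respect to $\|\cdot\|_1$. Then $\nabla\varphi(x)_i = \log x_i + 1$, $\phi^\star$ is the log-sum-exp function, and $\nabla\phi^\star(y)_i = e^{y_i}/\sum_j e^{y_j}$, so the MDA step becomes the *exponentiated gradient* update

$(x_{k+1})_i \quad\!\! =\quad\!\! {(x_k)_i\, e^{-\alpha_k [\nabla f(x_k)]_i} \over \sum_j (x_k)_j\, e^{-\alpha_k [\nabla f(x_k)]_j}},$

where the constant $+1$ in $\nabla\varphi$ cancels in the normalisation. Note that no explicit Euclidean projection onto the simplex is ever needed.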

Note that $\phi$ is also strongly convex on $C$, so that $\phi^\star$ is differentiable, which is why we can write $\nabla \phi^\star$ (its gradient is even Lipschitz, as we showed in convex analysis part 3).

A final note: as for the PGD, the differentiability of $f$ can be relaxed to subdifferentiability without much change; the iteration is then $x_{k+1}=\nabla\phi^\star(\nabla \varphi(x_k)-\alpha_k f'(x_k))$ with $f'(x_k)\in \partial f(x_k)$.
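The MDA is short to implement once $\nabla\varphi$ and $\nabla\phi^\star$ are known. Here is a minimal sketch with the negative-entropy mirror map on the probability simplex (a standard illustrative choice, not from these notes), for which the update reduces to multiplicative reweighting followed by renormalisation; the linear objective is an assumption made so the minimiser is a known vertex:

```python
import numpy as np

# MDA on the probability simplex with the negative-entropy mirror map,
# i.e. the exponentiated-gradient update  x_{k+1} ∝ x_k * exp(-alpha_k * grad f(x_k)).

def mda_step(x, grad, alpha):
    w = x * np.exp(-alpha * grad)
    return w / w.sum()          # applying nabla phi* renormalises onto the simplex

c = np.array([0.9, 0.1, 0.5])   # f(x) = <c, x>, minimised at the vertex with smallest c_i
x = np.full(3, 1.0 / 3.0)       # start at the centre of the simplex
for _ in range(500):
    x = mda_step(x, c, alpha=0.5)   # grad f(x) = c for a linear objective

print(np.argmax(x))   # mass concentrates on coordinate 1, the smallest cost
```

Observe that every iterate stays strictly inside the simplex by construction, which is exactly the point: the mirror map makes the constraint handling implicit.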

## References

**Beck** and **Teboulle**, Mirror descent and nonlinear projected subgradient methods for convex optimization, 2003. – The paper behind the MDA; it also presents a convergence analysis and gives an example of application.

**Nemirovski**, Tutorial: mirror descent algorithms for large-scale deterministic and stochastic convex optimization, 2012. – A deck of slides from a 2012 presentation, covering the MDA as well as applications.