In the notes on the projected gradient descent (PGD) we showed that a way to solve the constrained minimisation problem $\min_{x\in C} f(x)$ with a differentiable $f$ is to follow the iteration

$$ x_{k+1} \;=\; P_C\left(x_k - \alpha_k \nabla f(x_k)\right), $$
where $P_C$ denotes the Euclidean projection onto $C$. By definition of the Euclidean projection, this can also be written as

$$ x_{k+1} \;=\; \arg\min_{x\in C}\, \left\|x - \left(x_k - \alpha_k \nabla f(x_k)\right)\right\|_2^2. $$
Now, since the $2$-norm is induced by the inner product (i.e. $\|x\|_2^2 = \langle x, x\rangle$), we can use the decomposition

$$ \left\|x - \left(x_k - \alpha_k \nabla f(x_k)\right)\right\|_2^2 \;=\; \|x - x_k\|_2^2 + 2\alpha_k\langle \nabla f(x_k), x - x_k\rangle + c_k, $$

where $c_k = \alpha_k^2\|\nabla f(x_k)\|_2^2$ does not depend on $x$ (and can thus be ignored). Rearranging terms, we're then left with the equivalent iteration:

$$ x_{k+1} \;=\; \arg\min_{x\in C}\left[\langle \nabla f(x_k), x - x_k\rangle + \frac{1}{2\alpha_k}\|x - x_k\|_2^2\right]. $$
This new way of writing the PGD calls for two important comments:
1. the objective shows how the PGD corresponds to a tradeoff between following the direction of the negative gradient (first term) and not moving too far from the current point (second term) while staying in $C$;
2. the second term is an *isotropic* measure of distance from the current point $x_k$; but what if we used another measure of distance?
The first point might be obvious to you if you've already seen the gradient descent and related methods, but it's actually a bit deeper than it may seem at a cursory glance; I discuss this more in the thoughts on first order methods.
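To check the equivalence of the two forms numerically, here is a minimal sketch (not from the original notes: the choice of $C$ as the unit ball, the quadratic objective and all names are mine, purely for illustration). It runs one step both as a projected gradient step and as the argmin above, the latter approximated by brute force on a grid:

```python
import numpy as np

def proj_ball(x):
    """Euclidean projection onto the unit ball C = {x : ||x||_2 <= 1}."""
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

# Toy objective f(x) = 0.5 * ||x - b||^2, so that grad f(x) = x - b.
b = np.array([2.0, -1.0])
grad_f = lambda x: x - b

x_k, alpha = np.array([0.5, 0.5]), 0.9

# Form 1: one projected gradient step P_C(x_k - alpha * grad f(x_k)).
x_proj = proj_ball(x_k - alpha * grad_f(x_k))

# Form 2: argmin over C of <grad f(x_k), x - x_k> + ||x - x_k||^2 / (2 alpha),
# approximated by exhaustive search over a fine grid covering the unit ball.
ts = np.linspace(-1.0, 1.0, 601)
grid = np.array([[u, v] for u in ts for v in ts if u * u + v * v <= 1.0])
obj = (grid - x_k) @ grad_f(x_k) + np.sum((grid - x_k) ** 2, axis=1) / (2 * alpha)
x_min = grid[np.argmin(obj)]

print(x_proj, x_min)  # the two points agree up to the grid resolution
```

The brute-force form is of course only there to check the algebra; in practice one would always use the projection directly.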
The second point gives the generalised projected gradient descent.
The generalised projected gradient descent (GPGD) is defined for any positive-definite function $d$ by the following iteration:

$$ x_{k+1} \;=\; \arg\min_{x\in C}\left[\langle \nabla f(x_k), x - x_k\rangle + \frac{1}{\alpha_k}\, d(x, x_k)\right]. $$
Those positive-definite functions are sometimes called *divergences* and verify $d(x, y) > 0$ for $x \neq y$ and $d(x, x) = 0$.
One reason why one might want to consider another distance-like function to penalise how much we move in a particular direction is that doing so can better reflect what we may know about the geometry of $C$, which can make steps easier to compute or accelerate convergence to a minimiser.
For a given divergence, it then remains to compute the corresponding iteration steps, which may be intractable depending on $d$. But there happens to be a popular class of divergences for which it is tractable and which we have already discussed: the Bregman divergences.
Recall that for a strictly convex and differentiable function $\psi$, we can define the Bregman divergence associated with $\psi$ as the positive-definite function

$$ B_\psi(x, y) \;=\; \psi(x) - \psi(y) - \langle \nabla\psi(y), x - y\rangle. $$
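To fix ideas, here are two classical examples (standard facts, not derived in these notes). Taking $\psi(x) = \frac{1}{2}\|x\|_2^2$ gives

$$ B_\psi(x, y) \;=\; \frac{1}{2}\|x - y\|_2^2, $$

so that the GPGD with this divergence is exactly the PGD; taking the negative entropy $\psi(x) = \sum_i x_i\log x_i$ restricted to the probability simplex gives the Kullback-Leibler divergence

$$ B_\psi(x, y) \;=\; \sum_i x_i\log\frac{x_i}{y_i}. $$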
Let us now consider a $\mu$-strongly convex and differentiable function $\psi$ and the associated Bregman divergence $B_\psi$. If we use this as divergence in the GPGD iteration above, we get

$$ x_{k+1} \;=\; \arg\min_{x\in C}\left[\langle \nabla f(x_k), x - x_k\rangle + \frac{1}{\alpha_k}\, B_\psi(x, x_k)\right] $$
for step sizes $\alpha_k > 0$. It's straightforward to take the subgradient of the objective (handling the constraint via the indicator function $i_C$ of $C$) and write the FOC for one step:

$$ 0 \;\in\; \nabla f(x_k) + \frac{1}{\alpha_k}\left(\nabla\psi(x_{k+1}) - \nabla\psi(x_k)\right) + \partial i_C(x_{k+1}), $$

which, after rearranging terms (and using that $\partial i_C(x_{k+1})$ is a cone, so that multiplying it by $\alpha_k > 0$ leaves it unchanged), reads

$$ \nabla\psi(x_{k+1}) + \partial i_C(x_{k+1}) \;\ni\; \nabla\psi(x_k) - \alpha_k \nabla f(x_k). $$
To complete this, observe that $\nabla\psi + \partial i_C = \partial(\psi + i_C)$ so that, letting $\bar\psi := \psi + i_C$, and using that $y \in \partial\bar\psi(x) \iff x \in \partial\bar\psi^*(y)$ for the convex conjugate $\bar\psi^*$ (cf. convex analysis part 2), we end up with the mirror descent algorithm.
The mirror descent algorithm (MDA) is the iteration

$$ x_{k+1} \;=\; \nabla\bar\psi^*\left(\nabla\psi(x_k) - \alpha_k \nabla f(x_k)\right), $$

where $\bar\psi = \psi + i_C$.
Note that $\bar\psi$ is also strongly convex on $C$ so that $\bar\psi^*$ is differentiable, which is why we can write $\nabla\bar\psi^*$ (its gradient is even Lipschitz continuous as we showed in convex analysis part 3).
A final note is that, as for the PGD, the differentiability of $f$ can be relaxed to sub-differentiability without much change; the iteration is then

$$ x_{k+1} \;=\; \nabla\bar\psi^*\left(\nabla\psi(x_k) - \alpha_k g_k\right) \quad\text{with}\quad g_k \in \partial f(x_k). $$
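To make the iteration concrete, here is a minimal sketch (again not from the notes; the set-up and names are mine) of the MDA on the probability simplex with $\psi$ the negative entropy, for which $\nabla\psi(x) = 1 + \log x$ and the step has the well-known closed form $x_{k+1, i} \propto x_{k, i}\exp(-\alpha_k g_i)$, the exponentiated-gradient update:

```python
import numpy as np

def mda_simplex(grad_f, x0, alphas):
    """Mirror descent on the simplex with negative-entropy psi (exponentiated gradient)."""
    x = x0.copy()
    for a in alphas:
        g = grad_f(x)
        x = x * np.exp(-a * g)  # dual step: exp(grad psi(x) - a * g), up to a constant
        x /= x.sum()            # nabla psi_bar^*: renormalise back onto the simplex
    return x

# Toy objective f(x) = <c, x>; over the simplex the minimiser is the vertex
# corresponding to the smallest entry of c.
c = np.array([0.3, 0.1, 0.7])
x = mda_simplex(lambda x: c, np.ones(3) / 3, alphas=[0.5] * 200)
print(x)  # essentially all mass on the second coordinate, the smallest entry of c
```

Note how the constraint is handled "for free" by the mirror map: no Euclidean projection onto the simplex is ever computed, which is precisely the kind of geometric fit mentioned above.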
- Beck and Teboulle, *Mirror descent and nonlinear projected subgradient methods for convex optimization*, 2003. – The paper behind the MDA; it also presents a convergence analysis and gives an example of application.
- Nemirovski, *Tutorial: mirror descent algorithms for large-scale deterministic and stochastic convex optimization*, 2012. – A deck of slides from a 2012 presentation, covering the MDA as well as applications.