Splitting methods and ADMM

In these notes, I show how some well-known methods from numerical linear algebra can be applied to convex optimisation. The aim is to give an idea of how splitting methods, proximal operators and the ADMM intertwine.

From convex optimisation to linear algebra

Decomposable problems

In machine learning, imaging and similar fields, a large portion of convex optimisation problems can be written in the form:

$$
x^\dagger \in \arg\min_x\,\, f(x) + g(x). \tag{2}
$$

This includes constrained problems, where $g \equiv \iota_C$ for a convex set $C$, or penalised problems like the LASSO regression:

$$
x^\dagger \in \arg\min_x\,\, \frac12\|Ax-b\|_2^2 + \lambda\|x\|_1.
$$

As in the previous notes, the following regularity conditions are usually assumed to hold:

  1. $f, g \in \Gamma_0$, the space of convex, proper, lower semi-continuous functions,

  2. $f, g$ are such that, on $\mathrm{dom}\, f \cap \mathrm{dom}\, g$, $\partial(f+g) = \partial f + \partial g$.

As we showed in convex analysis part 1, for $x^\dagger$ to be a minimiser, it must satisfy the first-order condition:

$$
0 \in (\partial f + \partial g)(x^\dagger),
$$

the issue being that, in most cases, we don't have a cheaply available closed-form expression for the inverse of $(\partial f + \partial g)$ (otherwise the problem would be trivial).

This issue can in fact be related to the classical problem of solving a linear system of equations:

$$
b = Ax \tag{4}
$$

where $A$ is invertible but is, for instance, too big or too poorly conditioned for its inverse to be computable cheaply and reliably.

Operator splitting methods in linear algebra

One way of attempting to solve (4) without computing the inverse of $A$ is to consider a splitting method: a decomposition of $A$ into a sum of matrices $A = B + C$ with desirable properties. Equation (4) can then be written in the form of a fixed-point equation:

$$
Bx = b - Cx.
$$

Assuming that $B$ is easier to invert than $A$, we can consider the fixed-point iteration algorithm:

$$
x_{k+1} = B^{-1}(b - Cx_k).
$$

There are two classical examples of this type of splitting:

  1. the Jacobi method, writing $A = D + R$ with $D$ diagonal and $R$ the off-diagonal part,

  2. the Gauss-Seidel method, writing $A = (L + D) + U$ with $L$ and $U$ strictly lower and upper triangular respectively.

Under some conditions, the corresponding fixed-point iterations converge (see also Ortega and Rheinboldt (2000)). For instance, Gauss-Seidel converges if $A$ is symmetric positive definite or strictly diagonally dominant.
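To make this concrete, here is a minimal NumPy sketch of the Jacobi iteration on a randomly generated, strictly diagonally dominant system; the matrix, sizes and iteration count are purely illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n))
A += np.diag(2 * np.abs(A).sum(axis=1))   # make A strictly diagonally dominant
b = rng.standard_normal(n)

# Jacobi splitting: A = D + R with D the diagonal part of A.
D = np.diag(A)            # stored as a vector since D is diagonal
R = A - np.diag(D)

x = np.zeros(n)
for _ in range(100):
    # fixed-point step x_{k+1} = D^{-1}(b - R x_k); "inverting" D is trivial
    x = (b - R @ x) / D

print(np.linalg.norm(A @ x - b))   # residual, should be near machine precision
```

Gauss-Seidel would instead solve with $B = L + D$ at each step, i.e. a triangular solve by forward substitution rather than a diagonal division.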

DPR splitting

Researchers like Douglas, Peaceman and Rachford studied this in the mid 1950s to solve linear systems arising from the discretisation of systems of partial differential equations (Peaceman and Rachford (1955), Douglas and Rachford (1956)). They came up with what is now known as the Douglas-Rachford splitting and the Peaceman-Rachford splitting.

The context is simple: assume that we have a decomposition $A = B + C$ where $B$ and/or $C$ are poorly conditioned or even singular. In that case, one can try to regularise them by writing

$$
A = (B + \alpha\mathbf{I}) + (C - \alpha\mathbf{I})
$$

for some $\alpha > 0$ which will shift the minimum singular value of $B$ and $C$ away from zero (and thereby push them towards diagonally dominant matrices). The fixed-point equation corresponding to this splitting is

$$
(B+\alpha\mathbf{I})\,x = b - (C-\alpha\mathbf{I})\,x. \tag{8}
$$

Observe that, for a suitably large $\alpha$, we could also consider the fixed point derived from (8) with the roles of $B$ and $C$ swapped, namely $(C+\alpha\mathbf{I})\,x = b - (B-\alpha\mathbf{I})\,x$. The resulting fixed-point equation is equivalent to (8), but the fixed-point iteration is not, and the DPR method suggests alternating between the two.

(DPR iterative method) Let $Ax = b$ with $A = B + C$; the DPR iterative method is given by

$$
\begin{cases}
(B+\alpha\mathbf{I})\,x_{k+1} &= b + (\alpha\mathbf{I} - C)\,z_k\\
(C+\alpha\mathbf{I})\,z_{k+1} &= b + (\alpha\mathbf{I} - B)\,x_{k+1}
\end{cases}
$$

and converges to the solution provided $\alpha$ is sufficiently large.

This method belongs to a class of methods known as alternating direction methods...
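Before moving on, here is a minimal NumPy sketch of the DPR iteration above. For the illustration, the splitting is built with $B$ and $C$ symmetric positive definite (a convenient setting in which the alternating iteration behaves well); the value of $\alpha$, the sizes and the iteration count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
M1 = rng.standard_normal((n, n))
M2 = rng.standard_normal((n, n))
B = M1 @ M1.T + np.eye(n)     # symmetric positive definite
C = M2 @ M2.T + np.eye(n)     # symmetric positive definite
A = B + C
b = rng.standard_normal(n)

alpha = 10.0
I = np.eye(n)
x = np.zeros(n)
z = np.zeros(n)
for _ in range(200):
    # (B + alpha I) x_{k+1} = b + (alpha I - C) z_k
    x = np.linalg.solve(B + alpha * I, b + (alpha * I - C) @ z)
    # (C + alpha I) z_{k+1} = b + (alpha I - B) x_{k+1}
    z = np.linalg.solve(C + alpha * I, b + (alpha * I - B) @ x)

print(np.linalg.norm(A @ x - b))   # residual of the original system, should be small
```

Each step still requires a linear solve, of course; the appeal in general is that solves with $B+\alpha\mathbf{I}$ and $C+\alpha\mathbf{I}$ can be much better behaved than a solve with $A$ itself.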

DPR splitting for the kernel

Consider now the kernel problem, i.e. finding $x$ such that $Ax = 0$ (i.e. $b = 0$), still with $A = B + C$. Let $y = Cx = -Bx$; we can then consider the following triplet of fixed-point equations:

$$
\begin{cases}
(B+\alpha\mathbf{I})\,x &= (\alpha\mathbf{I} - C)\,x + \textcolor{blue}{(Cx - y)}\\
(C+\alpha\mathbf{I})\,x &= (\alpha\mathbf{I} - B)\,x - \textcolor{blue}{(-Bx - y)}\\
\textcolor{blue}{y} &= \textcolor{blue}{Cx}
\end{cases}
$$

We can then intertwine the corresponding fixed-point iterations as follows:

$$
\begin{cases}
(B+\alpha\mathbf{I})\,x_{k+1} &= (\alpha\mathbf{I} - C)\,z_k + (Cz_k - y_k)\\
(C+\alpha\mathbf{I})\,z_{k+1} &= (\alpha\mathbf{I} - B)\,x_{k+1} - (-Bx_{k+1} - y_k)\\
\textcolor{blue}{y_{k+1}} &= Cz_{k+1}
\end{cases}
$$

Now let $u_k = y_k/\alpha$. The first line simplifies to $(B+\alpha\mathbf{I})\,x_{k+1} = \alpha z_k - y_k = \alpha(z_k - u_k)$ and the second to $(C+\alpha\mathbf{I})\,z_{k+1} = \alpha x_{k+1} + y_k = \alpha(x_{k+1} + u_k)$, so that $Cz_{k+1} = \alpha x_{k+1} + y_k - \alpha z_{k+1} = \alpha(x_{k+1} + u_k - z_{k+1})$ and hence $u_{k+1} = u_k + x_{k+1} - z_{k+1}$. This leads to an iterative method to solve $Ax = 0$ which we will show to be directly connected to the ADMM.

(DPR iterative method (2)) Let $Ax = 0$ with $A = B + C$; the DPR2 iterative method is given by

$$
\begin{cases}
(B+\alpha\mathbf{I})\,x_{k+1} &= \alpha(z_k - u_k)\\
(C+\alpha\mathbf{I})\,z_{k+1} &= \alpha(x_{k+1} + u_k)\\
u_{k+1} &= u_k + x_{k+1} - z_{k+1}
\end{cases} \tag{12}
$$

and converges to the solution provided $\alpha$ is sufficiently large.
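As a quick sanity check (not needed for the derivation, but immediate from the iteration), any fixed point $(x, z, u)$ of (12) does solve the kernel problem: the last line forces $x = z$, and the first two lines then give

$$
Bx = -\alpha u \qquad\text{and}\qquad Cx = \alpha u,
$$

so that summing them yields $Ax = (B+C)x = 0$.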

From linear algebra back to convex optimisation

Going back to problem (2), we had noted that a minimiser must be in the kernel of $(\partial f + \partial g)$:

$$
0 \in (\partial f + \partial g)(x^\dagger).
$$

Since we've just seen that splitting operators could be a good idea in linear algebra, we could be tempted to apply exactly the same approach here. But in order to do this, we need to consider the inverse of the following two operators: $(\partial f + \alpha\mathbf{I})$ and $(\partial g + \alpha\mathbf{I})$.

Proximal operators

Proximal operators can be recovered from a number of nice perspectives and are usually attributed to Moreau (see e.g. Moreau (1965)). Here we'll just cover them briefly, aiming to define the prox of a function $f$, denoted $\mathrm{prox}_f$, and to show a key result: that $\mathrm{prox}_f \equiv (\partial f + \mathbf{I})^{-1}$.

Let $x$ and $z$ be such that $z \in (\partial f + \mathbf{I})(x)$. We are interested in the inverse map or, in other words, in expressing $x$ in terms of $z$. Rearranging the equation, note that we have

$$
0 \in \partial f(x) + (x - z). \tag{14}
$$

Observe that the simple affine term $(x - z)$ can be re-expressed as the gradient of a squared $\ell^2$ norm:

$$
\partial \left[\frac12\|x-z\|_2^2\right] = x - z.
$$

Therefore, we can write (14) as

$$
0 \in \partial \left[f + \frac12\|\cdot - z\|_2^2\right](x).
$$

This can be interpreted as a first-order condition (FOC) and is equivalent to

$$
x \in \arg\min_u\, f(u) + \frac12\|u-z\|_2^2 \tag{17}
$$

which defines the prox of $f$.

For a convex function $f$, the proximal operator of $f$ at a point $z$ is defined as

$$
\mathrm{prox}_f(z) = \arg\min_u\, f(u) + \frac12\|u-z\|_2^2
$$

and is such that $\mathrm{prox}_f \equiv (\partial f + \mathbf{I})^{-1}$.

Note that $(\partial f + \alpha\mathbf{I}) = \alpha\,(\partial(\alpha^{-1}f) + \mathbf{I})$, so that $z \in (\partial f + \alpha\mathbf{I})(x)$ if and only if $z/\alpha \in (\partial(\alpha^{-1}f) + \mathbf{I})(x)$, i.e.

$$
(\partial f + \alpha\mathbf{I})^{-1}(\alpha z) = \mathrm{prox}_{\alpha^{-1}f}(z).
$$

Note also that the objective in (17) is the sum of a convex function and a strongly convex quadratic, hence strongly convex, and therefore has a unique minimiser, so that $\mathrm{prox}_{\alpha^{-1}f}$ is a well-defined (single-valued) function for any $\alpha > 0$.

Remark: it may look like we just conjured this proximal operator out of the abyss for nothing, but it turns out that the proximal operator exists in closed form for a number of important functions. Among the best-known examples are the $\ell^1$-norm, whose prox is the soft-thresholding operator, and the indicator $\iota_C$ of a convex set, whose proximal operator is the orthogonal projection onto that set.
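For concreteness, here is a minimal NumPy sketch of these two proximal operators; the box used for the projection example is just an arbitrary choice of convex set (for which the projection reduces to clipping).

```python
import numpy as np

def prox_l1(z, lam):
    """Prox of lam * ||.||_1: componentwise soft-thresholding at level lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def prox_box(z, lo=-1.0, hi=1.0):
    """Prox of the indicator of the box [lo, hi]^n: the orthogonal projection."""
    return np.clip(z, lo, hi)

z = np.array([-2.0, -0.3, 0.1, 1.5])
print(prox_l1(z, 0.5))    # soft-thresholding with threshold 0.5
print(prox_box(z))        # projection onto [-1, 1]^4
```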

ADMM

Hopefully you saw this one coming: if you take DPR2 (12) and simply replace $B$ by $\partial f$, $C$ by $\partial g$ and pepper with $\mathrm{prox}$, you get the ADMM (see e.g. Combettes and Pesquet (2011)).

(Alternating direction method of multipliers (ADMM)) The minimisation problem (2) can be tackled with the following elegant iteration:

$$
\begin{cases}
x_{k+1} &= \mathrm{prox}_{\gamma f}(z_k - u_k)\\
z_{k+1} &= \mathrm{prox}_{\gamma g}(x_{k+1} + u_k)\\
u_{k+1} &= u_k + x_{k+1} - z_{k+1}
\end{cases}
$$

which converges provided $\gamma > 0$ is small enough.

When is this helpful? A frequent scenario has $f$ complex but differentiable and $g$ simple but non-differentiable (e.g. the $\ell^1$-norm); in that case, the first prox is a differentiable problem that can be (approximately) solved using a simple, cheap first-order method, while the second prox exists in closed form. For instance, regularised maximum likelihood estimation or regularised inverse problems typically have this form.
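To tie things together, here is a minimal NumPy sketch of the ADMM iteration applied to the LASSO problem from the beginning of these notes, with $f(x) = \frac12\|Ax-b\|_2^2$ (whose prox amounts to solving a regularised normal system) and $g = \lambda\|\cdot\|_1$ (whose prox is soft-thresholding). The data, $\lambda$, $\gamma$ and the iteration count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 100, 30
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[:5] = rng.standard_normal(5)               # sparse ground truth
b = A @ x_true + 0.01 * rng.standard_normal(m)

lam, gamma = 0.1, 1.0

def prox_f(v):
    # prox_{gamma f}(v) = argmin_x 0.5||Ax - b||^2 + (1/(2 gamma))||x - v||^2,
    # i.e. solve (A'A + I/gamma) x = A'b + v/gamma (in practice: factorise once)
    return np.linalg.solve(A.T @ A + np.eye(n) / gamma, A.T @ b + v / gamma)

def prox_g(v):
    # prox_{gamma g}(v) with g = lam*||.||_1: soft-thresholding at level gamma*lam
    return np.sign(v) * np.maximum(np.abs(v) - gamma * lam, 0.0)

x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
for _ in range(200):
    x = prox_f(z - u)
    z = prox_g(x + u)
    u = u + x - z

print(np.round(z, 3))   # sparse estimate (compare with x_true)
```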

References

Proximal methods

  1. Combettes and Pesquet, Proximal splitting methods in signal processing, 2011. – A detailed review on proximal methods, accessible and comprehensive.

  2. Moreau, Proximité et dualité dans un espace hilbertien, 1965. – A wonderful seminal paper, clear and complete, a great read if you understand French (and even if you don't you should be able to follow the equations).

Linear algebra

  1. Peaceman and Rachford, The numerical solution of parabolic and elliptic differential equations, 1955.

  2. Douglas and Rachford, On the numerical solution of heat conduction problems in two and three space variables, 1956.

  3. Ortega and Rheinboldt, Iterative solution of nonlinear equations in several variables, 2000.