Convex analysis (part III)

Strict and strong convexity

Remember that a function $\psi:C\to \mathbb R$ is convex if the following inequality holds for any two $x, z \in C$ and $0<\lambda<1$:

$$\psi((1-\lambda)x+\lambda z) \;\le\; (1-\lambda)\psi(x) + \lambda \psi(z). \tag{1}$$

When that inequality holds strictly for any two $x\neq z$ in $C$, the function $\psi$ is called strictly convex. This can also be characterised in terms of the subgradient.

Let $x\in C^\circ$ and $y\in \partial \psi(x)$; then $\psi$ is strictly convex on $C^\circ$ if and only if the subgradient inequality holds strictly for all $z\neq x$:

$$\psi(z) \;>\; \psi(x) + \left\langle z-x, y\right\rangle, \quad\forall z\in C\backslash\{x\}. \tag{2}$$

This simply means that the graph of the function is strictly above its supporting hyperplanes.

We can prove this quickly by noting that the strict version of (1) can be rearranged into

$$\psi(z) \;>\; \psi(x) + \frac{\psi((1-\lambda)x+\lambda z) - \psi(x)}{\lambda}. \tag{3}$$

Now, if we let $y\in\partial \psi(x)$ and use the subgradient inequality, we can also write

$$\psi((1-\lambda)x+\lambda z) \;\ge\; \psi(x) + \lambda\left\langle z-x, y\right\rangle. \tag{4}$$

Combining (3) and (4) gives (2).

Example: $\psi(x)=x\log(x)$ is strictly convex on $\mathbb R^+$ whereas $\psi(x)=|x|$ is not.
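As a quick numerical sanity check (an illustrative snippet, not part of the analysis), we can compare the midpoint inequality for both functions: for $x\log x$ the gap between the chord and the graph is strictly positive, while for $|x|$ it vanishes whenever the two points have the same sign.

```python
import math

# Midpoint convexity check: convexity gives f(m) <= (f(x) + f(z)) / 2
# with m = (x + z) / 2; strict convexity makes the inequality strict.
def midpoint_gap(f, x, z):
    m = 0.5 * (x + z)
    return 0.5 * (f(x) + f(z)) - f(m)

xlogx = lambda t: t * math.log(t)
absval = abs

# x*log(x) is strictly convex on (0, inf): the gap is strictly positive.
print(midpoint_gap(xlogx, 1.0, 2.0))   # positive
# |x| is convex but not strictly convex: for same-sign points the
# chord lies on the graph and the gap vanishes.
print(midpoint_gap(absval, 1.0, 2.0))  # 0.0
```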

The function $\varphi:C\to \mathbb R$ is said to be $\mu$-strongly convex at $x\in C$ with parameter $\mu>0$ if the subgradient inequality holds with:

$$\varphi(z) \;\ge\; \varphi(x) + \left\langle z-x, y\right\rangle + \frac{\mu}{2}\|z-x\|_2^2, \tag{5}$$

for any $z\in C$ and $y\in\partial \varphi(x)$.

Intuitively, a strongly convex function is a strictly convex function whose graph moves away from its supporting hyperplanes sufficiently fast (at least quadratically fast).

We've now introduced three types of convex functions with increasing specificity: convex, strictly convex, and strongly convex, with each class containing the next (a strongly convex function is strictly convex, and a strictly convex function is convex).

A strongly convex function is generally easier (faster) to optimise than a strictly convex function, which is in turn easier to optimise than a merely convex one, as the additional structure can be leveraged to speed up optimisation methods.

Strict convexity and differentiability

Let $\psi:C\to \mathbb R$ be a convex function; $\psi$ is strictly convex on $C^\circ$ if and only if $\psi^\star$ is differentiable on $\partial \psi(C^\circ) = \{y\in\mathbb R^n\,|\,\exists x\in C^\circ,\ y\in\partial \psi(x)\}$.

To prove this result, we can use the link between $\partial\psi$ and $\partial\psi^{\star}$ (cf. convex analysis part 2):

$$x\in \partial \psi^\star(y)\cap C \;\Longleftrightarrow\; (y\in\partial \psi(x),\ x\in C). \tag{6}$$

$\Rightarrow$: assume $\psi^\star$ is not differentiable on $\partial\psi(C^\circ)$. Then, there is a $y$ such that $\partial \psi^\star(y)$ contains more than one point, say $x_1\neq x_2$ in $C^\circ$. But then, using the equivalence (6), we have that $y\in\partial \psi(x_1) \cap \partial \psi(x_2)$. Since $\psi$ is strictly convex and $x_1\neq x_2$, the two following strict inequalities must hold:

$$\begin{aligned} \psi(x_2)\;&>\;\psi(x_1)+\left\langle x_2-x_1, y\right\rangle,\\ \psi(x_1)\;&>\;\psi(x_2)+\left\langle x_1-x_2, y\right\rangle.\end{aligned} \tag{7}$$

But summing both leads to a contradiction.

$\Leftarrow$: assume $\psi^\star$ is differentiable and take a $y$ with $x=\nabla \psi^\star(y)$ in $C^\circ$. Let us now assume that $\psi$ is not strictly convex on $C^\circ$. Then, there exists $z\in C^\circ$ with $z\neq x$ such that $\psi(z)=\psi(x)+\left\langle z-x, y\right\rangle$, i.e., $\psi(x)=\psi(z)+\left\langle x-z,y\right\rangle$. Plugging this equality into the subgradient inequality at $x$ gives $\psi(w)\ge \psi(x)+\left\langle w-x,y\right\rangle = \psi(z)+\left\langle w-z,y\right\rangle$ for all $w\in C$, so that $y\in\partial \psi(z)$ by definition of the subgradient. This means that both $x$ and $z$ are in $\partial \psi^\star(y)$, which contradicts the differentiability of $\psi^\star$.

Bregman divergence

Consider the case where $\psi$ is strictly convex and differentiable on $C^\circ$ so that $\partial \psi(x)=\{\nabla \psi(x)\}$ for $x\in C^\circ$. In that case, we have

$$\psi(z) \;\ge\; \psi(x) + \left\langle z-x, \nabla \psi(x)\right\rangle, \quad \forall z\in C, \tag{8}$$

with equality if and only if $z=x$. We can use this to define a notion of similarity between two points $x$ and $z$ in $C$.

Let $\psi$ denote a strictly convex and differentiable function on $C^\circ$. The Bregman divergence on $C^\circ\times C^\circ$ associated with $\psi$ is defined as

$$B_\psi(z, x) \;=\; \psi(z)-\psi(x)-\left\langle z-x,\nabla \psi(x)\right\rangle. \tag{9}$$

Given (8), we have that $B_\psi(z, x)>0$ for all $z\neq x$ and $B_\psi(z, x)=0$ iff $x=z$, which shows that $B_\psi$ is a valid divergence (a positive-definite functional).

For example, if we take $\psi(x) = \frac12\left\langle x, x\right\rangle$ then

$$\begin{aligned} B_\psi(z, x) \;&=\; \frac12\left\langle z, z\right\rangle - \frac12\left\langle x, x\right\rangle - \left\langle z-x, x\right\rangle\\ &=\; \frac12\left\langle z-x, z-x\right\rangle,\end{aligned} \tag{10}$$

which is just half the squared Euclidean ($\ell^{2}$) distance between $z$ and $x$. The factor $1/2$ might seem irrelevant, but it makes other developments a bit nicer, which is why, in convex optimisation, the squared Euclidean distance is usually scaled by one half.
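The computation above can be verified numerically; the small helper below is illustrative (not from any particular library) and implements the definition of the Bregman divergence directly.

```python
# Numerical check that for psi(x) = 0.5 * <x, x> the Bregman divergence
# equals half the squared Euclidean distance.
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def bregman(psi, grad_psi, z, x):
    # B_psi(z, x) = psi(z) - psi(x) - <z - x, grad psi(x)>
    zx = [zi - xi for zi, xi in zip(z, x)]
    return psi(z) - psi(x) - dot(zx, grad_psi(x))

psi = lambda v: 0.5 * dot(v, v)
grad_psi = lambda v: v  # gradient of 0.5 * <v, v> is v

z, x = [1.0, 2.0, 3.0], [0.0, -1.0, 1.5]
diff = [zi - xi for zi, xi in zip(z, x)]
lhs = bregman(psi, grad_psi, z, x)
rhs = 0.5 * dot(diff, diff)
print(lhs, rhs)  # the two values agree
```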

Note that if the function $\varphi$ is $\mu$-strongly convex and differentiable, then the Bregman divergence associated with it is lower bounded by $\mu/2$ times the squared $\ell^{2}$-distance:

$$B_{\varphi}(z,x) \;\ge\; \frac{\mu}{2} \|z-x\|^{2}_{2}, \quad\forall x,z \in C. \tag{11}$$
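As an illustrative one-dimensional check (the example function is an assumption, not from the text): $\varphi(x)=x^4+x^2$ has $\varphi''(x)=12x^2+2\ge 2$ and is therefore $2$-strongly convex on $\mathbb R$, so its Bregman divergence should dominate $(z-x)^2$.

```python
# phi(x) = x^4 + x^2 has phi''(x) = 12 x^2 + 2 >= 2, so it is
# mu-strongly convex with mu = 2; the bound
# B_phi(z, x) >= (mu / 2) * (z - x)^2 should hold for every pair.
phi = lambda t: t**4 + t**2
dphi = lambda t: 4 * t**3 + 2 * t

def bregman_phi(z, x):
    return phi(z) - phi(x) - (z - x) * dphi(x)

mu = 2.0
pairs = [(-1.5, 0.3), (0.0, 2.0), (1.0, -1.0), (0.1, 0.2)]
ok = all(bregman_phi(z, x) >= (mu / 2) * (z - x)**2 - 1e-12
         for z, x in pairs)
print(ok)  # True
```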

Bregman divergence and convex conjugacy

Let $\psi$ be a differentiable and strictly convex function on $C^\circ$; then $\psi^\star$ is also differentiable and strictly convex on $\nabla \psi(C^\circ)$ and we can consider the Bregman divergence that it induces:

$$B_{\psi^\star}(u, v) \;=\; \psi^\star(u) - \psi^\star(v) - \left\langle u-v, \nabla\psi^\star(v)\right\rangle. \tag{12}$$

Let us consider a pair $x, z\in C^\circ$ with $u=\nabla\psi(x)$ and $v=\nabla\psi(z)$. Then, since $(x, u)$ and $(z, v)$ are dual pairs, we have that $\psi^\star(u)+\psi(x)=\left\langle x, u\right\rangle$ (and similarly for $(z, v)$) as well as $\nabla\psi^\star(v)=z$, so that $B_{\psi^\star}(u, v)$ simplifies to

$$B_{\psi^\star}(u, v) \;=\; \psi(z) - \psi(x) - \left\langle z-x, \nabla \psi(x)\right\rangle, \tag{13}$$

which proves the following result.

Let $\psi$ be a differentiable and strictly convex function on $C^\circ$; then the following equality holds:

$$B_{\psi^\star}(\nabla\psi(x), \nabla\psi(z)) \;=\; B_\psi(z, x), \tag{14}$$

for any $x,z \in C^\circ$.

Observe how the arguments get "swapped".
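This duality can be illustrated numerically with the (assumed, for illustration) example $\psi(x)=x\log x - x$ on $\mathbb R^+$, for which $\nabla\psi(x)=\log x$, $\psi^\star(u)=e^u$ and $\nabla\psi^\star(u)=e^u$:

```python
import math

# For psi(x) = x log x - x on (0, inf): grad psi(x) = log x,
# psi*(u) = exp(u) and grad psi*(u) = exp(u). We check numerically
# that B_{psi*}(grad psi(x), grad psi(z)) = B_psi(z, x).
psi = lambda t: t * math.log(t) - t
dpsi = math.log
psi_star = math.exp
dpsi_star = math.exp

def breg(f, df, b, a):
    # generic Bregman divergence B_f(b, a)
    return f(b) - f(a) - (b - a) * df(a)

x, z = 0.7, 2.3
u, v = dpsi(x), dpsi(z)
lhs = breg(psi_star, dpsi_star, u, v)  # B_{psi*}(u, v), arguments swapped
rhs = breg(psi, dpsi, z, x)            # B_psi(z, x)
print(lhs, rhs)  # the two values agree
```

Here $B_\psi(z,x)=z\log(z/x)-z+x$ is the (generalised) Kullback-Leibler divergence, so the identity links it to a Bregman divergence in the dual (log) coordinates.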

Lipschitz continuity and strong convexity

This is a useful result for proving convergence rates of some iterative minimisation methods, but it can easily be skipped on a first reading.

We will show that if $\varphi$ is a differentiable, strongly convex function on $C^\circ$ then $\nabla \varphi^\star$ is Lipschitz continuous on $\nabla\varphi(C^\circ)$. Recall that a function $\phi$ is said to be $\beta$-Lipschitz continuous on $E\subseteq \mathrm{dom}\,\phi$ if the following inequality holds $\forall u,v\in E$:

$$\|\phi(u)-\phi(v)\|_2 \;\le\; \beta \|u-v\|_2. \tag{15}$$

Using (11), we can write

$$\mu \|x-z\|^{2}_{2} \;\le\; B_{\varphi}(x,z)+B_{\varphi}(z,x), \tag{16}$$

for any $x, z\in C^\circ$. Plugging the definition of the Bregman divergence into the right-hand side and letting $u=\nabla \varphi(x)$ and $v=\nabla \varphi(z)$, we get

$$\begin{aligned} \mu\| x-z\|^{2}_{2} \;&\le\; \left\langle x-z, u-v \right\rangle \\ &\le\; \|x-z\|_2\|u-v\|_2,\end{aligned} \tag{17}$$

using the Cauchy-Schwarz inequality on the second line. Rearranging terms yields

$$\|x-z\|_{2} \;\le\; \frac{1}{\mu}\| u-v\|_2. \tag{18}$$

Since $x\in C^\circ$ and $u=\nabla \varphi(x)$, we can write $x = \nabla \varphi^\star(u)$ and, similarly, $z=\nabla\varphi^\star(v)$ (see convex analysis part 2). This shows that the gradient $\nabla\varphi^\star$ is Lipschitz continuous.

Let $\varphi$ be a differentiable, $\mu$-strongly convex function on $C^\circ$; then $\varphi^\star$ has a gradient that is $1/\mu$-Lipschitz continuous over $\nabla \varphi(C^\circ)$, i.e.:

$$\| \nabla \varphi^\star(u)-\nabla \varphi^\star(v) \|_2 \;\le\; \frac{1}{\mu}\|u-v\|_2,\quad \forall u,v\in\nabla\varphi(C^\circ), \tag{19}$$

where $\nabla \varphi (C^\circ)=\{u\in\mathbb R^n\,|\,\exists x\in C^\circ,\ u=\nabla \varphi(x)\}$.
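To illustrate with an assumed one-dimensional example (not a general proof): $\varphi(x)=x^2+\cos x$ satisfies $\varphi''(x)=2-\cos x\ge 1$ and is hence $1$-strongly convex; since $\nabla\varphi^\star$ inverts $\nabla\varphi$, the $1/\mu$-Lipschitz bound is equivalent to $|x-z| \le \frac{1}{\mu}|\nabla\varphi(x)-\nabla\varphi(z)|$, which we can check directly.

```python
import math

# phi(x) = x^2 + cos(x) has phi''(x) = 2 - cos(x) >= 1, so mu = 1.
# Since grad phi* inverts grad phi, the 1/mu-Lipschitz bound on grad phi*
# is equivalent to |x - z| <= (1 / mu) * |grad phi(x) - grad phi(z)|.
dphi = lambda t: 2 * t - math.sin(t)

mu = 1.0
pairs = [(-2.0, 0.5), (0.0, 3.0), (1.2, -0.7)]
ok = all(abs(x - z) <= (1 / mu) * abs(dphi(x) - dphi(z)) + 1e-12
         for x, z in pairs)
print(ok)  # True
```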

Short references

  1. Bregman, A relaxation method of finding a common point of convex sets and its application to the solution of problems in convex programming, 1967. – This is just for fun, it's the original paper by Bregman in 1967 and though you might find it a bit hard to read (I personally don't read Russian), you should be able to recognise equation (1.4).

  2. Goebel, Rockafellar, Local strong convexity and local Lipschitz continuity of the gradient of convex functions, 2007. – This is a more technical note covering the link between strong convexity and Lipschitz-continuity.

See also the general references mentioned in the introduction.