Remember that a function ψ:C→R is convex if the following inequality holds for any two x,z∈C and 0<λ<1:

ψ((1−λ)x+λz) ≤ (1−λ)ψ(x) + λψ(z).   (1)

When that inequality holds strictly for any two x≠z in C, the function ψ is called strictly convex. This can also be characterised in terms of the subgradient.
Let x∈C∘ and y∈∂ψ(x); ψ is strictly convex on C∘ if and only if the subgradient inequality holds strictly for all z≠x:

ψ(z) > ψ(x) + ⟨z−x, y⟩,   ∀z∈C∖{x}.   (2)

This simply means that the graph of the function is strictly above its supporting hyperplanes.
We can prove this quickly. For "⇒", note that, by strict convexity, (1) holds strictly when x≠z and can be rearranged into

ψ(z) > ψ(x) + [ψ((1−λ)x+λz) − ψ(x)]/λ.   (3)

Now, letting y∈∂ψ(x) and using the subgradient inequality, we can also write

ψ((1−λ)x+λz) ≥ ψ(x) + λ⟨z−x, y⟩.   (4)

Combining (3) and (4) gives (2). For "⇐", take x≠z, let w=(1−λ)x+λz and y∈∂ψ(w); applying the strict inequality (2) at w to both x and z and combining the two with weights (1−λ) and λ gives ψ(w) < (1−λ)ψ(x)+λψ(z), i.e. strict convexity.
Example: ψ(x) = x log(x) is strictly convex on R₊ whereas ψ(x) = |x| is not.
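For a quick numerical sanity check (a minimal sketch in Python/NumPy, with the evaluation points chosen arbitrarily), we can compare the two examples through the subgradient inequality: for x log(x) it is strict, while for |x| the graph sits exactly on one of its supporting lines:

```python
import numpy as np

# psi(x) = x*log(x): strictly convex on R_+, gradient log(x) + 1
x, z = 1.0, 2.0
lhs = z * np.log(z)                                  # psi(z)
rhs = x * np.log(x) + (z - x) * (np.log(x) + 1.0)    # psi(x) + <z - x, grad psi(x)>
print(lhs > rhs)   # True: strict inequality, as expected for a strictly convex function

# psi(x) = |x|: convex but not strictly convex; at x = 1 the subgradient is {1}
lhs = abs(z)                                         # psi(z) = 2
rhs = abs(x) + (z - x) * 1.0                         # psi(x) + <z - x, y> = 2
print(lhs > rhs)   # False: equality, the graph lies on its supporting line for z >= 0
```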
The function φ:C→R is said to be μ-strongly convex at x∈C with parameter μ>0 if the subgradient inequality holds with:

φ(z) ≥ φ(x) + ⟨z−x, y⟩ + (μ/2)∥z−x∥₂²,

for any z∈C and y∈∂φ(x).
Intuitively, a strongly convex function is a strictly convex function whose graph moves away from its supporting hyperplanes sufficiently fast (at least quadratically fast).
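As a sanity check of the definition (a sketch assuming φ(x) = ½ xᵀAx with A symmetric positive definite, in which case φ is μ-strongly convex with μ = λ_min(A); none of this is from the text), we can verify the inequality at a few random points:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[3.0, 1.0], [1.0, 2.0]])     # symmetric positive definite
mu = np.linalg.eigvalsh(A).min()           # strong convexity parameter lambda_min(A)

phi = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

for _ in range(5):
    x, z = rng.normal(size=2), rng.normal(size=2)
    lhs = phi(z)
    rhs = phi(x) + (z - x) @ grad(x) + 0.5 * mu * np.sum((z - x) ** 2)
    print(lhs >= rhs - 1e-12)              # True for every draw
```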
We've now introduced three types of convex functions with increasing specificity:
"just convex" (equal or above its supporting hyperplanes),
"strictly convex" (strictly above its supporting hyperplanes),
"strongly-convex" (moving at least quadratically fast away from its supporting hyperplanes).
A strongly convex function is generally easier (faster) to optimise than a strictly convex one, which in turn is easier than a merely convex one, as the additional structure can be leveraged to speed up optimisation methods.
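To illustrate this claim (a rough sketch: plain gradient descent with an arbitrary step size and starting point, neither of which appears in the text), compare f(x) = x², which is strongly convex, with f(x) = x⁴, which is strictly but not strongly convex around 0:

```python
# Gradient descent on f(x) = x^2 (strongly convex) vs f(x) = x^4 (strictly convex only).
x_sq, x_quartic, step = 1.0, 1.0, 0.1

for k in range(1, 51):
    x_sq      -= step * 2 * x_sq            # gradient of x^2 is 2x
    x_quartic -= step * 4 * x_quartic ** 3  # gradient of x^4 is 4x^3
    if k % 10 == 0:
        print(f"iter {k:2d}:  |x| = {abs(x_sq):.2e} (x^2)   {abs(x_quartic):.2e} (x^4)")
```

The first iterate shrinks by a constant factor at every step, whereas the second slows down dramatically as it approaches the minimiser.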
Let ψ:C→R be a convex function; then ψ is strictly convex on C∘ if and only if ψ⋆ is differentiable on

∂ψ(C∘) = {y∈Rⁿ ∣ ∃x∈C∘, y∈∂ψ(x)}.
To prove this result, we can use the link between ∂ψ and ∂ψ⋆ (cf. convex analysis part 2):

x ∈ ∂ψ⋆(y)∩C ⟺ (y∈∂ψ(x), x∈C).   (6)

⇒: assume ψ⋆ is not differentiable on ∂ψ(C∘). Then, there is a y such that ∂ψ⋆(y) contains more than one point, say x₁≠x₂ in C∘. But then, using the equivalence (6), we have y∈∂ψ(x₁)∩∂ψ(x₂). Since ψ is strictly convex and x₁≠x₂, the two following strict inequalities must hold:

ψ(x₂) > ψ(x₁) + ⟨x₂−x₁, y⟩,   ψ(x₁) > ψ(x₂) + ⟨x₁−x₂, y⟩.

Summing both, the inner products cancel and we get ψ(x₁)+ψ(x₂) > ψ(x₁)+ψ(x₂), a contradiction.
⇐: assume ψ⋆ is differentiable on ∂ψ(C∘) and, for the sake of contradiction, that ψ is not strictly convex on C∘. Then there exist x∈C∘, y∈∂ψ(x) and z≠x such that ψ(z)=ψ(x)+⟨z−x,y⟩. Combining this equality with the subgradient inequality at x gives, for any w∈C, ψ(w) ≥ ψ(x)+⟨w−x,y⟩ = ψ(z)+⟨w−z,y⟩, so that y∈∂ψ(z) by definition of the subgradient. By (6), this means that both x and z are in ∂ψ⋆(y), which contradicts the differentiability of ψ⋆ at y.
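To make the result concrete (a sketch using a conjugate pair that is not in the text: ψ(x) = x log(x) − x on R₊, whose conjugate is ψ⋆(y) = exp(y)), the conjugate of this strictly convex function is differentiable everywhere, and its gradient inverts ∇ψ, which is the equivalence (6) in the smooth case:

```python
import numpy as np

# psi(x) = x*log(x) - x on R_+ (strictly convex), with conjugate psi*(y) = exp(y).
psi       = lambda x: x * np.log(x) - x
grad_psi  = lambda x: np.log(x)          # derivative of x*log(x) - x
conj      = lambda y: np.exp(y)          # psi*(y) = sup_x {x*y - psi(x)} = exp(y)
grad_conj = lambda y: np.exp(y)          # psi* is differentiable everywhere

x = 1.7
y = grad_psi(x)
print(np.isclose(grad_conj(y), x))           # True: grad psi* inverts grad psi, cf. (6)
print(np.isclose(psi(x) + conj(y), x * y))   # True: Fenchel equality for the dual pair (x, y)
```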
Consider the case where ψ is strictly convex and differentiable on C∘ so that ∂ψ(x)={∇ψ(x)} for x∈C∘. In that case, we have

ψ(z) ≥ ψ(x) + ⟨z−x, ∇ψ(x)⟩,   ∀z∈C,   (8)

with equality if and only if z=x. We can use this to define a notion of similarity between two points x and z in C.
Let ψ denote a strictly convex and differentiable function on C∘. The Bregman divergence on C∘×C∘ associated with ψ is defined as

Bψ(z,x) = ψ(z) − ψ(x) − ⟨z−x, ∇ψ(x)⟩.

Given (8), we have that Bψ(z,x) > 0 for all z≠x and Bψ(z,x) = 0 iff x=z, which shows that Bψ is a valid divergence (a positive-definite functional).
For example, if we take ψ(x) = ½⟨x,x⟩ then

Bψ(z,x) = ½⟨z,z⟩ − ½⟨x,x⟩ − ⟨z−x, x⟩ = ½⟨z−x, z−x⟩,

which is just half the squared Euclidean (ℓ₂) distance between z and x. The factor ½ might seem irrelevant, but it makes other developments a bit nicer, which is why, in convex optimisation, the squared Euclidean distance usually appears scaled by ½.
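Here is a small sketch computing Bregman divergences numerically; the squared-norm case matches the example above, and the negative entropy ψ(x) = Σᵢ xᵢ log xᵢ (an assumption of this snippet, not discussed in the text) yields the generalised Kullback–Leibler divergence:

```python
import numpy as np

def bregman(psi, grad_psi, z, x):
    """Bregman divergence B_psi(z, x) = psi(z) - psi(x) - <z - x, grad psi(x)>."""
    return psi(z) - psi(x) - (z - x) @ grad_psi(x)

# psi(x) = 0.5 <x, x>  ->  half the squared Euclidean distance
sq, sq_grad = lambda x: 0.5 * x @ x, lambda x: x

# psi(x) = sum_i x_i log x_i (negative entropy, strictly convex on the positive orthant)
# -> the (generalised) Kullback-Leibler divergence
ent, ent_grad = lambda x: np.sum(x * np.log(x)), lambda x: np.log(x) + 1.0

z = np.array([0.2, 0.5, 0.3])
x = np.array([0.4, 0.4, 0.2])
print(bregman(sq, sq_grad, z, x), 0.5 * np.sum((z - x) ** 2))  # identical values
print(bregman(ent, ent_grad, z, x))                            # > 0
print(bregman(ent, ent_grad, x, z))                            # different: the divergence is asymmetric
```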
Note that if the function φ is μ-strongly convex and differentiable, then the Bregman divergence associated with it is lower bounded by a multiple of the squared ℓ₂ distance:

Bφ(z,x) ≥ (μ/2)∥z−x∥₂²,   ∀x,z∈C.   (11)

Let ψ be a differentiable and strictly convex function on C∘; then ψ⋆ is also differentiable and strictly convex on ∇ψ(C∘) and we can consider the Bregman divergence that it induces:

Bψ⋆(u,v) = ψ⋆(u) − ψ⋆(v) − ⟨u−v, ∇ψ⋆(v)⟩.

Let us consider a pair x,z∈C∘ with u=∇ψ(x) and v=∇ψ(z). Then, since (x,u) and (z,v) are dual pairs, we have ψ⋆(u)+ψ(x)=⟨x,u⟩ (and similarly for (z,v)) as well as ∇ψ⋆(v)=z, so that Bψ⋆(u,v) simplifies to

Bψ⋆(u,v) = ψ(z) − ψ(x) − ⟨z−x, ∇ψ(x)⟩,

which proves the following result.
Let ψ be a differentiable and strictly convex function on C∘; then the following equality holds:

Bψ⋆(∇ψ(x), ∇ψ(z)) = Bψ(z,x),

for any x,z∈C∘.
Observe how the arguments get "swapped".
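We can verify the swap numerically (a sketch, again with the negative entropy ψ(x) = Σᵢ xᵢ log xᵢ and its conjugate ψ⋆(y) = Σᵢ exp(yᵢ − 1), neither of which is taken from the text):

```python
import numpy as np

# psi(x) = sum_i x_i log x_i, with grad psi(x) = log(x) + 1,
# conjugate psi*(y) = sum_i exp(y_i - 1), with grad psi*(y) = exp(y - 1).
psi       = lambda x: np.sum(x * np.log(x))
grad_psi  = lambda x: np.log(x) + 1.0
conj      = lambda y: np.sum(np.exp(y - 1.0))
grad_conj = lambda y: np.exp(y - 1.0)

def bregman(f, grad_f, a, b):
    return f(a) - f(b) - (a - b) @ grad_f(b)

x = np.array([0.4, 0.4, 0.2])
z = np.array([0.2, 0.5, 0.3])
u, v = grad_psi(x), grad_psi(z)

# B_{psi*}(u, v) equals B_psi(z, x): note how the arguments are swapped.
print(np.isclose(bregman(conj, grad_conj, u, v), bregman(psi, grad_psi, z, x)))  # True
```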
This is a useful result to prove convergence rates of some iterative minimisation methods but can easily be skipped on a first reading.
We will show that if φ is a differentiable, strongly convex function on C∘ then ∇φ⋆ is Lipschitz continuous on ∇φ(C∘). Recall that a function ϕ is said to be β-Lipschitz-continuous on E⊆domϕ if the following inequality holds ∀u,v∈E:
∥ϕ(u) − ϕ(v)∥₂ ≤ β∥u−v∥₂.
Using (11), we can write

μ∥x−z∥₂² ≤ Bφ(x,z) + Bφ(z,x),

for any x,z∈C∘. Plugging the definition of the Bregman divergence into the right-hand side and letting u=∇φ(x) and v=∇φ(z), we get

μ∥x−z∥₂² ≤ ⟨x−z, u−v⟩ ≤ ∥x−z∥₂ ∥u−v∥₂,

using the Cauchy–Schwarz inequality for the second step. Rearranging terms yields ∥x−z∥₂ ≤ (1/μ)∥u−v∥₂.
Since x∈C∘ and u=∇φ(x), we can write x=∇φ⋆(u) and, similarly, z=∇φ⋆(v) (see convex analysis part 2). This shows that the gradient ∇φ⋆ is Lipschitz continuous.
Let φ be a differentiable, μ-strongly convex function on C∘; then φ⋆ has a gradient that is (1/μ)-Lipschitz continuous over ∇φ(C∘), i.e.:

∥∇φ⋆(u) − ∇φ⋆(v)∥₂ ≤ (1/μ)∥u−v∥₂,   ∀u,v∈∇φ(C∘),

where ∇φ(C∘) = {u∈Rⁿ ∣ ∃x∈C∘, u=∇φ(x)}.
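As a final check (a sketch assuming φ(x) = ½ xᵀAx with A symmetric positive definite chosen arbitrarily, so that μ = λ_min(A), φ⋆(u) = ½ uᵀA⁻¹u and ∇φ⋆(u) = A⁻¹u; none of this is from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[3.0, 1.0], [1.0, 2.0]])       # symmetric positive definite
mu = np.linalg.eigvalsh(A).min()             # strong convexity parameter of phi(x) = 0.5 x'Ax
grad_conj = lambda u: np.linalg.solve(A, u)  # grad phi*(u) = A^{-1} u

# Check ||grad phi*(u) - grad phi*(v)|| <= (1/mu) ||u - v|| at random points.
for _ in range(5):
    u, v = rng.normal(size=2), rng.normal(size=2)
    lhs = np.linalg.norm(grad_conj(u) - grad_conj(v))
    print(lhs <= np.linalg.norm(u - v) / mu + 1e-12)   # True for every draw
```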
Bregman, A relaxation method of finding a common point of convex sets and its application to the solution of problems in convex programming, 1967. – Just for fun: this is Bregman's original paper and, though you might find it a bit hard to read (I personally don't read Russian), you should be able to recognise equation (1.4).
Goebel, Rockafellar, Local strong convexity and local Lipschitz continuity of the gradient of convex functions, 2007. – This is a more technical note covering the link between strong convexity and Lipschitz-continuity.
See also the general references mentioned in the introduction.