GMRES & Conjugate Gradient (part II)
GMRES and Krylov subspaces
Usually in the literature, people present Krylov subspaces first and GMRES and related techniques afterwards. Here we do the opposite since we came to GMRES a different way: starting from the iterative procedure, we now connect the dots with the Krylov subspaces.
For chronology's sake, note that Krylov subspaces were introduced by A. Krylov before the 1950s, while GMRES was discussed by Y. Saad and M. Schultz in 1986 (Saad and Schultz (1986)).
Back to the present, we discussed iterative methods for solving $Ax = b$ with $A \in \mathbb{R}^{n\times n}$ non-singular and $x, b \in \mathbb{R}^n$:

$$ x_k \;=\; x_0 + \sum_{i=1}^{k} \beta_i p_i \tag{1} $$

for some linearly independent directions $p_i$ and coefficients $\beta_i$. One of the choices of directions is $p_k = r_{k-1}$, where $r_j := Ax_j - b$ denotes the residual, and it corresponds to the classical GMRES algorithm.

Observe that, for this choice, we have $p_1 = r_0$ and

$$ p_{k+1} \;=\; r_k \;=\; (I + \beta_k A)(I + \beta_{k-1} A)\cdots(I + \beta_1 A)\, r_0. $$

Expanding that product, we could find some coefficients $\alpha_0, \dots, \alpha_k$ such that

$$ p_{k+1} \;=\; \sum_{i=0}^{k} \alpha_i A^i r_0. $$

The coefficients $\alpha_i$ don't matter; what does matter is that this is a linear combination of the vectors $A^i r_0$ for $i = 0, \dots, k$, which span the Krylov subspace $\mathcal{K}_{k+1}(A, r_0)$ with

$$ \mathcal{K}_{k+1}(A, r_0) \;:=\; \mathrm{span}\{r_0, A r_0, \dots, A^{k} r_0\}; $$

we will just write $\mathcal{K}_{k+1}$ when $A$ and $r_0$ are obvious from the context.

Since the $p_i$ (and therefore the $r_{i-1}$) each live in $\mathcal{K}_i \subseteq \mathcal{K}_k$ for $i \le k$, the $x_k$ given by (1) are also in $x_0 + \mathcal{K}_k$. The iteration can then be interpreted as iteratively attempting to represent $x^\star := A^{-1}b$ in the successive affine subspaces $x_0 + \mathcal{K}_k$.

Standard GMRES corresponds to the choice $p_k = r_{k-1}$ with the minimum residual criterion (cf. previous post). Overall, this can now be re-expressed as:

$$ x_k \;=\; \arg\min_{x \,\in\, x_0 + \mathcal{K}_k(A,\, r_0)} \|Ax - b\|_2. $$
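To make the connection concrete, here is a minimal sketch (the helpers krylov_basis and gmres_naive are our own names, not part of the implementation used later in this post) that forms an explicit orthonormal basis of $\mathcal{K}_k(A, r_0)$ and solves the minimum residual problem over $x_0 + \mathcal{K}_k$ directly. It is numerically fragile for larger $k$, which is precisely why GMRES orthogonalises as it goes, but it mirrors the reformulation above.

using LinearAlgebra

# Orthonormal basis of K_k(A, r0) = span{r0, A r0, ..., A^(k-1) r0},
# built naively by stacking the Krylov vectors and orthonormalising them.
function krylov_basis(A, r0, k)
    K = zeros(length(r0), k)
    K[:, 1] = r0
    for i in 2:k
        K[:, i] = A * K[:, i-1]
    end
    return Matrix(qr(K).Q)      # thin Q factor: an n × k orthonormal basis
end

# Minimum-residual iterate over x0 + K_k: x_k = x0 + V*y with y minimising
# the Euclidean norm of A*(x0 + V*y) - b (dense least squares, demo only).
function gmres_naive(A, b, x0, k)
    r0 = A * x0 - b
    V = krylov_basis(A, r0, k)
    y = (A * V) \ (-r0)
    return x0 + V * y
end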
Petrov-Galerkin condition and FOM
In the previous post, we mostly discussed the choice of directions $p_k$ and not the criterion for choosing the approximation $x_k$, which we fixed to the minimum residual criterion.
A different criterion that can be considered is the Petrov-Galerkin condition, requiring the residual $r_k$ to be orthogonal to the space spanned by the directions $p_1, \dots, p_k$. This condition leads to the Full Orthogonalisation Method or FOM (Saad (2003)).
For generic matrices, this method does not perform very well, but for specific matrices, it leads to the very effective Conjugate Gradient method, as will be seen further below.
If we denote by $q_1, \dots, q_k$ the orthonormal basis corresponding to the $p_1, \dots, p_k$, the Petrov-Galerkin condition amounts to $\langle r_k, q_i\rangle = 0$ for $i = 1, \dots, k$ which, in matrix form, reads

$$ Q_k^T r_k \;=\; 0, $$

where $Q_k$ is the matrix with columns $q_1, \dots, q_k$. Following the same approach as in the discussion of the minres criterion, we can iteratively form $\tilde Q_k$ (orthonormal) and $R_k$ (upper triangular) such that $A Q_k = \tilde Q_k R_k$. With $x_k = x_0 + Q_k \beta_k$ for some $\beta_k \in \mathbb{R}^k$ to determine, so that $r_k = r_0 + \tilde Q_k R_k \beta_k$, the condition then reads

$$ (Q_k^T \tilde Q_k R_k)\, \beta_k \;=\; -\, Q_k^T r_0. $$

Like in the minres case, this is just a small linear problem in $\beta_k$ and some tricks can be applied to solve it cheaply. We won't dive much deeper though as, in the general case, FOM behaves much worse than GMRES, as is illustrated below in a rough benchmark.
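For comparison with the sketch of the previous section, the Petrov-Galerkin iterate can be written in the same naive style (reusing the hypothetical krylov_basis helper from above): with $x_k = x_0 + V y$ and $V$ an orthonormal basis of $\mathcal{K}_k$, the condition $V^T r_k = 0$ gives a small $k \times k$ system.

# Sketch of the FOM / Petrov–Galerkin iterate over x0 + K_k.
# With r_k = r0 + A*V*y, requiring V' * r_k = 0 gives (V' A V) y = -V' r0.
function fom_naive(A, b, x0, k)
    r0 = A * x0 - b
    V = krylov_basis(A, r0, k)
    y = (V' * A * V) \ (-(V' * r0))
    return x0 + V * y
end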
(Rough) implementation
Most of the code here is drawn from the previous post; we simplify the resolution of the least-squares problem to get the $\beta_k$ by using the generic backslash operator (we shouldn't, but it makes the code simpler). The main goal here is to get a quick comparison of using the minres vs the Petrov-Galerkin criterion.
using LinearAlgebra, StableRNGs
# Modified Gram–Schmidt step: orthogonalise v against the first k-1 columns
# of Q, store the normalised result in Q[:, k] and the coefficients in R[:, k].
function mgs!(Q::Matrix, R::Matrix, v::Vector, k::Int)
    n = length(v)
    Q[:, k] .= v
    for i = 1:k-1
        R[i, k] = dot(Q[:, k], Q[:, i])
        Q[:, k] .-= R[i, k] .* Q[:, i]
    end
    Q[:, k] = normalize(Q[:,k])
    R[k, k] = dot(v, Q[:, k])
    return
end
# Generic iterative solver from the previous post: at each step a new direction
# pk is picked (:rand, :krylov or :grad), orthogonalised with mgs!, and the
# coefficients βk are obtained with either the minimum-residual (:minres) or
# the Petrov–Galerkin (:fom) criterion.
function itersolve(A::Matrix, b::Vector, x0::Vector;
                   niter::Int=10, dirs=:rand, criterion=:minres)
    start = time()
    n = size(A, 1)
    r0 = A * x0 - b
    xk = copy(x0)
    rk = copy(r0)
    pk = zeros(n)
    Q = zeros(n, niter)
    Q̃ = zeros(n, niter)
    R = zeros(niter, niter)
    M = zeros(niter, niter)
    norm_rk = zeros(niter + 1)
    norm_rk[1] = norm(r0)
    times = zeros(niter+1)
    times[1] = time() - start
    #@views for k = 1:niter
    for k = 1:niter
        if dirs == :rand
            pk = randn(n)
        elseif dirs == :krylov
            pk = rk
        elseif dirs == :grad
            pk = A' * rk
        end
        mgs!(Q, R, pk, k)
        mgs!(Q̃, R, A*Q[:, k], k)
        if criterion == :minres
            γk = -Q̃[:, 1:k]' * r0
            βk = R[1:k, 1:k] \ γk                                   # (S1)
        elseif criterion == :fom
            # compute M = (Q'Q̃) iteratively
            if k > 1
                M[1:k-1, k] .= Q[:, 1:k-1]' * Q̃[:, k]
                M[k, 1:k-1] .= vec(Q[:, k]' * Q̃[:, 1:k-1])
            end
            M[k, k] = dot(Q[:, k], Q̃[:, k])
            # solve (Q'Q̃R)β ≈ (-Q'r0)
            βk = (M[1:k, 1:k] * R[1:k, 1:k]) \ (-Q[:, 1:k]' * r0)   # (S2)
        end
        xk .= x0 .+ Q[:, 1:k] * βk
        rk .= A * xk .- b
        norm_rk[k+1] = norm(rk)
        times[k+1] = time() - start
    end
    return xk, norm_rk, times
end
itersolve (generic function with 1 method)
As usual, the code above is not optimised; in particular, at the places marked (S1) and (S2), a dedicated procedure should be used rather than the generic backslash operator.
Our way to compute the coefficients corresponding to the FOM criterion is very crude, but our goal here is simply to have a quick comparison with the other criterion.
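To give an idea of what a first improvement at (S1) could look like: the factor $R_k$ there is upper triangular by construction, so one could at least tell Julia about that structure and have the system solved by back-substitution rather than by a generic factorisation (a minimal tweak, not a proper fix):

# Since R[1:k, 1:k] is upper triangular by construction, wrapping it lets
# Julia solve the system by back-substitution:
βk = UpperTriangular(R[1:k, 1:k]) \ γk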
using PyPlot
rng = StableRNG(908)
n = 100
A = randn(rng, n, n)^2 .+ 5 * rand(rng, n, n)
x = randn(rng, n)
b = A * x
x0 = randn(rng, n)
r0 = A * x0 - b
figure(figsize=(8, 6))
for criterion in (:minres, :fom)
    for dirs in (:rand, :grad, :krylov)
        res = itersolve(A, b, x0; niter=n, dirs=dirs, criterion=criterion)
        semilogy(res[2], label="$(dirs)_$(criterion)")
    end
end
xlabel("Number of iterations")
ylabel("Norm of the residuals")
legend()
As can be seen on the graph above, FOM can have a much less smooth convergence, and GMRES is usually preferred to it (at least to the naive form of FOM presented above).
Conjugate gradient method
While FOM typically underperforms GMRES-style updates, in the case where $A$ is symmetric positive definite (SPD) it can be modified to lead to a very useful and efficient algorithm: the conjugate gradient method.
For the rest of this section, $A$ is assumed to be SPD, i.e. $A = A^T$ and $\langle x, Ax\rangle > 0$ for any $x \neq 0$. Recall that if $A$ is SPD it induces an inner product on $\mathbb{R}^n$ with

$$ \langle u, v\rangle_A \;:=\; \langle u, Av\rangle. $$

Indeed, it is trivially bilinear, symmetric since $\langle u, Av\rangle = \langle A^T u, v\rangle = \langle Au, v\rangle = \langle v, Au\rangle$, and positive definite since $A$ is positive definite. Correspondingly, we can define a norm $\|u\|_A := \sqrt{\langle u, u\rangle_A}$.
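In code, the $A$-inner product and the associated norm are one-liners; the sketch below (with our own names inner_A and norm_A) is only meant to fix ideas.

using LinearAlgebra

# Inner product and norm induced by an SPD matrix A.
inner_A(u, v, A) = dot(u, A * v)
norm_A(u, A) = sqrt(inner_A(u, u, A))

# Quick sanity check on a random SPD matrix:
B = randn(5, 5); S = B'B + I
u, v = randn(5), randn(5)
inner_A(u, v, S) ≈ inner_A(v, u, S)   # symmetry
norm_A(u, S) > 0                      # positive definiteness (u ≠ 0)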
Petrov-Galerkin condition again
Let's start with the basic iteration again, assuming we have an iteratively constructed set of linearly independent directions $p_1, \dots, p_k$ which we express in some basis $u_1, \dots, u_k$. Then, the iteration amounts to

$$ x_k \;=\; x_{k-1} + \beta_k u_k $$

with the $\beta_k$ such that the residual $r_k$ is perpendicular to the space spanned by the $p_i$ (and so by the $u_i$), i.e.:

$$ \langle r_k, u_i\rangle \;=\; 0, \qquad i = 1, \dots, k. $$

Now since $r_k = r_{k-1} + \beta_k A u_k$ and, by construction, $r_{k-1}$ must be orthogonal to the $u_i$ for $i < k$, the condition becomes

$$ \beta_k \langle A u_k, u_i\rangle \;=\; 0 \;\; (i < k), \qquad \langle r_{k-1}, u_k\rangle + \beta_k \langle A u_k, u_k\rangle \;=\; 0. $$

In the case where $A$ is SPD, we can construct a basis $u_1, \dots, u_k$ that makes the above condition particularly simple to verify. Before doing that, we give an alternative interpretation of the iterations using the PG condition.
Correspondence between Petrov-Galerkin and $A$-projections
With the choice $p_k = r_{k-1}$, the $u_i$ span the Krylov subspace $\mathcal{K}_k$, and the orthogonality condition requires that $r_k$ be orthogonal to $\mathcal{K}_k$. When $A$ is SPD, we show below that the iterations with the PG condition are equivalent to the following iterations:

$$ x_k \;=\; \arg\min_{x \,\in\, x_0 + \mathcal{K}_k} \|x - x^\star\|_A, $$

where $x^\star := A^{-1}b$ and $\mathcal{K}_k := \mathcal{K}_k(A, r_0)$. In other words, a sequence of projections of $x^\star$ onto the affine spaces $x_0 + \mathcal{K}_k$ in the geometry induced by $A$ ("$A$-projections").
In order to show the equivalence, let's expand the $A$-norm and drop the term that doesn't depend on $x$:

$$ \|x - x^\star\|_A^2 \;=\; \langle x, Ax\rangle - 2\langle x, Ax^\star\rangle + \langle x^\star, Ax^\star\rangle, $$

so that, since $Ax^\star = b$, minimising $\|x - x^\star\|_A$ over $x_0 + \mathcal{K}_k$ amounts to minimising $f(x) := \tfrac12\langle x, Ax\rangle - \langle x, b\rangle$.
This is a convex minimisation problem since the objective is convex (as $A$ is SPD) over a convex set (convex since it's an affine subspace). We can thus consider the first-order optimality condition (FOC) of the problem. The gradient of the objective function is $\nabla f(x) = Ax - b$ and the FOC at the minimiser $x_k$ is:

$$ -(A x_k - b) \;\in\; N_{x_0 + \mathcal{K}_k}(x_k), $$

where the right-hand side is the normal cone to $x_0 + \mathcal{K}_k$ at $x_k$. This amounts to requiring

$$ \langle A x_k - b,\, x - x_k\rangle \;\ge\; 0 \qquad \text{for all } x \in x_0 + \mathcal{K}_k. $$

Since both $x$ and $x_k$ are in $x_0 + \mathcal{K}_k$, we can let $x = x_k + z$ and require $\langle A x_k - b, z\rangle \ge 0$ for any $z \in \mathcal{K}_k$. For this to always hold (both $z$ and $-z$ are in $\mathcal{K}_k$), we must have $A x_k - b = A(x_k - x^\star)$ be orthogonal to $\mathcal{K}_k$, or:

$$ \langle x_k - x^\star, z\rangle_A \;=\; 0 \qquad \text{for all } z \in \mathcal{K}_k. $$

Equivalently, $\langle A x_k - b, z\rangle = 0$ for all $z \in \mathcal{K}_k$ or

$$ \langle r_k, u_i\rangle \;=\; 0, \qquad i = 1, \dots, k, $$

with $r_k = A x_k - b$ and $\mathcal{K}_k = \mathrm{span}\{u_1, \dots, u_k\}$; the above equation is equivalent to requiring $r_k$ to be perpendicular to the $u_i$, which is the Petrov-Galerkin condition.
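As a quick numerical sanity check of this equivalence (purely illustrative, reusing the hypothetical krylov_basis helper sketched earlier), we can compute the $A$-projection of $x^\star$ onto $x_0 + \mathcal{K}_k$ by solving the corresponding small system and verify that the resulting residual is indeed orthogonal to the Krylov basis:

using LinearAlgebra, StableRNGs

let rng = StableRNG(1), n = 50, k = 10
    B = randn(rng, n, n); S = B'B + I       # SPD test matrix
    xstar = randn(rng, n); b = S * xstar
    x0 = randn(rng, n); r0 = S * x0 - b
    V = krylov_basis(S, r0, k)
    # A-projection of xstar onto x0 + span(V): minimising the A-norm of
    # x0 + V*y - xstar over y gives (V'SV) y = V'S(xstar - x0) = -V'r0.
    y = (V' * S * V) \ (-(V' * r0))
    xk = x0 + V * y
    norm(V' * (S * xk - b))                 # ≈ 0: Petrov–Galerkin condition
end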
$A$-orthonormal basis
Let us introduce the notation $\perp_A$, similar to our previous $\perp$, with $u \perp_A v$ meaning $\langle u, v\rangle_A = 0$ for nonzero $u$ and $v$. Then, in much the same way that we can use Gram-Schmidt to produce an orthonormal basis $q_1, \dots, q_k$ out of a set of linearly independent vectors $p_1, \dots, p_k$ with

$$ q_k \;\propto\; p_k - \sum_{i<k}\langle p_k, q_i\rangle\, q_i $$

and $q_i \perp q_j$ for $i \neq j$, we can produce an $A$-orthonormal basis $u_1, \dots, u_k$ with

$$ u_k \;\propto\; p_k - \sum_{i<k}\langle p_k, u_i\rangle_A\, u_i $$

and $u_i \perp_A u_j$ for $i \neq j$ as well as $\|u_i\|_A = 1$.
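A direct, unoptimised translation of this $A$-weighted Gram-Schmidt could look as follows (the name gram_schmidt_A is ours); it is only meant to illustrate the definition, the efficient recurrence comes next.

# A-orthonormalise the (linearly independent) columns of P:
# returns U with span(U) == span(P) and U' * A * U close to the identity.
function gram_schmidt_A(P::Matrix, A::Matrix)
    n, k = size(P)
    U = zeros(n, k)
    for j in 1:k
        u = P[:, j]
        for i in 1:j-1
            u -= dot(P[:, j], A * U[:, i]) .* U[:, i]   # remove A-projections
        end
        U[:, j] = u / sqrt(dot(u, A * u))               # A-normalise
    end
    return U
end

For an SPD matrix S, U = gram_schmidt_A(P, S) should then be numerically $A$-orthonormal, i.e. U' * S * U is close to the identity.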
Working in an $A$-orthonormal basis
If we have iteratively constructed an $A$-orthonormal basis $u_1, \dots, u_k$, then the first part of the condition above holds automatically (since $\langle A u_k, u_i\rangle = \langle u_k, u_i\rangle_A = 0$ for $i < k$) and, using $\langle A u_k, u_k\rangle = \|u_k\|_A^2 = 1$, verifying it simply requires

$$ \beta_k \;=\; -\,\langle r_{k-1}, u_k\rangle. $$

This is already fairly nice. Even nicer though is that constructing the $A$-orthonormal basis with the "Krylov" direction choice (i.e. $p_{k+1} = r_k$) can be done with a simple recurrence, as shown below.
The Gram-Schmidt step to obtain $u_{k+1}$ when $p_{k+1} = r_k$ is

$$
\begin{aligned}
u_{k+1} \;&\propto\; r_k - \sum_{i=1}^{k}\langle r_k, u_i\rangle_A\, u_i
\;=\; r_k - \langle r_k, A u_k\rangle\, u_k \\
&=\; r_{k-1} + \beta_k A u_k - \langle r_k, A u_k\rangle\, u_k \\
&=\; \langle r_{k-1}, A u_{k-1}\rangle\, u_{k-1} + \big(\eta_k - \langle r_k, A u_k\rangle\big)\, u_k + \beta_k A u_k \\
&=\; -\,\beta_k \langle A u_k, A u_{k-1}\rangle\, u_{k-1} + \big(\eta_k - \langle r_k, A u_k\rangle\big)\, u_k + \beta_k A u_k.
\end{aligned}
$$

In this development, we

- expanded $r_k = r_{k-1} + \beta_k A u_k$ and made the un-normalised expression for $u_k$ appear with $\eta_k$ the normalising constant, i.e. $r_{k-1} = \eta_k u_k + \langle r_{k-1}, A u_{k-1}\rangle\, u_{k-1}$,
- used that the sums only have their last term non-zero (see below); for the last line we also used that $\langle r_k, A u_{k-1}\rangle = \langle r_{k-1}, A u_{k-1}\rangle + \beta_k\langle A u_k, A u_{k-1}\rangle = 0$ since $A u_{k-1} \in \mathcal{K}_k$ and $r_k \perp \mathcal{K}_k$, a form which is convenient in practice because $r_{k-1}$ is no longer available once the residual has been updated.

With the choice $p_{k+1} = r_k$, we showed earlier that the vectors $u_1, \dots, u_k$ span the Krylov subspace $\mathcal{K}_k(A, r_0)$.
By definition of the Krylov subspace, $A u_i \in \mathcal{K}_{i+1} \subseteq \mathcal{K}_k$ for $i \le k-1$ (and $\mathcal{K}_{i+1} \subseteq \mathcal{K}_{k-1}$ for $i \le k-2$).
Since the Petrov-Galerkin condition gives $\langle r_k, z\rangle = 0$ for every $z \in \mathcal{K}_k$, all terms in the sum $\sum_{i\le k}\langle r_k, A u_i\rangle u_i$ vanish apart from the last one, as announced; the same argument one step earlier leaves only the last term of the sum defining $u_k$, which gives the expression of $r_{k-1}$ used above.
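The vanishing of these terms can easily be checked numerically. The small, self-contained experiment below (purely illustrative) runs a few Petrov-Galerkin steps with the full $A$-weighted Gram-Schmidt, without any recurrence, and prints $\max_{i \le k-2} |\langle r_{k-1}, A u_i\rangle|$, which should stay at roundoff level.

using LinearAlgebra, StableRNGs

let rng = StableRNG(7), n = 50, K = 8
    B = randn(rng, n, n); S = B'B + I       # SPD test matrix
    b = S * randn(rng, n)
    x = randn(rng, n)
    r = S * x - b                           # r_{k-1} at the start of step k
    U = zeros(n, K)
    for k in 1:K
        if k >= 3
            # terms i ≤ k-2 of the Gram–Schmidt sum should all vanish
            println(maximum(abs, (S * U[:, 1:k-2])' * r))
        end
        u = copy(r)                         # p_k = r_{k-1}
        for i in 1:k-1
            u .-= dot(r, S * U[:, i]) .* U[:, i]
        end
        U[:, k] = u ./ sqrt(dot(u, S * u))  # A-normalise
        β = -dot(r, U[:, k])
        x .+= β .* U[:, k]
        r .+= β .* (S * U[:, k])
    end
end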
Implementation
In the simple implementation below, we use three pairs of vectors in $\mathbb{R}^n$ for storage ($u_1, Au_1$, $u_2, Au_2$ and $u, Au$) as well as $x$ and $r$ (we could even reduce that). The important point is that the recurrence relation allows us to compute the vectors $u_k$ iteratively without having to keep all of them in memory (unlike GMRES-style iterations, which have linearly growing memory requirements).
# Conjugate gradient via the recurrence derived above (A is assumed SPD).
function conjgrad(A::Matrix, b::Vector, x0::Vector; niter=10)
    x = copy(x0)
    r = A * x .- b
    norm_rk = zeros(niter+1)
    norm_rk[1] = norm(r)
    # We need to compute the first two steps
    # before starting to use the recurrence
    u1 = r
    Au1 = A * u1
    η1 = sqrt(dot(u1, Au1))
    u1 /= η1
    Au1 /= η1
    α = -dot(r, u1)
    x .+= α .* u1
    r .+= α .* Au1
    norm_rk[2] = norm(r)
    u2 = r .- dot(r, Au1) .* u1
    Au2 = A * u2
    η2 = sqrt(dot(u2, Au2))
    u2 /= η2
    Au2 /= η2
    α = -dot(r, u2)
    x .+= α .* u2
    r .+= α .* Au2
    norm_rk[3] = norm(r)
    u = copy(u2)
    Au = copy(Au2)
    for k = 1:niter-2
        # Recurrence relation
        u .= -(α * dot(Au2, Au1)) .* u1 .+ (η2 - dot(r, Au2)) .* u2 .+ (α .* Au2)
        Au .= A * u
        η = sqrt(dot(u, Au))
        u ./= η
        Au ./= η
        α = -dot(r, u)
        x .+= α .* u
        r .+= α .* Au
        η2 = η
        u1 .= u2
        Au1 .= Au2
        u2 .= u
        Au2 .= Au
        norm_rk[3+k] = norm(r)
    end
    return x, norm_rk
end
In the implementation above, the step with dominant complexity is the matrix-vector multiplication to compute $Au$; everything else has negligible complexity in comparison.
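To also make the memory point from earlier concrete, here is a rough back-of-the-envelope comparison (illustrative numbers only): the GMRES-style itersolve above stores two $n \times \mathrm{niter}$ bases, while conjgrad keeps a fixed handful of length-$n$ vectors regardless of the number of iterations.

# Rough storage estimate in MB for Float64 entries (illustrative only).
let n = 10_000, niter = 200
    gmres_mb = 2 * n * niter * 8 / 1e6   # Q and Q̃ in itersolve
    cg_mb = 8 * n * 8 / 1e6              # x, r, u1, Au1, u2, Au2, u, Au in conjgrad
    (gmres_mb, cg_mb)                    # (32.0, 0.64)
end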
Comparison
In the example below, we construct a simple and well-conditioned SPD system, and compare the various methods discussed so far. Again, this is not meant as a thorough benchmark but rather as a way to make a couple of interesting observations.
using IterativeSolvers
begin
rng = StableRNG(111)
# construct a simple SPD matrix
n = 100
A = randn(rng, n, n)
A = A'A + 2I
κA = round(cond(A), sigdigits=3)
println("Condition number κ(A): $κA")
x = randn(rng, n)
b = A * x
x0 = randn(rng, n)
xcg, nrk = conjgrad(A, b, copy(x0); niter=n)
res_krylov = itersolve(A, b, x0; niter=n, dirs=:krylov, criterion=:minres)
res_grad = itersolve(A, b, x0; niter=n, dirs=:grad, criterion=:minres)
res_fom_k = itersolve(A, b, x0; niter=n, dirs=:krylov, criterion=:fom)
res_fom_g = itersolve(A, b, x0; niter=n, dirs=:grad, criterion=:fom)
xcg, lcg = cg!(copy(x0), A, b; maxiter=n, log=true, reltol=0)
ncg = [norm(A*x0-b), lcg.data[:resnorm]...]
figure(figsize=(8, 6))
semilogy(nrk, label="CG", marker="v")
semilogy(res_krylov[2], label="GMRES (krylov)")
semilogy(res_grad[2], label="GMRES (grad)")
semilogy(res_fom_k[2], label="FOM (krylov)", linestyle="--")
semilogy(res_fom_g[2], label="FOM (grad)", linestyle="--")
semilogy(ncg, label="CG (IterativeSolvers)", marker="x")
xlabel("Number of iterations")
ylabel("Norm of the residuals")
legend()
end
Condition number κ(A): 194.0
On the graph above we can see a few things:

- our implementation of CG gives essentially the same results as that of the IterativeSolvers package, which gives some confidence in the correctness of conjgrad,
- GMRES and FOM with Krylov directions have a very similar behaviour and are better than CG for a large number of iterations,
- GMRES and FOM with "grad" directions have a very similar behaviour and are quite a bit worse than the other two.
Recall that all these methods have the same per-step complexity (dominated by one matrix-vector product), with the exception of the ones using the grad direction (two matrix-vector products).
The first point above is the most important one here; the other two help highlight that it can be difficult to compare iterative methods in general. Consider for instance a very similar example but with a much worse condition number:
begin
rng = StableRNG(111)
# construct a simple SPD matrix
n = 100
A = randn(rng, n, n)
A = (A'A + 2I)^4
κA = round(cond(A), sigdigits=3)
println("Condition number κ(A): $κA")
x = randn(rng, n)
b = A * x
x0 = randn(rng, n)
xcg, nrk = conjgrad(A, b, copy(x0); niter=n)
res_krylov = itersolve(A, b, x0; niter=n, dirs=:krylov, criterion=:minres)
res_grad = itersolve(A, b, x0; niter=n, dirs=:grad, criterion=:minres)
res_fom_k = itersolve(A, b, x0; niter=n, dirs=:krylov, criterion=:fom)
res_fom_g = itersolve(A, b, x0; niter=n, dirs=:grad, criterion=:fom)
xcg, lcg = cg!(copy(x0), A, b; maxiter=n, log=true, reltol=0)
ncg = [norm(A*x0-b), lcg.data[:resnorm]...]
figure(figsize=(8, 6))
semilogy(nrk, label="CG", marker="v")
semilogy(res_krylov[2], label="GMRES (krylov)")
semilogy(res_grad[2], label="GMRES (grad)")
semilogy(res_fom_k[2], label="FOM (krylov)", linestyle="--")
semilogy(res_fom_g[2], label="FOM (grad)", linestyle="--")
semilogy(ncg, label="CG (IterativeSolvers)", marker="x")
xlabel("Number of iterations")
ylabel("Norm of the residuals")
legend()
end
Condition number κ(A): 1.42e9
So in this case CG does poorly whereas the other methods still eventually converge (though they require a large number of steps).
What we didn't discuss
In these two posts, the main goal was to re-obtain GMRES and CG from scratch and in a constructive fashion. There are, however, quite a few interesting things that we didn't discuss in order to keep the presentation reasonably uncluttered; these include:
how these iterative methods relate to the Hessenberg decomposition, where $A = QHQ^T$ for an orthonormal matrix $Q$ and a Hessenberg matrix $H$ (a matrix with $H_{ij} = 0$ for $i > j+1$),
how the iterative orthogonalisation procedure (Modified Gram-Schmidt) in the case of Krylov subspaces connects to methods for obtaining eigenvalues (the Arnoldi iteration and, in the symmetric case, the Lanczos algorithm) (see e.g. Strang (2006)),
how these iterative methods can suffer from stalling and how to deal with it, and what kind of guarantees they can offer,
the use of preconditioning to speed up convergence,
extensions such as BiCGStab van der Vorst (1992).
Short references
Saad and Schultz, GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems, 1986. – The original GMRES paper.
Saad, Iterative Methods for Sparse Linear Systems, 2003. – A reference book on iterative methods by one of the authors of GMRES.
van der Vorst, Bi-CGSTAB: A Fast and Smoothly Converging Variant of Bi-CG for the Solution of Nonsymmetric Linear Systems, 1992. – The original BiCGStab paper, unfortunately not in open access.
van der Vorst and Chan, Linear System Solvers: Sparse Iterative Methods, 1997. – A book chapter discussing iterative methods and, in particular, CG, GMRES etc.
Strang, Krylov Subspaces and Conjugate Gradients, 2006. – Lecture notes from Gil Strang on the topic.