GMRES & Conjugate Gradient (part II)
GMRES and Krylov subspaces
Usually in the literature, people present Krylov subspaces first and GMRES and related techniques afterwards. Here we do the opposite since we came to GMRES a different way: starting from the iterative procedure, we now connect the dots with the Krylov subspaces.
For chronology's sake, note that Krylov subspaces were introduced by A. Krylov before the 1950s, while GMRES was discussed by Y. Saad and M. Schultz in 1986 (Saad and Schultz (1986)).
Back to the present, we discussed iterative methods for solving $Ax = b$ with $A \in \mathbb{R}^{n\times n}$ non-singular and $x, b \in \mathbb{R}^n$:

$$ x_k \;=\; x_0 + \sum_{i=1}^{k} \beta_i p_i \tag{1} $$

for some linearly independent directions $p_i$ and coefficients $\beta_i$. One of the choices of directions is $p_k = r_{k-1}$, where $r_j := Ax_j - b$ denotes the residual, and it corresponds to the classical GMRES algorithm.

Observe that, for this choice, we have $p_1 = r_0$ and

$$ p_{k+1} \;=\; r_k \;=\; (I + \beta_k A)(I + \beta_{k-1} A)\cdots(I + \beta_1 A)\, r_0. $$

Expanding that product, we could find some coefficients $\alpha_0, \dots, \alpha_k$ such that

$$ p_{k+1} \;=\; \sum_{i=0}^{k} \alpha_i A^i r_0. $$

The coefficients $\alpha_i$ don't matter; what does matter is that this is a linear combination of the vectors $A^i r_0$ for $i = 0, \dots, k$, which span the Krylov subspace $\mathcal{K}_{k+1}(A, r_0)$ with

$$ \mathcal{K}_{k+1}(A, r_0) \;:=\; \mathrm{span}\{r_0, A r_0, \dots, A^{k} r_0\}; $$

we will just write $\mathcal{K}_{k+1}$ when $A$ and $r_0$ are obvious from the context.

Since the $p_i$ (and therefore the $r_{i-1}$) each live in $\mathcal{K}_i \subseteq \mathcal{K}_k$ for $i \le k$, the $x_k$ given by (1) are also in $x_0 + \mathcal{K}_k$. The iteration can then be interpreted as iteratively attempting to represent $x^\star := A^{-1}b$ in the successive affine subspaces $x_0 + \mathcal{K}_k$.

Standard GMRES corresponds to the choice $p_k = r_{k-1}$ with the minimum residual criterion (cf. previous post). Overall, this can now be re-expressed as:

$$ x_k \;=\; \arg\min_{x \,\in\, x_0 + \mathcal{K}_k(A,\, r_0)} \|Ax - b\|_2. $$
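To make the connection concrete, here is a minimal sketch (the helpers krylov_basis and gmres_naive are our own names, not part of the implementation used later in this post) that forms an explicit orthonormal basis of $\mathcal{K}_k(A, r_0)$ and solves the minimum residual problem over $x_0 + \mathcal{K}_k$ directly. It is numerically fragile for larger $k$, which is precisely why GMRES orthogonalises as it goes, but it mirrors the reformulation above.

using LinearAlgebra

# Orthonormal basis of K_k(A, r0) = span{r0, A r0, ..., A^(k-1) r0},
# built naively by stacking the Krylov vectors and orthonormalising them.
function krylov_basis(A, r0, k)
    K = zeros(length(r0), k)
    K[:, 1] = r0
    for i in 2:k
        K[:, i] = A * K[:, i-1]
    end
    return Matrix(qr(K).Q)      # thin Q factor: an n × k orthonormal basis
end

# Minimum-residual iterate over x0 + K_k: x_k = x0 + V*y with y minimising
# the Euclidean norm of A*(x0 + V*y) - b (dense least squares, demo only).
function gmres_naive(A, b, x0, k)
    r0 = A * x0 - b
    V = krylov_basis(A, r0, k)
    y = (A * V) \ (-r0)
    return x0 + V * y
end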
Petrov-Galerkin condition and FOM
In the previous post, we mostly discussed the choice of directions $p_k$ and not the criterion for choosing the approximation $x_k$, which we fixed to the minimum residual criterion.
A different criterion that can be considered is the Petrov-Galerkin condition, requiring the residual $r_k$ to be orthogonal to the space spanned by the directions $p_1, \dots, p_k$. This condition leads to the Full Orthogonalisation Method or FOM (Saad (2003)).
For generic matrices, this method does not perform very well, but for specific matrices, it leads to the very effective Conjugate Gradient method, as will be seen further below.
If we denote by $q_1, \dots, q_k$ the orthonormal basis corresponding to the $p_1, \dots, p_k$, the Petrov-Galerkin condition amounts to $\langle r_k, q_i\rangle = 0$ for $i = 1, \dots, k$ which, in matrix form, reads

$$ Q_k^T r_k \;=\; 0, $$

where $Q_k$ is the matrix with columns $q_1, \dots, q_k$. Following the same approach as in the discussion of the minres criterion, we can iteratively form $\tilde Q_k$ (orthonormal) and $R_k$ (upper triangular) such that $A Q_k = \tilde Q_k R_k$. With $x_k = x_0 + Q_k \beta_k$ for some $\beta_k \in \mathbb{R}^k$ to determine, so that $r_k = r_0 + \tilde Q_k R_k \beta_k$, the condition then reads

$$ (Q_k^T \tilde Q_k R_k)\, \beta_k \;=\; -\, Q_k^T r_0. $$

Like in the minres case, this is just a small linear problem in $\beta_k$ and some tricks can be applied to solve it cheaply. We won't dive much deeper though as, in the general case, FOM behaves much worse than GMRES, as is illustrated below in a rough benchmark.
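For comparison with the sketch of the previous section, the Petrov-Galerkin iterate can be written in the same naive style (reusing the hypothetical krylov_basis helper from above): with $x_k = x_0 + V y$ and $V$ an orthonormal basis of $\mathcal{K}_k$, the condition $V^T r_k = 0$ gives a small $k \times k$ system.

# Sketch of the FOM / Petrov–Galerkin iterate over x0 + K_k.
# With r_k = r0 + A*V*y, requiring V' * r_k = 0 gives (V' A V) y = -V' r0.
function fom_naive(A, b, x0, k)
    r0 = A * x0 - b
    V = krylov_basis(A, r0, k)
    y = (V' * A * V) \ (-(V' * r0))
    return x0 + V * y
end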
(Rough) implementation
Most of the code here is drawn from the previous post; we simplify the resolution of the least-squares problem to get the $\beta_k$ by using the generic backslash operator (we shouldn't, but it makes the code simpler). The main goal here is to get a quick comparison of using the minres vs the Petrov-Galerkin criterion.
using LinearAlgebra, StableRNGs
# Modified Gram–Schmidt step: orthogonalise v against the first k-1 columns
# of Q, store the normalised result in Q[:, k] and the coefficients in R[:, k].
function mgs!(Q::Matrix, R::Matrix, v::Vector, k::Int)
    n = length(v)
    Q[:, k] .= v
    for i = 1:k-1
        R[i, k] = dot(Q[:, k], Q[:, i])
        Q[:, k] .-= R[i, k] .* Q[:, i]
    end
    Q[:, k] = normalize(Q[:,k])
    R[k, k] = dot(v, Q[:, k])
    return
end
# Generic iterative solver from the previous post: at each step a new direction
# pk is picked (:rand, :krylov or :grad), orthogonalised with mgs!, and the
# coefficients βk are obtained with either the minimum-residual (:minres) or
# the Petrov–Galerkin (:fom) criterion.
function itersolve(A::Matrix, b::Vector, x0::Vector;
                   niter::Int=10, dirs=:rand, criterion=:minres)
    start = time()
    n = size(A, 1)
    r0 = A * x0 - b
    xk = copy(x0)
    rk = copy(r0)
    pk = zeros(n)
    Q = zeros(n, niter)
    Q̃ = zeros(n, niter)
    R = zeros(niter, niter)
    M = zeros(niter, niter)
    norm_rk = zeros(niter + 1)
    norm_rk[1] = norm(r0)
    times = zeros(niter+1)
    times[1] = time() - start
    #@views for k = 1:niter
    for k = 1:niter
        if dirs == :rand
            pk = randn(n)
        elseif dirs == :krylov
            pk = rk
        elseif dirs == :grad
            pk = A' * rk
        end
        mgs!(Q, R, pk, k)
        mgs!(Q̃, R, A*Q[:, k], k)
        if criterion == :minres
            γk = -Q̃[:, 1:k]' * r0
            βk = R[1:k, 1:k] \ γk                                   # (S1)
        elseif criterion == :fom
            # compute M = (Q'Q̃) iteratively
            if k > 1
                M[1:k-1, k] .= Q[:, 1:k-1]' * Q̃[:, k]
                M[k, 1:k-1] .= vec(Q[:, k]' * Q̃[:, 1:k-1])
            end
            M[k, k] = dot(Q[:, k], Q̃[:, k])
            # solve (Q'Q̃R)β ≈ (-Q'r0)
            βk = (M[1:k, 1:k] * R[1:k, 1:k]) \ (-Q[:, 1:k]' * r0)   # (S2)
        end
        xk .= x0 .+ Q[:, 1:k] * βk
        rk .= A * xk .- b
        norm_rk[k+1] = norm(rk)
        times[k+1] = time() - start
    end
    return xk, norm_rk, times
end
itersolve (generic function with 1 method)
As usual, the code above is not optimised; in particular, at the places marked (S1) and (S2), a dedicated procedure should be used rather than the generic backslash operator.
Our way to compute the coefficients corresponding to the FOM criterion is very crude, but our goal here is simply to have a quick comparison with the other criterion.
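To give an idea of what a first improvement at (S1) could look like: the factor $R_k$ there is upper triangular by construction, so one could at least tell Julia about that structure and have the system solved by back-substitution rather than by a generic factorisation (a minimal tweak, not a proper fix):

# Since R[1:k, 1:k] is upper triangular by construction, wrapping it lets
# Julia solve the system by back-substitution:
βk = UpperTriangular(R[1:k, 1:k]) \ γk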
using PyPlot
rng = StableRNG(908)
n = 100
A = randn(rng, n, n)^2 .+ 5 * rand(rng, n, n)
x = randn(rng, n)
b = A * x
x0 = randn(rng, n)
r0 = A * x0 - b
figure(figsize=(8, 6))
for criterion in (:minres, :fom)
    for dirs in (:rand, :grad, :krylov)
        res = itersolve(A, b, x0; niter=n, dirs=dirs, criterion=criterion)
        semilogy(res[2], label="$(dirs)_$(criterion)")
    end
end
xlabel("Number of iterations")
ylabel("Norm of the residuals")
legend()
As can be seen on the graph above, FOM can have a much less smooth convergence, and GMRES is usually preferred to it (at least to the naive form of FOM presented above).
Conjugate gradient method
While FOM typically underperforms GMRES-style updates, in the case where $A$ is symmetric positive definite (SPD) it can be modified to lead to a very useful and efficient algorithm: the conjugate gradient method.
For the rest of this section, $A$ is assumed to be SPD, i.e. $A = A^T$ and $\langle x, Ax\rangle > 0$ for any $x \neq 0$. Recall that if $A$ is SPD it induces an inner product on $\mathbb{R}^n$ with

$$ \langle u, v\rangle_A \;:=\; \langle u, Av\rangle. $$

Indeed, it is trivially bilinear, symmetric since $\langle u, Av\rangle = \langle A^T u, v\rangle = \langle Au, v\rangle = \langle v, Au\rangle$, and positive definite since $A$ is positive definite. Correspondingly, we can define a norm $\|u\|_A := \sqrt{\langle u, u\rangle_A}$.
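In code, the $A$-inner product and the associated norm are one-liners; the sketch below (with our own names inner_A and norm_A) is only meant to fix ideas.

using LinearAlgebra

# Inner product and norm induced by an SPD matrix A.
inner_A(u, v, A) = dot(u, A * v)
norm_A(u, A) = sqrt(inner_A(u, u, A))

# Quick sanity check on a random SPD matrix:
B = randn(5, 5); S = B'B + I
u, v = randn(5), randn(5)
inner_A(u, v, S) ≈ inner_A(v, u, S)   # symmetry
norm_A(u, S) > 0                      # positive definiteness (u ≠ 0)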
Petrov-Galerkin condition again
Let's start with the basic iteration again, assuming we have an iteratively constructed set of linearly independent directions $p_1, \dots, p_k$ which we express in some basis $u_1, \dots, u_k$. Then, the iteration amounts to

$$ x_k \;=\; x_{k-1} + \beta_k u_k $$

with the $\beta_k$ such that the residual $r_k$ is perpendicular to the space spanned by the $p_i$ (and so by the $u_i$), i.e.:

$$ \langle r_k, u_i\rangle \;=\; 0, \qquad i = 1, \dots, k. $$

Now since $r_k = r_{k-1} + \beta_k A u_k$ and, by construction, $r_{k-1}$ must be orthogonal to the $u_i$ for $i < k$, the condition becomes

$$ \beta_k \langle A u_k, u_i\rangle \;=\; 0 \;\; (i < k), \qquad \langle r_{k-1}, u_k\rangle + \beta_k \langle A u_k, u_k\rangle \;=\; 0. $$

In the case where $A$ is SPD, we can construct a basis $u_1, \dots, u_k$ that makes the above condition particularly simple to verify. Before doing that, we give an alternative interpretation of the iterations using the PG condition.
Correspondence between Petrov-Galerkin and $A$-projections
With the choice $p_k = r_{k-1}$, the $u_i$ span the Krylov subspace $\mathcal{K}_k$, and the orthogonality condition requires that $r_k$ be orthogonal to $\mathcal{K}_k$. When $A$ is SPD, we show below that the iterations with the PG condition are equivalent to the following iterations:

$$ x_k \;=\; \arg\min_{x \,\in\, x_0 + \mathcal{K}_k} \|x - x^\star\|_A, $$

where $x^\star := A^{-1}b$ and $\mathcal{K}_k := \mathcal{K}_k(A, r_0)$. In other words, a sequence of projections of $x^\star$ onto the affine spaces $x_0 + \mathcal{K}_k$ in the geometry induced by $A$ ("$A$-projections").
In order to show the equivalence, let's expand the $A$-norm and drop the term that doesn't depend on $x$:

$$ \|x - x^\star\|_A^2 \;=\; \langle x, Ax\rangle - 2\langle x, Ax^\star\rangle + \langle x^\star, Ax^\star\rangle, $$

so that, since $Ax^\star = b$, minimising $\|x - x^\star\|_A$ over $x_0 + \mathcal{K}_k$ amounts to minimising $f(x) := \tfrac12\langle x, Ax\rangle - \langle x, b\rangle$.
This is a convex minimisation problem since the objective is convex (as $A$ is SPD) over a convex set (convex since it's an affine subspace). We can thus consider the first-order optimality condition (FOC) of the problem. The gradient of the objective function is $\nabla f(x) = Ax - b$ and the FOC at the minimiser $x_k$ is:

$$ -(A x_k - b) \;\in\; N_{x_0 + \mathcal{K}_k}(x_k), $$

where the right-hand side is the normal cone to $x_0 + \mathcal{K}_k$ at $x_k$. This amounts to requiring

$$ \langle A x_k - b,\, x - x_k\rangle \;\ge\; 0 \qquad \text{for all } x \in x_0 + \mathcal{K}_k. $$

Since both $x$ and $x_k$ are in $x_0 + \mathcal{K}_k$, we can let $x = x_k + z$ and require $\langle A x_k - b, z\rangle \ge 0$ for any $z \in \mathcal{K}_k$. For this to always hold (both $z$ and $-z$ are in $\mathcal{K}_k$), we must have $A x_k - b = A(x_k - x^\star)$ be orthogonal to $\mathcal{K}_k$, or:

$$ \langle x_k - x^\star, z\rangle_A \;=\; 0 \qquad \text{for all } z \in \mathcal{K}_k. $$

Equivalently, $\langle A x_k - b, z\rangle = 0$ for all $z \in \mathcal{K}_k$ or

$$ \langle r_k, u_i\rangle \;=\; 0, \qquad i = 1, \dots, k, $$

with $r_k = A x_k - b$ and $\mathcal{K}_k = \mathrm{span}\{u_1, \dots, u_k\}$; the above equation is equivalent to requiring $r_k$ to be perpendicular to the $u_i$, which is the Petrov-Galerkin condition.
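As a quick numerical sanity check of this equivalence (purely illustrative, reusing the hypothetical krylov_basis helper sketched earlier), we can compute the $A$-projection of $x^\star$ onto $x_0 + \mathcal{K}_k$ by solving the corresponding small system and verify that the resulting residual is indeed orthogonal to the Krylov basis:

using LinearAlgebra, StableRNGs

let rng = StableRNG(1), n = 50, k = 10
    B = randn(rng, n, n); S = B'B + I       # SPD test matrix
    xstar = randn(rng, n); b = S * xstar
    x0 = randn(rng, n); r0 = S * x0 - b
    V = krylov_basis(S, r0, k)
    # A-projection of xstar onto x0 + span(V): minimising the A-norm of
    # x0 + V*y - xstar over y gives (V'SV) y = V'S(xstar - x0) = -V'r0.
    y = (V' * S * V) \ (-(V' * r0))
    xk = x0 + V * y
    norm(V' * (S * xk - b))                 # ≈ 0: Petrov–Galerkin condition
end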
$A$-orthonormal basis
Let us introduce the notation $\perp_A$, similar to our previous $\perp$, with $u \perp_A v$ meaning $\langle u, v\rangle_A = 0$ for nonzero $u$ and $v$. Then, in much the same way that we can use Gram-Schmidt to produce an orthonormal basis $q_1, \dots, q_k$ out of a set of linearly independent vectors $p_1, \dots, p_k$ with

$$ q_k \;\propto\; p_k - \sum_{i<k}\langle p_k, q_i\rangle\, q_i $$

and $q_i \perp q_j$ for $i \neq j$, we can produce an $A$-orthonormal basis $u_1, \dots, u_k$ with

$$ u_k \;\propto\; p_k - \sum_{i<k}\langle p_k, u_i\rangle_A\, u_i $$

and $u_i \perp_A u_j$ for $i \neq j$ as well as $\|u_i\|_A = 1$.
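A direct, unoptimised translation of this $A$-weighted Gram-Schmidt could look as follows (the name gram_schmidt_A is ours); it is only meant to illustrate the definition, the efficient recurrence comes next.

# A-orthonormalise the (linearly independent) columns of P:
# returns U with span(U) == span(P) and U' * A * U close to the identity.
function gram_schmidt_A(P::Matrix, A::Matrix)
    n, k = size(P)
    U = zeros(n, k)
    for j in 1:k
        u = P[:, j]
        for i in 1:j-1
            u -= dot(P[:, j], A * U[:, i]) .* U[:, i]   # remove A-projections
        end
        U[:, j] = u / sqrt(dot(u, A * u))               # A-normalise
    end
    return U
end

For an SPD matrix S, U = gram_schmidt_A(P, S) should then be numerically $A$-orthonormal, i.e. U' * S * U is close to the identity.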
Working in an $A$-orthonormal basis
If we have iteratively constructed an $A$-orthonormal basis $u_1, \dots, u_k$, then the first part of the condition above holds automatically (since $\langle A u_k, u_i\rangle = \langle u_k, u_i\rangle_A = 0$ for $i < k$) and, using $\langle A u_k, u_k\rangle = \|u_k\|_A^2 = 1$, verifying it simply requires

$$ \beta_k \;=\; -\,\langle r_{k-1}, u_k\rangle. $$

This is already fairly nice. Even nicer though is that constructing the $A$-orthonormal basis with the "Krylov" direction choice (i.e. $p_{k+1} = r_k$) can be done with a simple recurrence, as shown below.
The Gram-Schmidt step to obtain $u_{k+1}$ when $p_{k+1} = r_k$ is

$$
\begin{aligned}
u_{k+1} \;&\propto\; r_k - \sum_{i=1}^{k}\langle r_k, u_i\rangle_A\, u_i
\;=\; r_k - \langle r_k, A u_k\rangle\, u_k \\
&=\; r_{k-1} + \beta_k A u_k - \langle r_k, A u_k\rangle\, u_k \\
&=\; \langle r_{k-1}, A u_{k-1}\rangle\, u_{k-1} + \big(\eta_k - \langle r_k, A u_k\rangle\big)\, u_k + \beta_k A u_k \\
&=\; -\,\beta_k \langle A u_k, A u_{k-1}\rangle\, u_{k-1} + \big(\eta_k - \langle r_k, A u_k\rangle\big)\, u_k + \beta_k A u_k.
\end{aligned}
$$

In this development, we

- expanded $r_k = r_{k-1} + \beta_k A u_k$ and made the un-normalised expression for $u_k$ appear with $\eta_k$ the normalising constant, i.e. $r_{k-1} = \eta_k u_k + \langle r_{k-1}, A u_{k-1}\rangle\, u_{k-1}$,
- used that the sums only have their last term non-zero (see below); for the last line we also used that $\langle r_k, A u_{k-1}\rangle = \langle r_{k-1}, A u_{k-1}\rangle + \beta_k\langle A u_k, A u_{k-1}\rangle = 0$ since $A u_{k-1} \in \mathcal{K}_k$ and $r_k \perp \mathcal{K}_k$, a form which is convenient in practice because $r_{k-1}$ is no longer available once the residual has been updated.

With the choice $p_{k+1} = r_k$, we showed earlier that the vectors $u_1, \dots, u_k$ span the Krylov subspace $\mathcal{K}_k(A, r_0)$.
By definition of the Krylov subspace, $A u_i \in \mathcal{K}_{i+1} \subseteq \mathcal{K}_k$ for $i \le k-1$ (and $\mathcal{K}_{i+1} \subseteq \mathcal{K}_{k-1}$ for $i \le k-2$).
Since the Petrov-Galerkin condition gives $\langle r_k, z\rangle = 0$ for every $z \in \mathcal{K}_k$, all terms in the sum $\sum_{i\le k}\langle r_k, A u_i\rangle u_i$ vanish apart from the last one, as announced; the same argument one step earlier leaves only the last term of the sum defining $u_k$, which gives the expression of $r_{k-1}$ used above.
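The vanishing of these terms can easily be checked numerically. The small, self-contained experiment below (purely illustrative) runs a few Petrov-Galerkin steps with the full $A$-weighted Gram-Schmidt, without any recurrence, and prints $\max_{i \le k-2} |\langle r_{k-1}, A u_i\rangle|$, which should stay at roundoff level.

using LinearAlgebra, StableRNGs

let rng = StableRNG(7), n = 50, K = 8
    B = randn(rng, n, n); S = B'B + I       # SPD test matrix
    b = S * randn(rng, n)
    x = randn(rng, n)
    r = S * x - b                           # r_{k-1} at the start of step k
    U = zeros(n, K)
    for k in 1:K
        if k >= 3
            # terms i ≤ k-2 of the Gram–Schmidt sum should all vanish
            println(maximum(abs, (S * U[:, 1:k-2])' * r))
        end
        u = copy(r)                         # p_k = r_{k-1}
        for i in 1:k-1
            u .-= dot(r, S * U[:, i]) .* U[:, i]
        end
        U[:, k] = u ./ sqrt(dot(u, S * u))  # A-normalise
        β = -dot(r, U[:, k])
        x .+= β .* U[:, k]
        r .+= β .* (S * U[:, k])
    end
end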
Implementation
In the simple implementation below, we use three pairs of vectors in $\mathbb{R}^n$ for storage ($u_1, Au_1$, $u_2, Au_2$ and $u, Au$) as well as $x$ and $r$ (we could even reduce that). The important point is that the recurrence relation allows us to compute the vectors $u_k$ iteratively without having to keep all of them in memory (unlike GMRES-style iterations, which have linearly growing memory requirements).
# Conjugate gradient via the recurrence derived above (A is assumed SPD).
function conjgrad(A::Matrix, b::Vector, x0::Vector; niter=10)
    x = copy(x0)
    r = A * x .- b
    norm_rk = zeros(niter+1)
    norm_rk[1] = norm(r)
    # We need to compute the first two steps
    # before starting to use the recurrence
    u1 = r
    Au1 = A * u1
    η1 = sqrt(dot(u1, Au1))
    u1 /= η1
    Au1 /= η1
    α = -dot(r, u1)
    x .+= α .* u1
    r .+= α .* Au1
    norm_rk[2] = norm(r)
    u2 = r .- dot(r, Au1) .* u1
    Au2 = A * u2
    η2 = sqrt(dot(u2, Au2))
    u2 /= η2
    Au2 /= η2
    α = -dot(r, u2)
    x .+= α .* u2
    r .+= α .* Au2
    norm_rk[3] = norm(r)
    u = copy(u2)
    Au = copy(Au2)
    for k = 1:niter-2
        # Recurrence relation
        u .= -(α * dot(Au2, Au1)) .* u1 .+ (η2 - dot(r, Au2)) .* u2 .+ (α .* Au2)
        Au .= A * u
        η = sqrt(dot(u, Au))
        u ./= η
        Au ./= η
        α = -dot(r, u)
        x .+= α .* u
        r .+= α .* Au
        η2 = η
        u1 .= u2
        Au1 .= Au2
        u2 .= u
        Au2 .= Au
        norm_rk[3+k] = norm(r)
    end
    return x, norm_rk
end
In the implementation above, the step with dominant complexity is the matrix-vector multiplication to compute $Au$; everything else has negligible complexity in comparison.
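To also make the memory point from earlier concrete, here is a rough back-of-the-envelope comparison (illustrative numbers only): the GMRES-style itersolve above stores two $n \times \mathrm{niter}$ bases, while conjgrad keeps a fixed handful of length-$n$ vectors regardless of the number of iterations.

# Rough storage estimate in MB for Float64 entries (illustrative only).
let n = 10_000, niter = 200
    gmres_mb = 2 * n * niter * 8 / 1e6   # Q and Q̃ in itersolve
    cg_mb = 8 * n * 8 / 1e6              # x, r, u1, Au1, u2, Au2, u, Au in conjgrad
    (gmres_mb, cg_mb)                    # (32.0, 0.64)
end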
Comparison
In the example below, we construct a simple and well-conditioned SPD system, and compare the various methods discussed so far. Again, this is not meant as a thorough benchmark but rather as a way to make a couple of interesting observations.
using IterativeSolvers
begin
rng = StableRNG(111)
# construct a simple SPD matrix
n = 100
A = randn(rng, n, n)
A = A'A + 2I
κA = round(cond(A), sigdigits=3)
println("Condition number κ(A): $κA")
x = randn(rng, n)
b = A * x
x0 = randn(rng, n)
xcg, nrk = conjgrad(A, b, copy(x0); niter=n)
res_krylov = itersolve(A, b, x0; niter=n, dirs=:krylov, criterion=:minres)
res_grad = itersolve(A, b, x0; niter=n, dirs=:grad, criterion=:minres)
res_fom_k = itersolve(A, b, x0; niter=n, dirs=:krylov, criterion=:fom)
res_fom_g = itersolve(A, b, x0; niter=n, dirs=:grad, criterion=:fom)
xcg, lcg = cg!(copy(x0), A, b; maxiter=n, log=true, reltol=0)
ncg = [norm(A*x0-b), lcg.data[:resnorm]...]
figure(figsize=(8, 6))
semilogy(nrk, label="CG", marker="v")
semilogy(res_krylov[2], label="GMRES (krylov)")
semilogy(res_grad[2], label="GMRES (grad)")
semilogy(res_fom_k[2], label="FOM (krylov)", linestyle="--")
semilogy(res_fom_g[2], label="FOM (grad)", linestyle="--")
semilogy(ncg, label="CG (IterativeSolvers)", marker="x")
xlabel("Number of iterations")
ylabel("Norm of the residuals")
legend()
end
Condition number κ(A): 194.0
On the graph above we can see a few things:

- our implementation of CG gives essentially the same results as that of the IterativeSolvers package, which gives some confidence in the correctness of conjgrad,
- GMRES and FOM with Krylov directions have a very similar behaviour and are better than CG for a large number of iterations,
- GMRES and FOM with "grad" directions have a very similar behaviour and are quite a bit worse than the other two.
Recall that all these methods have the same per-step complexity (dominated by one matrix-vector product), with the exception of the ones using the grad direction (two matrix-vector products).
The first point above is the most important one here; the other two help highlight that it can be difficult to compare iterative methods in general. Consider for instance a very similar example but with a much worse condition number:
begin
rng = StableRNG(111)
# construct a simple SPD matrix
n = 100
A = randn(rng, n, n)
A = (A'A + 2I)^4
κA = round(cond(A), sigdigits=3)
println("Condition number κ(A): $κA")
x = randn(rng, n)
b = A * x
x0 = randn(rng, n)
xcg, nrk = conjgrad(A, b, copy(x0); niter=n)
res_krylov = itersolve(A, b, x0; niter=n, dirs=:krylov, criterion=:minres)
res_grad = itersolve(A, b, x0; niter=n, dirs=:grad, criterion=:minres)
res_fom_k = itersolve(A, b, x0; niter=n, dirs=:krylov, criterion=:fom)
res_fom_g = itersolve(A, b, x0; niter=n, dirs=:grad, criterion=:fom)
xcg, lcg = cg!(copy(x0), A, b; maxiter=n, log=true, reltol=0)
ncg = [norm(A*x0-b), lcg.data[:resnorm]...]
figure(figsize=(8, 6))
semilogy(nrk, label="CG", marker="v")
semilogy(res_krylov[2], label="GMRES (krylov)")
semilogy(res_grad[2], label="GMRES (grad)")
semilogy(res_fom_k[2], label="FOM (krylov)", linestyle="--")
semilogy(res_fom_g[2], label="FOM (grad)", linestyle="--")
semilogy(ncg, label="CG (IterativeSolvers)", marker="x")
xlabel("Number of iterations")
ylabel("Norm of the residuals")
legend()
end
Condition number κ(A): 1.42e9
So in this case CG does poorly whereas the other methods still eventually converge (though they require a large number of steps).
What we didn't discuss
In these two posts, the main goal was to re-obtain GMRES and CG from scratch and in a constructive fashion. There are, however, quite a few interesting things that we didn't discuss in order to keep the presentation reasonably uncluttered; these include:
how these iterative methods relate to the Hessenberg decomposition, where $A = QHQ^T$ for an orthonormal matrix $Q$ and a Hessenberg matrix $H$ (a matrix with $H_{ij} = 0$ for $i > j+1$),
how the iterative orthogonalisation procedure (Modified Gram-Schmidt) in the case of Krylov subspaces connects to methods for obtaining eigenvalues (the Arnoldi iteration and, in the symmetric case, the Lanczos algorithm) (see e.g. Strang (2006)),
how these iterative methods can suffer from stalling and how to deal with it, and what kind of guarantees they can offer,
the use of preconditioning to speed up convergence,
extensions such as BiCGStab van der Vorst (1992).
Short references
Saad and Schultz, GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems, 1986. – The original GMRES paper.
Saad, Iterative Methods for Sparse Linear Systems, 2003. – A reference book on iterative methods by one of the authors of GMRES.
van der Vorst, Bi-CGSTAB: A Fast and Smoothly Converging Variant of Bi-CG for the Solution of Nonsymmetric Linear Systems, 1992. – The original BiCGStab paper, unfortunately not in open access.
van der Vorst and Chan, Linear System Solvers: Sparse Iterative Methods, 1997. – A book chapter discussing iterative methods and, in particular, CG, GMRES etc.
Strang, Krylov Subspaces and Conjugate Gradients, 2006. – Lecture notes from Gil Strang on the topic.