Abstract
Recent research efforts have shown that Jacobi and block-Jacobi relaxation methods can be used as an effective and highly parallel approach for the solution of sparse triangular linear systems arising in the application of ILU-type preconditioners. Simultaneously, a few independent works have focused on designing efficient high-performance adaptive-precision block-Jacobi preconditioners (block-diagonal scaling) for the iterative solution of sparse linear systems on manycore architectures. In this paper, we bridge the gap between relaxation methods based on regular splittings and preconditioners by demonstrating that iterative refinement can be leveraged to construct a relaxation method from the preconditioner. In addition, we exploit this insight to construct a highly efficient sparse triangular system solver for graphics processors that combines iterative refinement with the block-Jacobi preconditioner available in the Ginkgo library.
1 Introduction
A significant number of today's leading high performance computing systems integrate hardware accelerators, such as graphics processing units (GPUs), with hundreds to thousands of arithmetic units. In consequence, there is a strong incentive to extract as much parallelism as possible from these platforms when implementing numerical algorithms to satisfy the ever-increasing computational demands of complex simulations. Furthermore, as memory traffic is much more expensive than computation in current systems, minimizing the overhead due to memory accesses is crucial in order to implement well-performing algorithms.
Keeping these aspects in mind, in this paper we aim to combine efforts in both directions –exploit hardware parallelism efficiently while tackling the memory bandwidth bottleneck– to construct highly efficient iterative solvers for sparse triangular (linear) systems.
Previous research has shown that block-Jacobi relaxation provides a favourable means to exploit parallelism for the solution of sparse triangular systems arising in incomplete factorization preconditioning [4] when approximate solutions are acceptable. Independently, it has also been found that, for certain types of problems, using adaptive precision in block-Jacobi preconditioning [3] can drastically reduce memory traffic and, therefore, runtime.
In this paper, we make the following contributions by combining the insights gained in [4] and [3]:
-
We propose an alternative approach to derive relaxation methods based on regular matrix splittings. This enables us to establish an explicit relation between iterative refinement, preconditioners, and relaxation methods (Sect. 2).
-
Furthermore, we exploit this link in practice, by leveraging the highly optimized adaptive-precision block-Jacobi preconditioner and the iterative refinement components from the Ginkgo open source library (https://ginkgo-project.github.io) in order to assemble an adaptive-precision block-Jacobi relaxation method (Sect. 3).
-
In addition, we employ this relaxation method as the triangular solver for the Incomplete Cholesky (IC) preconditioner in a Conjugate Gradient solver (ICCG, Sect. 3).
-
Finally, we evaluate the efficiency and effectiveness of the presented methods by testing them on a selection of matrices from the SuiteSparse Matrix Collection [1] (Sect. 4). Concretely, we compare the performance of our adaptive-precision block-Jacobi implementation against a fixed precision block-Jacobi relaxation method as well as Ginkgo’s direct triangular solvers.
The main goal of this paper is to demonstrate the benefits of our adaptive-precision block-Jacobi approach versus a fixed-precision block-Jacobi relaxation. For this reason, we choose test problems for which an iterative solver is a valid choice for the triangular solves appearing in incomplete factorization preconditioning. We recognize that the applicability of our approach remains problem dependent [4], and there are cases where it will not provide an efficient alternative, particularly when the iterative triangular system solvers fail to converge quickly.
2 Background
Consider the linear system \(Ax=b\), where A is an \(n \times n\) input matrix, b is the input right-hand side vector, comprising n components, and x is the sought-after solution, also with n components. The iterative refinement method (IR) for this linear system is then defined as the recurrence [7]:
$$\begin{aligned} r_k&:= b - Ax_k, \end{aligned}$$(1)
$$\begin{aligned} q_k&:= {{\,\mathrm{Solve}\,}}(A, r_k), \end{aligned}$$(2)
$$\begin{aligned} x_{k+1}&:= x_k + q_k, \end{aligned}$$(3)
where \(x_k\) is the current approximation to the solution x; \(r_k\) is the current residual; \({{\,\mathrm{Solve}\,}}(A, b)\) denotes a “coarse” solver that provides an approximate solution to \(Ax = b\); and \(q_k\) corresponds to the approximation of the error \(x - x_k\) obtained using the coarse solver. This method can also be expressed as a single equation:
$$\begin{aligned} x_{k+1}&:= x_k + {{\,\mathrm{Solve}\,}}(A, b - Ax_k), \end{aligned}$$(4)
which is the formulation that will be used in the remainder of this work.
Preconditioning refers to replacing the system \(Ax = b\) with an equivalent counterpart \(M^{-1}Ax = M^{-1}b\) (a variant known as left preconditioning) in the hope of improving the numerical properties of the transformed system matrix (\(M^{-1}A\)), which in turn accelerates the convergence of the iterative method used to tackle the transformed system. Applying preconditioning to IR, by replacing every occurrence of A with \(M^{-1}A\) and those of b with \(M^{-1}b\), yields the preconditioned iterative refinement method (PIR):
$$\begin{aligned} x_{k+1}&:= x_k + {{\,\mathrm{Solve}\,}}(M^{-1}A, M^{-1}b - M^{-1}Ax_k) \end{aligned}$$(5)
$$\begin{aligned}&\,= x_k + {{\,\mathrm{Solve}\,}}(M^{-1}A, M^{-1}(b - Ax_k)). \end{aligned}$$(6)
Equation (6) can be viewed as a variant of IR where the coarse method \({{\,\mathrm{Solve}\,}}\) is replaced with its preconditioned variant \({{\,\mathrm{Solve}\,}}_M\), which tackles the system \(Ax = b\) by applying \({{\,\mathrm{Solve}\,}}\) to the transformed (preconditioned) system \(M^{-1}Ax = M^{-1}b\):
$$\begin{aligned} {{\,\mathrm{Solve}\,}}_M(A, b)&:= {{\,\mathrm{Solve}\,}}(M^{-1}A, M^{-1}b). \end{aligned}$$(7)
The convergence rate of IR (or PIR) is directly tied to the accuracy of \({{\,\mathrm{Solve}\,}}\) (or \({{\,\mathrm{Solve}\,}}_M\)). In particular, if the relative accuracy of \({{\,\mathrm{Solve}\,}}\) is given by a parameter \(\delta \) (i.e., \(\Vert \hat{q_k} - q_k \Vert \le \delta \Vert \hat{q_k} \Vert \), where \(\hat{q_k}\) denotes the true solution of the system \(A\hat{q_k} = r_k\)), then the errors of two consecutive approximations to the solution x satisfy:
$$\begin{aligned} \Vert x - x_{k+1} \Vert&\le \delta \Vert x - x_k \Vert . \end{aligned}$$(8)
This is a direct consequence of the following two equalities:
$$\begin{aligned} x - x_{k+1}&= x - x_k - q_k = \hat{q_k} - q_k, \end{aligned}$$(9)
$$\begin{aligned} \hat{q_k}&= A^{-1}r_k = A^{-1}(b - Ax_k) = x - x_k. \end{aligned}$$(10)
A simple method to approximate the solution of the system consists of using \({{\,\mathrm{Solve}\,}}(A, b) := b\); that is, the solution is approximated by the right-hand side vector b of the system. The preconditioned version of this method is therefore obtained by applying the same reasoning to the system \(M^{-1}A x = M^{-1}b\), which results in \({{\,\mathrm{Solve}\,}}_M(A, b) := M^{-1}b\). That is, the solution is obtained by applying the preconditioner to the right-hand side vector. Since the preconditioner application is one of the most crucial kernels involving preconditioners, an added bonus of this approach is that sparse linear algebra libraries usually include highly optimized kernels for its computation.
Using the definition of \({{\,\mathrm{Solve}\,}}_M(A, b):= M^{-1}b\) in PIR from Eq. (7) results in the method:
$$\begin{aligned} x_{k+1}&:= x_k + M^{-1}(b - Ax_k) = (I - M^{-1}A)x_k + M^{-1}b = M^{-1}Nx_k + M^{-1}b, \end{aligned}$$(11)
where the matrix \(N := M - A\) satisfies that \(A = M - N\).
A number of relevant observations follow from the previous elaboration:
-
Eq. (11) is exactly the definition of a relaxation method induced by the regular splitting \(A = M - N\) [9]. Thus, every such method can be expressed as an IR whose inner solver is \({{\,\mathrm{Solve}\,}}(A, b) := b\) preconditioned with the matrix M from the regular splitting.
-
Additionally, a PIR method preconditioned with \(M = A + R\), where \(R := M - A\) is the preconditioner's residual matrix, induces a relaxation method \(x_{k+1} := M^{-1}R x_k + M^{-1}b\).
-
Finally, Eq. (8) demonstrates that the convergence rate of such methods is given by the parameter \(\delta \) of the preconditioner M:
$$\begin{aligned} \delta&:= \max _{y \in \mathbb {R}^n} \frac{\Vert M^{-1}y - A^{-1}y \Vert }{\Vert A^{-1}y \Vert }, \quad y \ne 0. \end{aligned}$$(12)
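To make the first observation concrete, the following self-contained sketch (a hypothetical \(3 \times 3\) system, independent of any library) performs the Jacobi relaxation induced by the splitting \(A = M - N\) with \(M = \mathrm {diag}(A)\), and the IR iteration with \({{\,\mathrm{Solve}\,}}(A, b) := b\) preconditioned by the same M; the two produce the same iterates up to rounding.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<std::array<double, 3>, 3>;

Vec3 matvec(const Mat3& A, const Vec3& x) {
    Vec3 y{};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j) y[i] += A[i][j] * x[j];
    return y;
}

// Relaxation step induced by the splitting A = M - N, with M = diag(A):
//   x_{k+1} = M^{-1}(N x_k + b), where (N x)_i = -sum_{j != i} a_ij x_j.
Vec3 relaxation_step(const Mat3& A, const Vec3& b, const Vec3& x) {
    Vec3 y{};
    for (int i = 0; i < 3; ++i) {
        double s = b[i];
        for (int j = 0; j < 3; ++j)
            if (j != i) s -= A[i][j] * x[j];
        y[i] = s / A[i][i];
    }
    return y;
}

// IR step with Solve(A, b) := b, preconditioned with M = diag(A):
//   x_{k+1} = x_k + M^{-1}(b - A x_k).
Vec3 ir_step(const Mat3& A, const Vec3& b, const Vec3& x) {
    Vec3 r = matvec(A, x);
    Vec3 y = x;
    for (int i = 0; i < 3; ++i) y[i] += (b[i] - r[i]) / A[i][i];
    return y;
}
```

Running both iterations side by side from the zero vector on a diagonally dominant matrix, the iterates coincide and converge to the solution of \(Ax = b\), illustrating the equivalence established above.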
3 Assembling the Algorithm with Ginkgo’s Components
The open-source Ginkgo linear algebra library provides high-performance implementations of popular Krylov methods and preconditioners. Ginkgo also includes efficient realizations of IR, particularly for manycore accelerators such as GPUs. While there is (yet) no direct support for the block-Jacobi relaxation method in Ginkgo, the library is equipped with a highly optimized implementation of the block-Jacobi preconditioner. This implementation also integrates an adaptive-precision storage scheme [5], which reduces the total volume of data fetched from memory at each iteration, while preserving the quality of the preconditioner measured by \(\delta \).
The previous section showed that a relaxation method can be easily constructed by combining IR with a preconditioner. Concretely, the adaptive-precision block-Jacobi preconditioner from Ginkgo yields an adaptive-precision version of the block-Jacobi relaxation method. While the full-precision block-Jacobi relaxation method has proved effective in providing an approximate solution for sparse triangular systems arising from incomplete LU (ILU)-type preconditioners [4], in this work we focus on the method automatically constructed from the preconditioner in Ginkgo, which allows seamless enhancements via the adaptive-precision storage scheme. The code used to construct this method is illustrated in Listing 1.1.
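Listing 1.1 assembles the method from Ginkgo's factory classes; as a library-independent illustration of what the assembled method computes, the following self-contained sketch applies block-Jacobi IR sweeps to a hypothetical \(4 \times 4\) lower-triangular system with block size 2. Since the iteration matrix \(I - M^{-1}L\) is strictly block lower triangular, the sweeps reproduce the exact solution after as many sweeps as there are diagonal blocks.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec4 = std::array<double, 4>;
using Mat4 = std::array<std::array<double, 4>, 4>;

// One block-Jacobi IR sweep x <- x + M^{-1}(b - L x) for a lower-triangular
// system L x = b, where M collects the two 2x2 diagonal blocks of L.
Vec4 bj_ir_sweep(const Mat4& L, const Vec4& b, const Vec4& x) {
    // Residual r = b - L x.
    Vec4 r = b;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) r[i] -= L[i][j] * x[j];
    Vec4 y = x;
    for (int blk = 0; blk < 2; ++blk) {
        int o = 2 * blk;  // offset of the current diagonal block
        // The diagonal block [a 0; c d] is itself lower triangular,
        // so q = block^{-1} r is obtained by forward substitution.
        double a = L[o][o], c = L[o + 1][o], d = L[o + 1][o + 1];
        double q0 = r[o] / a;
        double q1 = (r[o + 1] - c * q0) / d;
        y[o] += q0;
        y[o + 1] += q1;
    }
    return y;
}
```

For a system with two diagonal blocks, two sweeps starting from the zero vector already deliver the exact solution (up to rounding); on large systems, the sweep count becomes the accuracy parameter studied in Sect. 4.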

Note that, with the new approach, we have to perform a matrix-vector multiplication with \(M^{-1}\) and a second matrix-vector product with A (per iteration), whereas the classical block-Jacobi method involves one matrix-vector multiplication with N and a second with \(M^{-1}\) (per iteration). Since in the block-Jacobi method M is a block-diagonal submatrix of A, \(N = M - A\) will generally have fewer nonzero elements than A. This means that the IR-induced formulation can be expected to perform more floating-point operations per iteration than the classical block-Jacobi method. However, the IR approach can leverage Ginkgo’s highly optimized block-Jacobi kernels, including their adaptive-precision storage realization. Furthermore, as presented in Listing 1.1, the development effort is minimal.
Ginkgo also offers a direct means to generate an ILU-type preconditioner, with IR as inner solver, and integrate the generated preconditioner into a Krylov solver in order to obtain a general sparse linear system solver. Listing 1.2 shows the complete code needed to tackle a sparse symmetric positive definite (s.p.d.) system with a CG solver preconditioned with an ILU/incomplete Cholesky (IC) decomposition using the presented adaptive-precision IR approach for solving the triangular systems occurring at each iteration when applying the preconditioner.
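Listing 1.2 again relies on Ginkgo's factories; the following self-contained sketch mirrors its structure without the library. As a simplifying assumption, it uses a complete (rather than incomplete) Cholesky factor of a hypothetical \(3 \times 3\) s.p.d. matrix, and each preconditioner application performs a fixed number of scalar-Jacobi IR sweeps on the two triangular factors; all names and sizes are illustrative.

```cpp
#include <array>
#include <cassert>
#include <cmath>

constexpr int N = 3;
using Vec = std::array<double, N>;
using Mat = std::array<std::array<double, N>, N>;

Vec matvec(const Mat& A, const Vec& x) {
    Vec y{};
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) y[i] += A[i][j] * x[j];
    return y;
}

double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (int i = 0; i < N; ++i) s += a[i] * b[i];
    return s;
}

// Dense Cholesky factorization A = L L^T (stands in for the IC factor here).
Mat cholesky(const Mat& A) {
    Mat L{};
    for (int i = 0; i < N; ++i)
        for (int j = 0; j <= i; ++j) {
            double s = A[i][j];
            for (int k = 0; k < j; ++k) s -= L[i][k] * L[j][k];
            L[i][j] = (i == j) ? std::sqrt(s) : s / L[j][j];
        }
    return L;
}

// Approximate T y = r with n_sweeps scalar-Jacobi IR sweeps (M = diag(T)).
Vec tri_ir(const Mat& T, const Vec& r, int n_sweeps) {
    Vec y{};
    for (int s = 0; s < n_sweeps; ++s) {
        Vec res = r;
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j) res[i] -= T[i][j] * y[j];
        for (int i = 0; i < N; ++i) y[i] += res[i] / T[i][i];
    }
    return y;
}

// PCG where the preconditioner is applied via iterative triangular solves.
Vec pcg(const Mat& A, const Mat& L, const Vec& b, int n_sweeps, int max_it) {
    Mat Lt{};
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) Lt[i][j] = L[j][i];
    auto precond = [&](const Vec& v) {
        return tri_ir(Lt, tri_ir(L, v, n_sweeps), n_sweeps);
    };
    Vec x{}, r = b;
    Vec z = precond(r), p = z;
    double rz = dot(r, z);
    for (int k = 0; k < max_it; ++k) {
        Vec Ap = matvec(A, p);
        double alpha = rz / dot(p, Ap);
        for (int i = 0; i < N; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        if (std::sqrt(dot(r, r)) < 1e-12) break;
        z = precond(r);
        double rz_new = dot(r, z);
        double beta = rz_new / rz;
        rz = rz_new;
        for (int i = 0; i < N; ++i) p[i] = z[i] + beta * p[i];
    }
    return x;
}
```

The sweep count `n_sweeps` plays exactly the role studied in the experiments below: more sweeps give a more accurate preconditioner application at a higher cost per PCG iteration.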
In the next section, we compare the efficiency of the new IR-induced adaptive-precision block-Jacobi solver for solving triangular systems against using direct triangular solves as well as the IR-induced block-Jacobi method operating in fixed (double) precision.

4 Experimental Evaluation
4.1 Test Problems
This section collects results for 24 s.p.d. matrices from the SuiteSparse Matrix Collection [1] arising in real applications or artificial academic problems. The test matrices are listed in Table 1. For the block-Jacobi preconditioner, we use supervariable amalgamation [4] for identifying strongly connected components with a maximum block size of 16.
4.2 Hardware Setup
All results were collected on an NVIDIA Volta V100 GPU in the Summit supercomputer. This GPU contains 16 GB of HBM2 main memory and 80 streaming multiprocessors with 32 double precision units each. According to the hardware specifications, the peak performance for double precision operations is 7.8 TFLOP/s (i.e., \(7.8 \cdot 10^{12}\) floating-point operations per second), and the main memory bandwidth is 900 GB/s [8].
All results were gathered using the CUDA Toolkit 10.1.243 and Ginkgo 1.1.1. In particular, when we compare against exact triangular solves, Ginkgo interfaces to the cuSPARSE routine csrsm2.
As our complete codes run on the GPU, the features of the platform (CPU, main memory, PCI bus bandwidth, etc.) where the GPU resides are not relevant for the following analysis.
Fig. 1. Sparsity patterns of selected test matrices.
4.3 Numerical Experiments on HPC GPU
Initially, we want to investigate whether an iterative triangular solver based on block-Jacobi can be faster than a conventional direct triangular solver. For that purpose, we consider five sparse problems, finan512, Dubcova3, thermal1, ted_B and Muu, coming from different application scenarios; see Table 1 for the matrix characteristics and Fig. 1 for the sparsity patterns.
The results for this first experiment are shown in Fig. 2. The left-hand side column of plots in that figure displays the number of iterations a Conjugate Gradient method preconditioned with Incomplete Cholesky (ICCG) needs to converge depending on how the triangular systems are solved: either using exact triangular solves or, alternatively, block-Jacobi inside an IR method. Here, the ICCG method is considered to have converged when the norm of the relative residual, \(\Vert b - Ax_k \Vert _2 / \Vert b \Vert _2\), is less than \(10^{-16} \cdot \text {cond}\), where \(\text {cond}\) is the estimated condition number of A found in Table 1. As we can observe, when the number of sweeps of block-Jacobi IR per triangular solve is low, there is a significant increase in terms of ICCG iterations. When using more than 5 block-Jacobi IR sweeps though, the increase of ICCG iterations shrinks to a moderate level. A large number of block-Jacobi sweeps fully compensates for the approximate nature of the triangular solves, and retains the ICCG iteration count observed for the variant using exact triangular solves. We note that, in terms of convergence of the ICCG solver, there is no relevant difference between the realizations that integrate the adaptive-precision block-Jacobi and the fixed-precision block-Jacobi.
Fig. 2. Effect of the block-Jacobi sweep count on the iterative triangular solution in incomplete factorization preconditioning for the matrix problems finan512, Dubcova3, thermal1, ted_B and Muu (from top to bottom): PCG iteration count (left), time-to-solution (center), and speedup of using adaptive-precision block-Jacobi over fixed-precision block-Jacobi (right).
The center column of plots in Fig. 2 relates the number of block-Jacobi sweeps to the ICCG runtime. Here, the performance advantage of a block-Jacobi IR iteration over an exact triangular solver becomes visible: although the number of ICCG iterations is much higher when using only a few IR iterations, the total runtime can be much shorter than with a direct triangular solve. For the finan512, thermal1 and Dubcova3 problems, ICCG using block-Jacobi IR is faster than ICCG using exact triangular solves. For these large problems, we also observe significant runtime benefits when using the adaptive-precision block-Jacobi. For the much smaller ted_B problem, the block-Jacobi IR is faster if we perform more than 2 IR sweeps. Only for Muu, which is both small and well conditioned, does the approach using iterative triangular solves fail to beat the runtime achieved with direct triangular solves. Overall, we note that the optimal number of block-Jacobi IR sweeps is a problem-dependent parameter.
In the right-hand side column of plots in Fig. 2, we visualize the speedup of ICCG with the adaptive-precision block-Jacobi IR over its fixed-precision block-Jacobi IR counterpart. Where the adaptive-precision technique yields shorter runtimes than the fixed-precision technique, we observe that the runtime benefits generally grow with the number of block-Jacobi IR sweeps. This is expected, as increasing the sweep count enlarges the fraction of time spent in the triangular solves, which is where the use of adaptive precision reduces the volume of memory accesses and, therefore, runtime.
For finan512, thermal1 and Dubcova3, the runtime savings grow up to around 10%. This is in accordance with a previous experimental analysis of adaptive-precision block-Jacobi preconditioning [3].
Fig. 3. Speedup of one adaptive-precision block-Jacobi sweep versus a direct triangular solve of the L and U factors of the respective ILU decompositions (top) and the corresponding speedup versus one fixed-precision sweep (bottom). The maximum block size is 16. The time of one iteration is taken as the average over the first 100 iterations.
For the following experiment, we select a set of 24 test matrices where block-Jacobi IR is a viable option for the triangular solver. This selection recognizes that iterative triangular solves can fail to propagate the accuracy needed by ICCG to solve the problem. Furthermore, it also acknowledges that, depending on the matrix sparsity structure and the implied dependency tree, iterative triangular solves are not necessarily always faster than exact triangular solves [2]. For these problems (all those listed in Table 1), on the left-hand side plot in Fig. 3 we visualize the speedup of one adaptive-precision block-Jacobi sweep over the level-based direct triangular solve. In the right-hand side plot of the same figure, we display the speedup of the adaptive-precision over the fixed-precision block-Jacobi sweeps. While there are three problems where we suffer small slowdowns, for most of the selected matrices we obtain a speedup, and for some problems we save around 20% computing time.
Fig. 4. Optimal number of adaptive-precision block-Jacobi sweeps to minimize PCG runtime (left) and the soft optimization function T (right).
Next, in Fig. 4 we use the test set of 24 matrices to identify the optimal number of sweeps. Looking at the results on the left-hand side of Fig. 4, we observe that there is no overall optimal sweep count. In order to choose a reasonable default sweep count, we give each of our test matrices an index i and define \(\text {ICCG}(n_{\text {sweeps}}, i)\) as the ICCG runtime for matrix i using \(n_{\text {sweeps}}\) block-Jacobi sweeps in the triangular solves. With this, we define the normalized ICCG runtime for matrix i with \(n_{\text {sweeps}}\) block-Jacobi sweeps in the triangular solves,
$$\begin{aligned} t(n_{\text {sweeps}}, i)&:= \frac{\text {ICCG}(n_{\text {sweeps}}, i)}{\min _{m} \text {ICCG}(m, i)}, \end{aligned}$$(13)
as the relation between the actual and minimal ICCG runtime. We then look at the soft optimization function
$$\begin{aligned} T(n_{\text {sweeps}})&:= \sum _{i} t(n_{\text {sweeps}}, i). \end{aligned}$$(14)
The right-hand side of Fig. 4 shows that, for this problem test suite, T attains its smallest values at about 19 block-Jacobi sweeps per ICCG iteration.
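As a small illustration of this selection process, the following sketch evaluates the normalized runtime t and the soft optimization function T on hypothetical runtime data; the numbers are made up purely for the example, while the paper's actual measurements appear in Fig. 4.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical runtimes ICCG(n_sweeps, i) (in ms) for three matrices and
// four candidate sweep counts; the values are illustrative only.
constexpr int kMatrices = 3;
constexpr int kSweeps = 4;
using Runtimes = std::array<std::array<double, kMatrices>, kSweeps>;

// Normalized runtime t(n, i) = ICCG(n, i) / min_m ICCG(m, i).
double normalized_runtime(const Runtimes& rt, int n, int i) {
    double best = rt[0][i];
    for (int m = 1; m < kSweeps; ++m) best = std::min(best, rt[m][i]);
    return rt[n][i] / best;
}

// Soft optimization function T(n) = sum_i t(n, i).
double soft_objective(const Runtimes& rt, int n) {
    double sum = 0.0;
    for (int i = 0; i < kMatrices; ++i) sum += normalized_runtime(rt, n, i);
    return sum;
}

// Index of the candidate sweep count minimizing T over the whole test set.
int best_sweep_index(const Runtimes& rt) {
    int best = 0;
    for (int n = 1; n < kSweeps; ++n)
        if (soft_objective(rt, n) < soft_objective(rt, best)) best = n;
    return best;
}
```

The sweep count minimizing T balances all matrices at once: it may be optimal for none of them individually, but it keeps the normalized runtime close to 1 across the whole test set.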
Finally, we compare the optimal number of block-Jacobi IR sweeps for each matrix and the default setting of 19 sweeps to quantify the benefits of using adaptive-precision in the block-Jacobi IR over using fixed-precision block-Jacobi IR. In Fig. 5, we report the speedup of ICCG equipped with adaptive-precision block-Jacobi IR for the triangular solves over ICCG using fixed-precision block-Jacobi IR for the triangular solves. We first focus on the problem-specific optimization of the block-Jacobi IR sweep count. For this setting (minimizing the ICCG overall runtime), we observe that, for about two thirds of the problems, the use of adaptive precision has only a negligible impact or none at all. For roughly one third of the problems, we have single-digit speedups over the fixed-precision usage.
If we fix the number of block-Jacobi IR sweeps to 19 (not considering the problem-specific optimization reducing the ICCG runtime), the benefits of using adaptive-precision over fixed-precision are generally larger. This implies that, when using a default setting, which is realistic for practical use, adopting an adaptive-precision block-Jacobi instead of a fixed-precision block-Jacobi yields significant benefits.
Fig. 5. Speedup of ICCG with adaptive-precision block-Jacobi vs. fixed-precision block-Jacobi, for the optimal number of sweeps for each matrix (top) and for 19 sweeps per iteration, as obtained from evaluating the soft optimization function T (bottom).
5 Summary
In this work, we have investigated the benefits of using adaptive precision block-Jacobi for iteratively solving the triangular systems arising in incomplete factorization preconditioning. This was accomplished by linking relaxation methods to iterative refinement and preconditioners. Using the Ginkgo open source library for numerical linear algebra, we set up a Conjugate Gradient (CG) solver using an incomplete Cholesky decomposition preconditioner which replaces exact triangular solves with iterative triangular solves based on block-Jacobi iterative refinement.
A performance comparison on a high-end GPU revealed that using adaptive-precision block-Jacobi iterations is generally faster than using fixed-precision block-Jacobi iterations.
In accordance with [4], we emphasize that iterative triangular solves based on (adaptive-precision) block-Jacobi are not always the fastest option. However, if block-Jacobi iterations are a valid option for solving the triangular systems in incomplete factorization preconditioning, it is likely that performance can be improved by replacing fixed-precision block-Jacobi with our adaptive-precision block-Jacobi method.
References
SuiteSparse Matrix Collection. https://sparse.tamu.edu (2020). Accessed Jan 2020
Anzt, H., Chow, E., Dongarra, J.: Iterative sparse triangular solves for preconditioning. In: Träff, J.L., Hunold, S., Versaci, F. (eds.) Euro-Par 2015. LNCS, vol. 9233, pp. 650–661. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-48096-0_50
Anzt, H., Dongarra, J., Flegar, G., Higham, N.J., Quintana-Ortí, E.S.: Adaptive precision in block-Jacobi preconditioning for iterative sparse linear system solvers. Concurrency Comput. Pract. Experience 31(6), e4460 (2019)
Chow, E., Anzt, H., Scott, J., Dongarra, J.: Using Jacobi iterations and blocking for solving sparse triangular systems in incomplete factorization preconditioning. J. Parallel Distrib. Comput. 119, 219–230 (2018)
Flegar, G., Anzt, H., Cojean, T., Quintana-Ortí, E.S.: Customized-precision block-Jacobi preconditioning for Krylov iterative solvers on data-parallel manycore processors. ACM Trans. Math. Softw. (2020), under review. Available from the authors
Goebel, F., Anzt, H., Quintana-Ortí, E.S., Cojean, T., Flegar, G.: Instructions to generate experimental results for conference proceedings 2020 paper: Multiprecision block-Jacobi for Iterative Triangular Solves, July 2020. https://doi.org/10.6084/m9.figshare.12562154
Higham, N.J.: Accuracy and Stability of Numerical Algorithms, 2nd edn. SIAM, Philadelphia (2002)
NVIDIA Corporation: Whitepaper: NVIDIA TESLA V100 GPU ARCHITECTURE (2017)
Saad, Y.: Iterative Methods for Sparse Linear Systems, 2nd edn. SIAM, Philadelphia (2003)
Acknowledgments and Data Availability Statement
Hartwig Anzt, Fritz Göbel, and Terry Cojean were supported by the “Impuls und Vernetzungsfond” of the Helmholtz Association under grant VH-NG-1241. G. Flegar and E. S. Quintana-Ortí were supported by project TIN2017-82972-R of the MINECO and FEDER and the H2020 EU FETHPC Project 732631 “OPRECOMP”.
The datasets and code generated during and/or analysed during the current study are available in the Figshare repository: https://doi.org/10.6084/m9.figshare.12562154 [6].
© 2020 Springer Nature Switzerland AG
Goebel, F., Anzt, H., Cojean, T., Flegar, G., Quintana-Ortí, E.S. (2020). Multiprecision Block-Jacobi for Iterative Triangular Solves. In: Malawski, M., Rzadca, K. (eds) Euro-Par 2020: Parallel Processing. Euro-Par 2020. Lecture Notes in Computer Science, vol 12247. Springer, Cham. https://doi.org/10.1007/978-3-030-57675-2_34