Multigrid-preconditioned solvers have proven crucial for the efficient generation of ensembles of gauge configurations at physical quark mass parameters. A high-performance implementation of such a solver for GPUs from different vendors and for different types of Wilson fermions is provided in the QUDA library. It features an autotuner which chooses an optimal communication policy and an optimal set of kernel launch parameters for each kernel, problem size and domain decomposition on each architecture. The performance of the multigrid solver additionally depends on a large number of algorithmic parameters such as block sizes, numbers of vectors, maximum iteration counts and convergence thresholds. Many of these parameters have to be tuned on a per-level basis, which makes the search space large and an exhaustive search computationally intractable in practice. In addition, a parameter set that performs well on one machine or for one domain decomposition will generally not be optimal on another. We present a simple autotuner for these parameters, implemented in the tmLQCD software suite, which requires only some intuition about the order in which the parameters are to be tuned and the step sizes to be used in the tuning procedure. This simple approach converges quickly and produces stable iteration counts and times-to-solution across the gauge configurations of a given ensemble. Using results from machines with NVIDIA and AMD GPUs, we demonstrate its applicability to the adjustment of multigrid setups between different machines or different sets of physical parameters.
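The tuning idea described above, adjusting one parameter at a time in a user-chosen order and with user-chosen step sizes, can be illustrated with a minimal sketch. This is an assumption-laden toy and not the tmLQCD implementation: the function `tune_sequentially`, the parameter names `coarse_maxiter` and `coarse_tol`, and the synthetic `toy_cost` are all hypothetical, and in a real setting the cost would be a measured time-to-solution of the multigrid solve.

```python
from typing import Callable, Dict, List, Tuple

Params = Dict[str, float]


def tune_sequentially(
    cost: Callable[[Params], float],
    start: Params,
    schedule: List[Tuple[str, float]],
    max_steps: int = 20,
) -> Tuple[Params, float]:
    """Greedy one-parameter-at-a-time search (hypothetical sketch).

    `schedule` lists (parameter name, step size) pairs in the order in
    which the parameters are tuned. Each parameter is stepped up, then
    down, for as long as the cost keeps decreasing.
    """
    best = dict(start)
    best_cost = cost(best)
    for name, step in schedule:
        for direction in (+1.0, -1.0):
            for _ in range(max_steps):
                trial = dict(best)
                trial[name] = best[name] + direction * step
                trial_cost = cost(trial)
                if trial_cost >= best_cost:
                    break  # no improvement: stop moving in this direction
                best, best_cost = trial, trial_cost
    return best, best_cost


def toy_cost(p: Params) -> float:
    # Synthetic stand-in for a measured time-to-solution (assumption);
    # its minimum lies at coarse_maxiter = 4, coarse_tol = 0.25.
    return (p["coarse_maxiter"] - 4.0) ** 2 + 10.0 * (p["coarse_tol"] - 0.25) ** 2 + 1.0


start = {"coarse_maxiter": 8.0, "coarse_tol": 0.5}
schedule = [("coarse_maxiter", 1.0), ("coarse_tol", 0.125)]
best, best_cost = tune_sequentially(toy_cost, start, schedule)
```

Such a search needs only a number of cost evaluations that grows roughly linearly with the number of parameters, in contrast to an exhaustive scan of the full grid, but it can stall in a local minimum when parameters are strongly coupled, which is why the tuning order and step sizes matter.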
