We present how we ported the Hybrid Monte Carlo implementation in the tmLQCD software suite to GPUs by offloading its most expensive parts to the QUDA library.
We discuss our motivations and some of the technical challenges that we encountered as we added the required functionality to both tmLQCD and QUDA.
We further present some performance details, focussing in particular on the usage of QUDA's multigrid solver for poorly conditioned light quark monomials and of the multi-shift solver for the non-degenerate strange and charm sector in $N_f=2+1+1$ simulations using twisted mass clover fermions, and we compare the efficiency of state-of-the-art simulations on CPU and GPU machines.
We also take a look at the performance-portability question through preliminary tests of our HMC on a machine based on AMD's MI250 GPU, finding good performance after a very minor additional porting effort.
Finally, we conclude that we should be able to achieve GPU utilisation factors acceptable for the current generation of (pre-)exascale supercomputers, with substantial efficiency improvements and real-time speedups compared to running on CPUs alone.
At the same time, we find that future challenges will require different approaches and, most importantly, a very significant investment of personnel for software development.