Three Dirac operators on two architectures with one piece of code and no hassle
May 29, 2019
A simple-minded approach to implementing three discretizations of the Dirac operator (staggered, Wilson, Brillouin) on two architectures (KNL and core-i7) is presented. The idea is to use a high-level compiler along with OpenMP parallelization and SIMD pragmas, but to stay away from cache-line optimization and/or assembly tuning. The implementation is for $N_v$ right-hand sides, and this extra index is used to fill the SIMD pipeline. On one KNL node, the single-precision performance figures for $N_c=3$, $N_v=12$ are 475 Gflop/s, 345 Gflop/s, and 790 Gflop/s for the staggered, Wilson, and Brillouin discretizations, respectively.
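To illustrate the idea of filling the SIMD lanes with the right-hand-side index, here is a minimal sketch in C. It is not the code described in the abstract: the operator is a hypothetical one-dimensional nearest-neighbour stencil on a periodic lattice rather than a Dirac operator, and the names `apply_toy_stencil` and `NV` are assumptions made for this example. It shows only the loop structure: OpenMP threads over lattice sites, and a SIMD pragma on the innermost loop over the $N_v$ right-hand sides, which are stored contiguously.

```c
#include <assert.h>
#include <stddef.h>

#define NV 12 /* number of right-hand sides (assumed fixed at compile time) */

/* Hypothetical toy operator on a periodic 1D lattice with n sites:
 *   out(x,v) = 2*in(x,v) - in(x+1,v) - in(x-1,v)
 * Fields are stored as flat arrays with the right-hand-side index v
 * running fastest, i.e. element (x,v) lives at index x*NV + v. */
void apply_toy_stencil(size_t n, const float *in, float *out)
{
    assert(n > 0);

    /* Thread-level parallelism over lattice sites. */
    #pragma omp parallel for
    for (size_t x = 0; x < n; ++x) {
        size_t xp = (x + 1) % n;     /* forward neighbour, periodic  */
        size_t xm = (x + n - 1) % n; /* backward neighbour, periodic */

        /* The right-hand-side index is contiguous in memory, so the
         * compiler can map this loop onto SIMD lanes. */
        #pragma omp simd
        for (int v = 0; v < NV; ++v) {
            out[x * NV + v] = 2.0f * in[x * NV + v]
                            - in[xp * NV + v]
                            - in[xm * NV + v];
        }
    }
}
```

Compiled without OpenMP support, the pragmas are ignored and the function runs serially with the same result. Because the arithmetic is identical for every right-hand side, the inner loop has no dependencies between iterations, which is what makes the extra index a natural candidate for vectorization.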