[HPGMG Forum] submitting HPGMG-FV results
swwilliams at lbl.gov
Mon Jan 26 19:09:47 UTC 2015
I invite anyone interested in submitting HPGMG-FV results on their systems for the HPGMG rankings to email them to me (SWWilliams at lbl.gov). Of particular interest are results on manycore or accelerated systems in which the ratio of memory capacity to memory bandwidth is substantially different from commodity CPU systems, whose ratios (bytes divided by bytes/s) are often on the order of 1 second.
There have been a few questions on submission guidelines. I'm including the current guidelines below; they will also be maintained at http://crd.lbl.gov/departments/computer-science/performance-and-algorithms-research/research/hpgmg/
* Submissions must run the variable-coefficient 7-point Poisson problem using the Chebyshev smoother, Dirichlet boundary conditions, and an F-Cycle (the defaults). Submissions may not change the number of smooths or the degree of the polynomial in operators.7pt.c, and may not use periodic boundary conditions. Legal tuning options are described below.
* Submissions must include the full text output of the HPGMG-FV benchmark (from the MPI threading model line through the ||error|| analysis).
* Currently, any problem size is acceptable for peak performance (thus ./hpgmg 8 16 is perfectly acceptable). We are evaluating the effect of reduced memory capacity. As such, we request (but do not mandate) that submissions include the output for smaller box sizes (problems with grid spacings of 2h and 4h) by simply running (continuing the example above) ./hpgmg 7 16 and ./hpgmg 6 16. Currently, we will present the best performance (the DOF/s printed in the results).
* Submissions may use any combination of threads and processes.
* Submissions may use the -DSTENCIL_FUSE_DINV -DSTENCIL_FUSE_BC options. This may change in the future if we decide on a more accurate/complex boundary condition.
* Submissions may vary the -DBLOCKCOPY_TILE_* tiling options and -DBOX_ALIGN_* padding options to find whichever combination works best for the target machine. For flat MPI, the former enables cache blocking (tiling). The default 8x8x10000 (K,J,I) worked well on smallish problems on most CPUs. However, on Xeon Phi, I found 4x4x10000 worked better, while on BGQ, I found 1x32x10000 worked better.
* If you use XL/C on a BGQ system, you may modify norm() as discussed on the hpgmg mailing list.
* Ports of HPGMG-FV to other programming models and architectures must conform to the functionality and spirit of the MPI+OpenMP reference implementation, and should produce identical results in the absence of a Krylov bottom solver or of rounding differences between discrete multiply/add instructions and a fused multiply-add instruction.
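As a sketch of the problem-size guideline above, the loop below sweeps the three suggested sizes (h, 2h, and 4h) and prints each box's dimensions and degrees of freedom. It assumes the first argument to hpgmg is log2 of the box dimension, inferred from the ./hpgmg 8 16 / 7 16 / 6 16 examples in this post; the actual run line is commented out because the launcher is system-specific.

```shell
# Sweep the h, 2h, and 4h problem sizes requested above.
# Assumption: hpgmg's first argument is log2 of the box dimension,
# inferred from the ./hpgmg 8 16 / 7 16 / 6 16 examples in this post.
for log2_dim in 8 7 6; do
  dim=$((1 << log2_dim))
  echo "log2_dim=$log2_dim box=${dim}^3 dof_per_box=$((dim * dim * dim))"
  # ./hpgmg "$log2_dim" 16   # actual run; launcher/rank count omitted here
done
```

Note that halving the box dimension (8 to 7 to 6) cuts the per-box memory footprint by 8x, which is what makes the 2h and 4h runs useful for probing reduced-capacity behavior.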
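The tiling guideline above amounts to a small compile-time sweep. As a hypothetical helper, the loop below emits one candidate set of -DBLOCKCOPY_TILE_* flags per (K,J,I) triple mentioned in this post (the default, the Xeon Phi setting, and the BGQ setting); how the flags are spliced into the build line is left to your makefile.

```shell
# Emit candidate -DBLOCKCOPY_TILE_* flag sets to try at compile time.
# The (K J I) triples are the three tilings mentioned in this post.
for tile in "8 8 10000" "4 4 10000" "1 32 10000"; do
  set -- $tile   # split the triple into $1=K, $2=J, $3=I
  echo "-DBLOCKCOPY_TILE_K=$1 -DBLOCKCOPY_TILE_J=$2 -DBLOCKCOPY_TILE_I=$3"
done
```

Keeping the I (unit-stride) tile large, as all three settings do, preserves long contiguous streams through memory while K and J control the cache-blocking footprint.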
More information about the HPGMG-Forum