[HPGMG Forum] updated HPGMG-FV code

Sam Williams swwilliams at lbl.gov
Sat Dec 20 23:01:26 UTC 2014

I pushed a number of updates to the HPGMG-FV code today.

(1) I added a thread-friendly implementation of the boundary conditions.  By default, these are linear, but the code supports quadratic as well.  However, D^{-1} and the dominant eigenvalue are calculated assuming linear boundary conditions.  There are thus three optimization levels for the boundary conditions/D^{-1}, selected via CFLAGS...

--CFLAGS="-fopenmp"	// Thread friendly with minimal data movement & flops, but complex memory access patterns.
--CFLAGS="-fopenmp -DSTENCIL_FUSE_BC"	// Thread friendly with simple memory access patterns, but extra data movement & flops.
--CFLAGS="-fopenmp -DSTENCIL_FUSE_BC -DSTENCIL_FUSE_DINV"	// Thread friendly with simple memory access patterns and essentially no extra data movement, but performs far more flops (including a divide per stencil).

The optimal configuration is likely architecture-specific, depending on the balance of floating-point, cache, and memory performance.  On Edison (Cray XC30), the new code is 10-15% faster than the fused variants.

(2) There is an issue with XL/C's parsing of the OpenMP max reduction clause when embedded inside the _Pragma operator (a bug report was submitted to ALCF).  To avoid this, if the code detects XL/C, it won't thread max reductions (i.e. max norm calculations) and will perform them sequentially.  This has an obvious performance penalty when running with only one process per node on Blue Gene/Q.  If you wish to quantify the impact, you can manually re-insert
 #pragma omp parallel for private(block) if(level->num_my_blocks>1) schedule(static,1) reduction(max:max_norm) 
in the norm() function in operators/misc.c.  In conjunction with this compiler-specific optimization, I observed that the new code can deliver approximately 20-30% better performance on BGQ at 1K nodes (64K threads).

(3) For those interested in using HPGMG-FV as a research vehicle outside of Top500 rankings, I added 13- and 27-point constant-coefficient operators for flavor.  Both presume periodic boundary conditions when calculating D^{-1}.  As such, you might consider compiling with "-DUSE_PERIODIC_BC -DUSE_HELMHOLTZ".  You can select the operator by replacing operators.7pt.c with operators.27pt.c when compiling.  These operators move less data and perform more FP operations per stencil.  The 13-point operator also doubles the MPI bandwidth requirements.  This tends to make them more compute- and network-limited for a fixed problem size than the variable-coefficient 7-point operator.

(4) I previously observed issues with MVAPICH on Stampede wherein realloc() would frequently fail.  To avoid this, I added a macro, BLOCK_LIST_MIN_SIZE, which can be used to minimize the number of calls to realloc(): if the block list is empty, appending to it will create an initial list of BLOCK_LIST_MIN_SIZE entries.  The default is 1000, but if you run into realloc() failures with MVAPICH, you can compile with -DBLOCK_LIST_MIN_SIZE=10000 or larger.  Separately, the blocking/tiling was generalized to block in the unit-stride (i) dimension.  This can increase coarse-grained thread parallelism by an order of magnitude at the expense of sequential/spatial locality.
