Status

Technical Report

An LBNL Technical Report is available.

Implementations

There are two implementations in the Git repository. They are separate code bases and currently have different user interfaces. The large-scale algorithmic and communication structure is similar, but the local operations have very different computational characteristics.

Finite Volume

Sam Williams developed HPGMG-FV, a finite-volume full multigrid (FMG) solver for the homogeneous, variable-coefficient Laplacian. Several smoothers are available, including red-black Gauss-Seidel, block-Jacobi Gauss-Seidel, and Chebyshev. Arithmetic intensity for the present 7-point stencil can be varied between 0.59 and 0.97 flops/byte by recomputing certain quantities rather than reloading them from DRAM. HPGMG-FV weak scales at 3.5% to 6.4% of peak on Mira, Argonne’s IBM Blue Gene/Q. Moreover, HPGMG-FV can differentiate supercomputers by both processor architecture and network architecture. For example, one can observe up to a 3x performance difference between NERSC’s XC30 Edison and NREL’s Peregrine when using $128^3$ elements per CPU socket, while HPL shows only a 20% difference between these machines. Both machines use an identical Xeon-based node architecture, but where Peregrine uses an InfiniBand fat-tree network, Edison uses a Cray Aries Dragonfly network.
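The red-black ordering mentioned above can be sketched as follows for a constant-coefficient 7-point Laplacian on a cubic grid with a fixed boundary layer. This is a simplified illustration, not HPGMG-FV's implementation, which additionally handles variable coefficients and a distributed box decomposition.

```c
/* One red-black Gauss-Seidel half-sweep for -laplacian(u) = f on an n^3
 * grid with spacing h; the outermost layer of the array is held fixed as
 * the boundary. Illustrative sketch only. */
#include <stddef.h>

#define IDX(i, j, k, n) ((size_t)(i)*(n)*(n) + (size_t)(j)*(n) + (size_t)(k))

void gsrb_sweep(double *u, const double *f, int n, double h, int color) {
  for (int i = 1; i < n - 1; i++)
    for (int j = 1; j < n - 1; j++)
      for (int k = 1; k < n - 1; k++) {
        if (((i + j + k) & 1) != color) continue;  /* update one color per sweep */
        double nbr = u[IDX(i-1,j,k,n)] + u[IDX(i+1,j,k,n)]
                   + u[IDX(i,j-1,k,n)] + u[IDX(i,j+1,k,n)]
                   + u[IDX(i,j,k-1,n)] + u[IDX(i,j,k+1,n)];
        u[IDX(i,j,k,n)] = (h*h*f[IDX(i,j,k,n)] + nbr) / 6.0;
      }
}
```

A full smoother application is one red sweep (`color = 0`) followed by one black sweep (`color = 1`). Because all updates within one color are independent, each half-sweep can be threaded and vectorized; in the variable-coefficient case, recomputing versus reloading the coefficient terms is the kind of choice that moves the arithmetic intensity.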

To do

  • Encapsulate threading implementation and tuning choices to facilitate examination of alternatives to OpenMP 3.1.
  • Optimized implementations for Intel’s Xeon Phi (native and offload modes) and NVIDIA’s GPUs (using CUDA).
  • Add other stencils to further vary arithmetic intensity.

Finite Element

Jed Brown developed HPGMG-FE, a finite-element full approximation scheme (FAS) FMG solver for constant- and variable-coefficient elliptic problems on mapped coordinates using $Q_1$ or $Q_2$ elements. We prefer the case of quadratic elements, for which the discretization has 3rd-order accuracy in $L^2$ and 4th-order superconvergence at vertices. Chebyshev smoothers are used, and FMG convergence is observed with a V(3,1) cycle, requiring a total of 5 fine-grid operator applications. Arithmetic intensity can be varied from 2.4 flops/byte to 20 flops/byte or higher. The mapped coordinates and local arithmetic structure require more memory streams and draw out the conflicting demands of memory locality and vectorization. HPGMG-FE weak scales at 23% of peak on Edison. This implementation currently requires PETSc.

To do

  • Add elasticity and Gauss-Lobatto quadrature variants
  • Remove PETSc dependency
  • OpenMP local kernels
  • Optimized implementations for Xeon Phi and CUDA