# Technical Report

An LBNL Technical Report is available.

# Implementations

There are two implementations in the Git repository. They are separate code bases and currently have different user interfaces. The large-scale algorithmic and communication structure is similar, but the local operations have very different computational characteristics.

## Finite Volume

Sam Williams developed HPGMG-FV, a finite-volume full multigrid (FMG) solver for the homogeneous, variable-coefficient Laplacian. Several smoothers are available, including red-black Gauss-Seidel, block-Jacobi Gauss-Seidel, and Chebyshev. For the present 7-point stencil, arithmetic intensity can be varied between 0.59 and 0.97 flops/byte by recomputing certain quantities rather than reloading them from DRAM. HPGMG-FV weak scales at 3.5% to 6.4% of peak on Mira, Argonne’s IBM Blue Gene/Q. Moreover, HPGMG-FV can differentiate supercomputers by both processor architecture and network architecture. For example, one can observe up to a 3x performance difference between NERSC’s Cray XC30 Edison and NREL’s Peregrine when using $128^3$ elements per CPU socket, whereas HPL shows only a 20% difference between these machines. Both machines use an identical Xeon-based node architecture, but Peregrine uses an InfiniBand fat-tree network while Edison uses a Cray Aries Dragonfly network.

### To do

- Encapsulate threading implementation and tuning choices to facilitate examination of alternatives to OpenMP 3.1.
- Add optimized implementations for Intel’s Xeon Phi (native and offload modes) and NVIDIA’s GPUs (using CUDA).
- Add other stencils to further vary arithmetic intensity.

## Finite Element

Jed Brown developed HPGMG-FE, a finite-element full approximation scheme (FAS) FMG solver for constant- and variable-coefficient elliptic problems on mapped coordinates using $Q_1$ or $Q_2$ elements. We prefer the case of quadratic elements, for which the discretization has 3rd-order accuracy in $L^2$ and 4th-order superconvergence at vertices. Chebyshev smoothers are used, and FMG convergence is observed using a V(3,1) cycle, for a total of 5 fine-grid operator applications to converge. Arithmetic intensity can be varied from 2.4 flops/byte to 20 flops/byte or higher. The mapped coordinates and local arithmetic structure require more memory streams and draw out the conflicting demands of memory locality and vectorization. HPGMG-FE weak scales at 23% of peak on Edison. This implementation currently requires PETSc.

### To do

- Add elasticity and Gauss-Lobatto quadrature variants.
- Remove PETSc dependency.
- Add OpenMP to local kernels.
- Add optimized implementations for Xeon Phi and CUDA.