# Method

HPGMG-FE solves constant- and variable-coefficient elliptic problems on deformed meshes using FMG. It uses $Q_2$ elements and evaluates all operators matrix-free, exploiting tensor-product structure. This structure is similar to that used in spectral element methods (multiple Gordon Bell prizes), but is still beneficial for $Q_2$ elements. The method is third-order accurate in $L^2$, as demonstrated by the FMG convergence. (For linear problems, the discretization is 4th-order superconvergent at nodes, but we do not exploit this property.) Interpolation and restriction are done using the natural embedding of finite-element spaces, with tensor-product optimization. On coarse levels, the active process set is restricted using Z-order so that interpolation and restriction mostly involve processes that will be "nearby" on a computer. Chebyshev polynomials are used for smoothing, preconditioned by the diagonal. FMG convergence is observed with a V(3,1) cycle, thus convergence is reached in a total of 5 fine-grid operator applications.

# Rationale

The present design of HPGMG-FE was influenced by a number of considerations about performance and application workloads.

## Finite elements

The finite-element (FE) method is not the simplest approach to discretizing a PDE on a regular grid. However, it is popular due to its flexibility and suitability for libraries and frameworks. Its distinguishing feature relative to standard finite-difference (FD) and finite-volume (FV) methods is that work is shared between degrees of freedom (dofs). Since FD and FV are so similar for simple elliptic equations on structured grids, we describe the effect in terms of FD methods, for which dofs are stored at vertices of the mesh. For FD, the residual at each vertex is straightforward to compute directly based on the state at vertices in a neighborhood. Most or all of the computation for the residual at an adjacent vertex shares no common intermediate results, thus parallelization via a simple vertex-partition (dof-partition) has little or no compute overhead relative to the best possible sequential implementation. Incidentally, this is not the case for more advanced FD and FV discretizations, such as slope- or flux-limited upwind methods and those with complicated material models. For example, MUSCL, a TVD FV method, would use neighbors to reconstruct a gradient, then use the result to evaluate at faces, solve a Riemann problem at the faces, and sum the result back into cells. Parallelization via a simple dof-partition would require redundant computation of reconstructed gradients and redundant Riemann solves in the overlap. In strong-scaling scenarios, the subdomains are small enough that such redundant computation can more than double the cost.

Finite-element methods possess this attribute because computation is associated with cells while dofs are associated with vertices. Parallelization via dof-partition would involve redundant computation over elements in the overlap. For example, consider an $8\times 8\times 8$-element subdomain, which for $Q_2$ elements would imply ownership of 4096 dofs for a scalar problem. If computation is arranged for a non-overlapping dof-partition, the cost is $9^3/8^3 = 1.42$ times that of a non-overlapping element partition. This is several times larger than the optimal subdomain size for coarse levels (i.e., operator application is still strong-scaling, albeit at reduced efficiency), so dof-partition overhead is even higher on coarse levels. Computing with a non-overlapping element partition (instead of dof-partition) involves overlapping writes to the vertex-based residuals. There are many ways to coordinate these overlapping writes, each with storage/compute/latency tradeoffs. We believe it is valuable to encourage implementations to use these techniques rather than a simple dof-partition.

## $Q_2$ elements

Spectral-element applications such as Nek5000, SPECFEM, and HOMME are presently used in machine procurements, have won performance prizes, and represent a significant fraction of HPC computation. On the other hand, low-order methods are overwhelmingly chosen for structural mechanics and heterogeneous materials such as porous media, also representing a significant fraction of HPC computation for research and industry. High-order methods vectorize readily and are more compute-intensive (flops and instruction issue) while low-order methods are more dependent on memory bandwidth and less regular memory access. $Q_2$ elements strike a balance between classical spectral-element efficiency (demonstrated at $Q_3$ and higher) and low-order methods. Additionally, $Q_2$ elements have a larger penalty for redundant computation than $Q_1$ elements, thus further emphasizing thread synchronization.

For $Q_3$ and higher order elements, it is currently possible to achieve high performance vectorizing within each element. This is no longer true for $Q_2$, where it becomes necessary to vectorize across elements. Vectorizing across elements is simpler code to write since it needs no cross-lane instructions, but increases the working set size, which should fit within L1 cache for maximum efficiency. If vendors increase the length of vector registers or add more hardware threads without increasing cache, performance on $Q_2$ elements will drop unless they can effectively vectorize within elements (i.e., number of elements in working set smaller than length of vector register). That would require more cross-lane operations and is more complicated to implement. Few of today's applications can effectively utilize vector registers today, so the benchmark should not reward them for hardware that goes nearly-unused outside of dense linear algebra.

## Full approximation scheme (FAS)

FAS is a technique primarily used for solving nonlinear problems. It involves somewhat more vector work and one more coarse-grid operator application relative to the conventional correction scheme. Although this benchmark only solves linear problems, FAS enables an inexpensive check for consistency between levels (potentially useful for soft error detection) and forms the basis for advanced multigrid techniques (such as "frozen $\tau$" and segmental refinement). We also believe that nonlinear multigrid methods will become more important as computer architectures continue to evolve. Finally, the modest increase in vector work is representative of acceleration techniques such as Krylov, time integrators such as Runge-Kutta, and various analysis techniques. This increases the demand for memory bandwidth, somewhat balancing our compute-intensive operators.

## Linear or nonlinear?

A benchmark should have similar computational characteristics at all scales. Nonlinearities exhibit different behavior at different scales, so convergence rate would depend on resolution for any strongly nonlinear operator. In order to maintain a strong theoretical foundation for convergence independent of resolution, we have chosen to focus on linear problems. This choice could be revisited.

Evaluating nonlinearities often involves special functions, polynomial or rational fits, table lookups, and iteration (e.g., implicit plasticity model, implicit equation of state, some Riemann solvers). The first two are compute-limited, table lookups are more data-intensive, and the last creates a load-balancing challenge because the iteration may converge at different rates. Since the characteristic of table lookups depends so strongly on the table size and interpolation scheme, we are reluctant to utilize a table lookup in the benchmark. We would love to include the irregularity imposed by iteration, but have not found a clean way to include it without compromising other values of the benchmark. Please write the mailing list if you know of a way.

## Why mapped coordinates?

Mapped coordinates account for a large fraction of data motion in the scalar solver. Assuming affine coordinates would reduce this data dependency and simplify the kernels somewhat. We like mapped coordinates because many real applications have mapped coordinates (e.g., for curved boundary layers or interface-tracking) and/or stored coefficients, and because evaluating the metric terms increases the computational depth of each element. Some vendors have "cut corners" by supporting only a small number of prefetch streams or having high-latency instructions with very small caches. These cost-cutting measures are okay for some benchmarks, but impact performance for many applications. Therefore, including coordinate transformations in the benchmark reward vendors for creating balanced, versatile architectures.

## Quantifying dynamic range

HPGMG-FE samples across a range of problem sizes, from the largest that can fit in memory to the limit of strong scalability. This is done in order to quantify "dynamic range", the ratio of the largest problem size to the smallest that can be solved "efficiently" (we set this threshold at 50% efficiency relative to the largest size). Every HPC center has some users that are focused on throughput and some that are focused on turn-around time. For almost any reasonable architecture and scalable algorithm, filling memory will provide enough on-node work to amortize communication costs. A benchmark that is run only in the full-memory configuration will be representative for the throughput-oriented users, but says nothing for those interested in turn-around time. When combined with peak performance, the dynamic range quantifies the ability to appease both user bases. Note that since some applications have different phases with each characteristic, it is not sufficient to build different machines for each type of workload. A benchmark that rewards dynamic range will encourage versatility machines rather than those configured with just enough memory to balance communication latency for the benchmark (e.g., replace expensive network with a cheap network and add an extra DIMM on each socket to achieve the same benchmark score).

## Computation of the diagonal

Computation of the diagonal has not been optimized and is not a timed part of the benchmark. Although embarrassingly parallel, its cost is similar to that of the entire FMG solve. Optimizing this operation is relatively uninteresting, therefore we pre-compute it. While this may be justified for linear problems, especially those that must be solved multiple times, the cost is very meaningful for nonlinear multigrid solvers based on finite-element methods.

# Possible variants

The computational characteristics of HPGMG-FE can be adjusted significantly while retaining the same overall structure.

## Stored tensor-valued coefficients

Variable coefficients with anisotropy can be defined via coefficients stored at quadrature points. In general, this will be a $3\times 3$ symmetric tensor (6 unique entries), but structure such as $\alpha\mathbf 1 + \mathbf w \otimes \mathbf w$, where $\mathbf w$ is a 3-component vector, can arise from Newton-linearization of nonlinear constitutive relations (a common source of such linear problems). Since the coefficients are defined at quadrature points, they have no opportunity for reuse. Some balance of reusable memory streams and non-reusable memory streams is common in applications, especially those using assembled sparse matrices. Some architectures have used simplified cache eviction policy (e.g., FIFO on BG/L and BG/P) which perform poorly because the reusable cache lines are flushed out by cache lines that will not be reused. A memory access pattern containing both reusable and non-reusable streams will favor more versatile cache and prefetch architectures.

### Absorb metric terms into stored coefficients

For the scalar problem, the metric terms can be absorbed into a symmetric $3\times 3$ tensor at each quadrature point, removing the need to compute the transformations on the fly. If general coefficients are already being used, the metric terms simply transform the existing coefficient. If not, the asymptotic memory per element increases from $3\cdot 8 = 24$ values to $6\cdot 27=162$ values.

## Elasticity

The scalar problem in current implementations can be replaced by a vector-valued problem such as elasticity. If mapped grids are used for both, this does not significantly change the arithmetic intensity, but does increase cache utilization and reduces the relative cost of coordinate transformations. General anisotropy for elasticity involves a rank-4 tensor that contains 21 unique entries (due to symmetry). If the tensor coefficient arises from Newton-linearization of a strain-hardening material, the tensor can be represented using 7 unique entries. Trading on-the-fly computation of the coordinate transformations for stored coefficients is thus significantly higher overhead unless the linear problem already had general anisotropy. Since much of the work in all implementations is in small tensor contractions, only one kernel (operating on three fields) must be optimized. However, we believe that the current implementation is well-optimized despite using the same source code for both scalar and 3-field evaluation. Elasticity may impact convergence rate relative to Poisson, though it can be maintained for closely-related vector-valued problems so we don't see this as a major concern.