[HPGMG Forum] Do we want the benchmark to go into intrinsics?
swwilliams at lbl.gov
Tue Apr 29 17:06:49 UTC 2014
Bad compilers should be shamed. Having it autodetect __bgq__ (or whatever) and run the best manually optimized implementation by default hides deficiencies.
On Apr 29, 2014, at 9:56 AM, Mark Adams <mfadams at lbl.gov> wrote:
> On Tue, Apr 29, 2014 at 12:33 PM, Sam Williams <swwilliams at lbl.gov> wrote:
> I think there is a difference in saying
> - we need a high-quality implementation to tease out the kiviat characteristics in order to showcase the compute requirements of FEM
> - we need a high-quality reference implementation that's been manually optimized for Mira.
> I'm not sure what the issue is here. This is portable in that it is not compiled for non-BGQ targets. We may not need to deploy the BGQ optimization in the distribution, but why not distribute it if we have it?
> Jed seemed to be saying that we need a factor of 2x on _all_ platforms in the reference implementations we provide. That may not be what Jed meant, but we are most likely not going to have the resources to do this. You two have delivered more than we need, probably, but there is nothing wrong with too much power. This is really up to you and Jed, as you two are doing all the real work. I think we need to be '[well] within 2x' in the reference implementation for _a_ machine, and preferably on many/most/all of the major platforms. Jed has the bad luck that the kiviat data is on an uncooperative machine, but it is an important platform anyway.
> I don't think you should feel that you need to do everything well. As I said, I think you two are both doing more than a solid job on the reference implementations, and there will have to be a model for allowing user/center/vendor contributions. So I would not worry about it too much. That said, I do and will salivate at the prospect of good Phi and CUDA implementations, but don't feel pressure to deliver them.
> On Apr 29, 2014, at 9:22 AM, Jed Brown <jed at jedbrown.org> wrote:
> > Brian Van Straalen <bvstraalen at lbl.gov> writes:
> >> Jed Brown 7b171e1 fe: loop optimizations to TensorContract_QPX
> >> 28 Apr 2014
> >> Jed Brown 339691b fe: initial QPX version of tensor contraction
> >> 28 Apr 2014
> >> Jed Brown b1189a3 make: remove redundant link flags
> >> 28 Apr 2014
> >> This seems like a pretty unportable benchmark idea. Does HPL do this
> >> for the download version? Or are these commits to the research
> >> branch?
> > We need a "high-quality" implementation. The XL compiler is amazingly
> > terrible at producing decent code for the small tensor contractions
> > (versus gcc on x86, which does quite well). Consequently, any
> > performance counter data on BG/Q is entirely measuring the compiler (by
> > about an order of magnitude). The fact that code is easier to optimize
> > on Intel than BG/Q is well-known, but we can't have a credible benchmark
> > without decent code-gen there. I don't want private vendor
> > implementations to be an order of magnitude faster than what people can
> > run for themselves. (A modest difference is unavoidable.)
> > Note that this stuff is not compiled when running on other
> > architectures, so does not impact portability. I also have Intel
> > AVX/FMA intrinsics which are easier to work with and produce nearly 2x
> > speedup over vanilla C (and GCC produces better code than ICC).
> > _______________________________________________
> > HPGMG-Forum mailing list
> > HPGMG-Forum at hpgmg.org
> > https://hpgmg.org/lists/listinfo/hpgmg-forum