[HPGMG Forum] More thoughts on the Kiviats

Theodore Omtzigt theo at stillwater-sc.com
Mon Jun 9 13:16:20 UTC 2014


Jed:

  Good catch on the cache hierarchy: that is multi-core specific; GPUs
and KPUs don't have those hierarchies. What we are really after is a
measure of instruction + operand bandwidth and latency, which has
always been a very complex function of hardware resource availability
and instruction issue timing. Maybe the simplification is offered by the
physical organization of the machine: any machine organization will have
on-chip memories and off-chip memories, and those can be used to capture
operand bandwidth and latency dependencies. In cache hierarchies you can
reasonably ignore the lower levels, as they are designed and dimensioned
to support the data flow from the outer cache into the functional units.
If the hardware folks have done their job, that should be seamless. Then
again, I have spent 15 years finding exactly such cache hierarchy
failures, so maybe it isn't as simple from the outside.
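
To make that concrete, here is a rough Python sketch of probing effective
operand bandwidth as the working set grows, so the on-chip/off-chip
transition shows up as a drop in GB/s. The numpy copy kernel, the buffer
sizes, and the repeat count are illustrative placeholders, not a
calibrated measurement:

    # Sketch: effective bandwidth of a streaming copy vs. working-set size.
    # Sizes and repeat counts are illustrative only.
    import time
    import numpy as np

    def stream_bandwidth(nbytes, repeats=10):
        """Time a buffer copy of nbytes and return effective GB/s."""
        n = nbytes // 8                        # number of float64 elements
        src = np.random.rand(n)
        dst = np.empty_like(src)
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            np.copyto(dst, src)                # stream the working set once
            best = min(best, time.perf_counter() - t0)
        return 2 * nbytes / best / 1e9         # read + write bytes per second

    if __name__ == "__main__":
        for kib in (32, 256, 2048, 16384, 131072):  # spans on-chip and off-chip sizes
            print(f"{kib:>7} KiB: {stream_bandwidth(kib * 1024):7.1f} GB/s")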

The derivative is indeed the quest, but don't get too comfortable with
that either. Because we are talking about a discrete event system, there
is no linearity to speak of. If you deplete a particular resource, your
performance will fall off dramatically, ruining your derivative.
Unfortunately, it is the location of that cliff that matters most to the
application designer, and to any kind of scaling argument or interactive
presentation you would like to offer your customers.
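
A toy model makes the point (all constants below are made up, not
measurements of any machine): the derivative with respect to the depleted
resource is essentially zero on both sides of the cliff and undefined at
it, so only the cliff's location carries information:

    # Toy model of a performance cliff: compute-bound until the working set
    # spills out of on-chip memory, then bandwidth-bound.
    # All constants are made-up illustration values.
    PEAK_FLOPS   = 500e9       # flop/s when operands fit on chip
    OFFCHIP_BW   = 50e9        # bytes/s once we spill
    ONCHIP_BYTES = 32 * 2**20  # on-chip capacity
    INTENSITY    = 0.25        # flop per byte for the kernel

    def throughput(working_set_bytes):
        if working_set_bytes <= ONCHIP_BYTES:
            return PEAK_FLOPS              # resource not yet depleted
        return OFFCHIP_BW * INTENSITY      # cliff: bandwidth now governs

    for mib in (8, 16, 32, 33, 64):
        print(f"{mib:3d} MiB -> {throughput(mib * 2**20) / 1e9:6.1f} Gflop/s")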

I think the solution lies in the 'big data' that will be generated by
running the benchmarks on hundreds or thousands of configurations. Weak
scaling measurements are the best at discovering the cliffs in resource
availability, so instead of asking the hardware or a simulator to
cripple a resource, ask the application to find the resource bottleneck
and hammer it. The scaling behavior of the combined application+hardware
system then needs only a couple of data points at the different cliffs
to give you a pretty good characterization of the dynamic behavior of
the machine.
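
A rough Python sketch of that sweep-and-detect idea follows; run_benchmark,
the pressure values, and the 30% drop threshold are hypothetical
placeholders, not anything in the HPGMG harness:

    # Sweep many configurations, collect (resource pressure, throughput)
    # points, and flag cliffs where throughput drops sharply between
    # neighbouring configurations.
    def find_cliffs(pressures, run_benchmark, drop=0.30):
        """Return all measurements plus the points just after a >30% drop."""
        results = [(p, run_benchmark(p)) for p in sorted(pressures)]
        cliffs = []
        for (p0, t0), (p1, t1) in zip(results, results[1:]):
            if t1 < (1.0 - drop) * t0:     # sharp fall => resource depleted
                cliffs.append((p1, t1))
        return results, cliffs

    # Example with a fake benchmark standing in for the real application:
    fake = lambda p: 100.0 if p <= 64 else 20.0
    points, cliffs = find_cliffs(range(8, 257, 8), fake)
    print("cliffs at:", cliffs)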
 
Theo

On 6/9/2014 8:36 AM, Jed Brown wrote:
> Theodore Omtzigt <theo at stillwater-sc.com> writes:
>
>> As always, depending on the question you need to answer, you want to see
>> different Kiviat graphs. As a hardware designer, I would like to see how
>> well a particular resource allocation is helping the performance of an
>> application. As all machines have a common collection of resources, the
>> Kiviats that I am interested in would capture those, in particular this
>> list:
>>
>> FPU sp peak throughput
>> FPU dp peak  (big difference in silicon allocation)
>> IU peak throughput
>> L1 bw peak (latency is governed by core clock, but bw is parallelism on
>> the cache ports)
>> L2 bw peak (latency becomes an issue further away, but most L2s are
>> supporting fast L1 refresh)
>> L3 bw peak
>> L3 size
>> FSB bw peak
>> Memory latency (for low-concurrency kernels this is the bottleneck)
>> Memory bw peak
>> Network latency
>> Network bw peak
>> Reduction peak throughput
>> Broadcast peak throughput
>>
>> The beauty of this list is that it is uniform across all machines,
>> including non-von Neumann.
>
> This seems to imply a particular cache hierarchy, which may not be
> universal.  I think that to reword your request, we would like to see
> the DERIVATIVE of application performance with respect to each of these
> attributes.  I wrote exactly these words in emails this winter, but I
> don't know how to measure this derivative.  Does there exist a hardware
> platform or simulator capable of incrementally crippling one attribute
> at a time?
>
> If we can measure these quantities for the suite of apps and benchmarks
> on different machines, we could make a very useful interactive
> javascript presentation.
