[HPGMG Forum] HPGMG release v0.1

Sam Williams swwilliams at lbl.gov
Mon Jun 9 16:25:25 UTC 2014


For phase II (1866), I saw up to 102 GB/s/node.
However, this is what I currently get with my version (I added a dot product to STREAM to exercise read-only operations):

samw@edison12:~/proj/misc/stream> cc -O3 -fno-alias -fno-fnalias -openmp stream.c
samw@nid01282:~/proj/misc/stream> export OMP_NUM_THREADS=24
samw@nid01282:~/proj/misc/stream> export KMP_AFFINITY=scatter
samw@nid01282:~/proj/misc/stream> aprun -n 1  -cc none ./a.out
-------------------------------------------------------------
STREAM version $Revision: 5.8 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 64000000, Offset = 0
Total memory required = 1464.8 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 24
-------------------------------------------------------------
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 13036 microseconds.
   (= 13036 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       63856.1894       0.0161       0.0160       0.0161
Init:       80013.5492       0.0064       0.0064       0.0065
Add:        88402.9165       0.0174       0.0174       0.0174
Triad:      87143.7588       0.0177       0.0176       0.0177
Dot:        96980.3169       0.0106       0.0106       0.0106
-------------------------------------------------------------
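
The rates in this table can be cross-checked from the array size and the reported minimum times. The sketch below assumes the usual STREAM accounting (8-byte words, 1 MB = 10^6 bytes, Copy counted as 2 arrays, Add and Triad as 3). It further assumes that the added Init and Dot kernels are counted as 1 and 2 arrays respectively; that is inferred from the reported numbers, since the modified stream.c is not shown here.

```python
# Cross-check STREAM rates from the array size and the reported minimum times.
N = 64_000_000        # "Array size" from the output above
BYTES_PER_WORD = 8    # "8 bytes per DOUBLE PRECISION word"

# Arrays touched per kernel. Copy/Add/Triad follow standard STREAM accounting;
# Init (1) and Dot (2) are ASSUMPTIONS inferred from the reported rates.
ARRAYS_TOUCHED = {"Copy": 2, "Init": 1, "Add": 3, "Triad": 3, "Dot": 2}

# (reported rate in MB/s, reported min time in s) from the 24-thread run.
REPORTED = {
    "Copy": (63856.1894, 0.0160),
    "Init": (80013.5492, 0.0064),
    "Add": (88402.9165, 0.0174),
    "Triad": (87143.7588, 0.0176),
    "Dot": (96980.3169, 0.0106),
}


def estimated_rate_mb_s(kernel, min_time_s):
    """Bytes moved by one pass of the kernel divided by its best time."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_WORD * N
    return bytes_moved / min_time_s / 1e6


def relative_errors():
    """Relative gap between each estimate and the rate STREAM printed."""
    return {
        kernel: abs(estimated_rate_mb_s(kernel, t) - rate) / rate
        for kernel, (rate, t) in REPORTED.items()
    }


if __name__ == "__main__":
    for kernel, err in relative_errors().items():
        est = estimated_rate_mb_s(kernel, REPORTED[kernel][1])
        print(f"{kernel:6s} estimated {est:10.1f} MB/s  (rel. error {err:.2%})")
```

The estimates only agree with the printed rates to within the rounding of the four-digit minimum times, so small gaps are expected.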




samw@nid01282:~/proj/misc/stream> aprun -n 1  -cc none ./a.out
-------------------------------------------------------------
STREAM version $Revision: 5.8 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 64000000, Offset = 0
Total memory required = 1464.8 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 16
-------------------------------------------------------------
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 12908 microseconds.
   (= 12908 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       67724.7358       0.0152       0.0151       0.0152
Init:       68158.9376       0.0075       0.0075       0.0075
Add:        87128.4377       0.0177       0.0176       0.0177
Triad:      86112.9059       0.0179       0.0178       0.0180
Dot:        96785.8143       0.0106       0.0106       0.0106
-------------------------------------------------------------
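
To compare the 24-thread and 16-thread runs side by side, the result rows can be parsed and converted to GB/s. This is a minimal sketch; `parse_stream_table` and `ratio` are hypothetical helper names, not part of STREAM, and only the rows printed above are used as input.

```python
import re

# Matches result rows such as "Copy:   63856.1894   0.0161   0.0160   0.0161".
ROW = re.compile(r"^(\w+):\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)$")


def parse_stream_table(text):
    """Return {kernel: rate in GB/s} for every result row found in text."""
    rates = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rates[match.group(1)] = float(match.group(2)) / 1000.0
    return rates


def ratio(run_a, run_b):
    """Rate of run_a divided by rate of run_b, per kernel present in both."""
    return {k: run_a[k] / run_b[k] for k in run_a if k in run_b}


RUN_24_THREADS = """
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       63856.1894       0.0161       0.0160       0.0161
Init:       80013.5492       0.0064       0.0064       0.0065
Add:        88402.9165       0.0174       0.0174       0.0174
Triad:      87143.7588       0.0177       0.0176       0.0177
Dot:        96980.3169       0.0106       0.0106       0.0106
"""

RUN_16_THREADS = """
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       67724.7358       0.0152       0.0151       0.0152
Init:       68158.9376       0.0075       0.0075       0.0075
Add:        87128.4377       0.0177       0.0176       0.0177
Triad:      86112.9059       0.0179       0.0178       0.0180
Dot:        96785.8143       0.0106       0.0106       0.0106
"""

if __name__ == "__main__":
    r24 = parse_stream_table(RUN_24_THREADS)
    r16 = parse_stream_table(RUN_16_THREADS)
    for kernel, value in ratio(r24, r16).items():
        print(f"{kernel:6s} 24T {r24[kernel]:6.1f} GB/s  16T {r16[kernel]:6.1f} GB/s  ratio {value:.3f}")
```

With these numbers, Triad, Add, and Dot are nearly identical at 16 and 24 threads, while Init is noticeably faster and Copy somewhat slower with 24 threads.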




On Jun 9, 2014, at 9:19 AM, Jed Brown <jed at jedbrown.org> wrote:

> "Vitali A. Morozov" <morozov at anl.gov> writes:
>> and see that you provide STREAM-based memory bandwidth for some 
>> architectures. I suggest specifying a particular benchmark, say 
>> "triad", because the result of STREAM is benchmark-dependent.
> 
> Yes, I would prefer Triad.
> 
>> For BG/Q, I have measured 29.3 GB/s/node on "triad". For Cray XC30, I 
>> have measured 48.6 GB/s/socket or 97.1 GB/s/node. This is slightly 
>> better than the numbers you have reported.
> 
> I'll update the BG/Q number.  What code is needed to observe this?  (I
> think I've always heard 26-27 GB/s quoted and have not personally
> measured higher.)  It would be helpful to list this somewhere on the
> ALCF website.
> 
> 
> I assume that your 97 GB/s on XC30 was measured using E5-2697v2?  The
> numbers I used come from this page, which quotes STREAM Triad at 89 GB/s.
> 
>  http://www.nersc.gov/users/computational-systems/edison/configuration/
> 
>> For Cray XC30, the flop rate is 518.4 GF per node. For Xeon E5-2697 v2 @ 
>> 2.7 GHz, 
> 
> Edison uses E5-2695v2 (2.4 GHz), thus the somewhat lower number.
> 
>> each core can have 8 Flops/cycle - 4 way FMA - or 8 * 2.7 = 21.6
>> GFlops per core. 12 cores result in 259.2 GFlops per socket, 2 sockets
>> give 518.4 GFlops.
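
The peak-rate arithmetic in this exchange can be written out as a short sketch. It takes the 8 flops/cycle/core figure as quoted (on this processor generation it comes from one 4-wide AVX add plus one 4-wide AVX multiply per cycle rather than a fused multiply-add), and it assumes 12 cores per socket and 2 sockets per node for both parts. The function name is hypothetical.

```python
def peak_gflops_per_node(flops_per_cycle, clock_ghz, cores_per_socket, sockets):
    """Peak double-precision GFlop/s for one node."""
    return flops_per_cycle * clock_ghz * cores_per_socket * sockets


# Figure quoted in the thread: Xeon E5-2697 v2 at 2.7 GHz.
QUOTED_E5_2697V2 = peak_gflops_per_node(8, 2.7, 12, 2)

# Same formula for the E5-2695 v2 at 2.4 GHz that Edison uses
# (12 cores per socket is an assumption stated in the lead-in).
EDISON_E5_2695V2 = peak_gflops_per_node(8, 2.4, 12, 2)

if __name__ == "__main__":
    print(f"E5-2697 v2 @ 2.7 GHz: {QUOTED_E5_2697V2:.1f} GFlop/s per node")
    print(f"E5-2695 v2 @ 2.4 GHz: {EDISON_E5_2695V2:.1f} GFlop/s per node")
```

The 2.4 GHz clock accounts for the lower per-node number on Edison: the same formula gives 460.8 rather than 518.4 GFlop/s.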
> 