This note provides a short benchmark comparison of a number of Pentium, Athlon, Opteron and Xeon processors in Linux PCs, as well as some older compute servers such as the SGI Origin 2000 and the IBM SP2/SP3. Rather than relying on vendor-provided, clock-speed-based performance estimates, I've timed critical parts of the production codes that we use in geophysical modeling.
In summary, the Athlon and Pentium processors appeared to be quite comparable at similar (scaled) clock speeds through the beginning of 2002. Since then the Pentium 4 chips have gained considerable speed over similar Athlons. Large gains can be made by carefully tuning the optimization flags and by using optimized BLAS. The smaller cache on the faster Pentiums and Athlons (256 kbyte L2 instead of 512 kbyte) means that performance does not necessarily scale with CPU speed, although this drawback can potentially be mitigated by carefully tuned BLAS.
The IBM results (case 2 only) are based on the portable (single CPU) version of the code as well as on the optimized skyline solver in ESSL. This makes a rather large difference (1.9 s versus 0.78 s on spe17).
beatrice: SGI Origin 2000 w/ MIPS R10000 @ 195 MHz 4 Mbyte L2 cache. Irix 6.5.
panoramix: Pentium III @ 450 MHz 512 kbyte L2 cache, 100 MHz bus. Redhat 6.2.
toutatis: Pentium III @ 533 MHz 512 kbyte L2 cache, 133 MHz bus. Redhat 6.2.
gemini: AMD Athlon @ 800 MHz 512 kbyte L2 cache, SuSE 6.3
sanglier: AMD Athlon @ 900 MHz 256 kbyte L2 cache, RedHat 7.0 (+DDRAM)
dualin: AMD Athlon @ 1600 MHz 256 kbyte L2 cache, Redhat 7.1 (+DDRAM)
spe01: IBM SP2 166 MHz Power2 processor
spe17: IBM SP3 375 MHz Power3 processor
virgilio: SGI O2 MIPS R12000 @ 300 MHz, 1 Mbyte L2 cache, Irix 6.5
dogmatix: Sun Blade 100 UltraSparc IIe ? MHz, L2: 256 Kbyte Solaris 8
idefix: Sun Ultra 10 UltraSparc IIi 300 MHz L2: 1 Mbyte Solaris 8
valkyrie: Sun Ultra 10 UltraSparc IIi 440 MHz L2: 2 Mbyte Solaris 8
geowall: Sun Netra T1 200 UltraSparc-IIe 500 MHz, L2: 256 Kbyte Solaris 8
big sun: Sun Fire 280R, UltraSparc-III 900 MHz, L2: 1 Gbyte+ Solaris 8
new sanglier: Intel P4 2.0GHz L2: 512 Kbyte RedHat 7.3 1 Gbyte DDRAM
scott_amd: AMD Athlon 2000+ (1667 MHz) L2: 256 Kbyte
jetfire: AMD Athlon XP 2200+ (1800 MHz) L2: 256 Kbyte, 1 Gbyte DDRAM, Asus VIA A7V33, Redhat 8.0, gcc 3.2
sanglier-3: Pentium 4 3.2 GHz L2: 2048 Kbyte, 2 Gbyte DDR2, RHE3.5
optimusprime: 2x AMD Athlon MP 2200+ (1800 MHz) L2: 256 Kbyte, 1 Gbyte DDRAM, Asus A7M266D (AMD 760MPX), Redhat 8.0, gcc 3.2
helmholtz: 2x Intel Xeon 2.4 GHz L2: 512K, 2 Gbyte DDRAM, Redhat 7.3
starscream: Pentium 4 2.4 GHz L2: 512 Kbyte, 1 Gbyte DDRAM, Asus P4PE (Intel 82845PE), Redhat 8.0, gcc 3.2
trans01: Pentium 4 2.53 GHz L2: 512 Kbyte, 1 Gbyte DDRAM, Redhat 8.0, board unknown
trans02: Pentium 4 2.4 GHz L2: 512 Kbyte, 2 Gbyte DDRAM, Redhat 7.3, board unknown
trans4: Pentium 4 2.53 GHz L2: 512 Kbyte, 2 Gbyte DDRAM, Redhat 7.3, Intel 850
hotamd: AMD Opteron 240 1.4 GHz L2: 1 Mbyte, 2 Gbyte RAM, Redhat
suske: AMD64 Opteron 1.8 GHz L2: 1 Mbyte, 4 Gbyte DDRAM, Redhat professional 9.0 (2.4.21-9.EL), Tyan S2880
kawamori: Dual core AMD Opteron (2x1.8 GHz) L2: 1 Mbyte, 2 Gbyte DDRAM, Fedora Core 3
nyxxeon: Dual dual core Intel Xeon 5610 (4x3.0 GHz) L2: 4 Mbyte, 8 Gbyte DDRAM
nyx240: Dual single core Opteron 240 (2x1.4 GHz) L2: 1 Mbyte, 2 Gbyte DDRAM
nyx2218: Dual dual core Opteron 2218 (4x2.6 GHz) L2: 1 Mbyte, 8 Gbyte DDRAM
nyx2220: Dual dual core Opteron 2220 (4x2.8 GHz) L2: 1 Mbyte, 8 Gbyte DDRAM
legato: Dual dual core Opteron 2200 (4x2.8 GHz) L2: 1 Mbyte, 16 Gbyte DDRAM
flux: Dual six core Intel Nehalem (Xeon X5650 @ 2.67 GHz), 12 Mbyte cache, 50 Gbyte RAM
On the Linux machines the tests were performed with code compiled by the OS-provided GNU compilers, the Portland Group compilers (www.pgroup.com), the free but unsupported Intel compilers (www.intel.com) and the PathScale EKOPath compilers (www.pathscale.com). For SGI, IBM and Sun I used the local optimizing compilers. Four specific cases were investigated. Two are based on long-standing, small problems that give a good idea of the CPU performance that can be expected for finite element modeling. The other two are simple tests based on full production codes for geophysical fluid dynamics and solid state physics.
Case 1: Linpack: Linpack-style benchmark. Solution of an NxN full matrix system using Linpack/BLAS. The Linpack routines themselves are not optimized, so this really gives you the performance that you'd get when implementing linear algebra routines on top of (optimized) BLAS. Perhaps not the most elegant approach, but fairly typical of most user codes.
Case 2: Profile: decomposition (and solution) of a linear system stored in profile (skyline) format. This is the workhorse for 2D finite element codes that solve the Navier-Stokes equations with the penalty function method.
Case 3: SEPRAN: 2D thermal convection using a finite element approach. Both the penalty function method (direct solution) and the integrated method (with CG-style iterative techniques) are tested.
Case 4: VASP: computation of the vibrational spectrum of MgO and perovskite.
Compilers used:
SGI: f77 = MIPS f77 compiler;
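PC: g77 = GNU Fortran (2.96?); gfortran = GNU Fortran 95; pgf77/pgf90/pgf95 = Portland Group Fortran compilers; ifc/ifort = Intel compilers (ifort is the newer name); pathf90 = PathScale compiler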
IBM: xlf
Sun: f77
We use optimized BLAS (where indicated) from the ATLAS project (-latlas) or the libraries developed by Kazushige Goto (-lgoto). On the 64-bit Opteron we also tried the AMD Core Math Library (-lacml), which is advertised as providing optimized BLAS.
Notes:
* -fETC means the combination of -ffast-math -fexpensive-optimizations -funroll-all-loops
** -latlas means use of optimized BLAS routines from Atlas project (www.netlib.org/atlas)
*** same, but now compiled with 3DNow! support
**** -all for Intel is: -O3 -unroll -vec -pad -axM -ipo
-latlasP4SSE2: precompiled blas from the Atlas project (www.netlib.org/atlas) for the Pentium 4.
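Case 1 is described as a Linpack-style solve in which unoptimized Linpack routines spend their time in BLAS calls. The original Fortran driver is not part of this note, so the following is only a minimal Python sketch of that structure, assuming a column-oriented LU factorization with partial pivoting in the style of Linpack's DGEFA/DGESL. The helper names (`idamax`, `daxpy`, `dgefa`, `dgesl`, `linpack_style_run`) merely mirror the BLAS/Linpack naming; they are hypothetical pure-Python stand-ins, not bindings to any library, and the timings they produce are not comparable to the table.

```python
import random
import time


def idamax(x, start):
    """Index of the largest |x[i]| for i >= start (stand-in for BLAS IDAMAX)."""
    best = start
    for i in range(start + 1, len(x)):
        if abs(x[i]) > abs(x[best]):
            best = i
    return best


def daxpy(alpha, x, y, lo, hi):
    """y[i] += alpha * x[i] for lo <= i < hi (stand-in for BLAS DAXPY)."""
    for i in range(lo, hi):
        y[i] += alpha * x[i]


def dgefa(cols):
    """Column-oriented LU with partial pivoting, modeled on Linpack DGEFA.

    cols[j] is column j of the matrix and is overwritten by U and the
    (negated) multipliers of L. Returns the pivot row chosen at each step.
    """
    n = len(cols)
    piv = [n - 1] * n
    for k in range(n - 1):
        col_k = cols[k]
        p = idamax(col_k, k)
        piv[k] = p
        if col_k[p] == 0.0:
            raise ZeroDivisionError(f"zero pivot in column {k}")
        col_k[k], col_k[p] = col_k[p], col_k[k]
        t = -1.0 / col_k[k]
        for i in range(k + 1, n):          # DSCAL on the multipliers
            col_k[i] *= t
        for j in range(k + 1, n):          # one DAXPY per remaining column
            col_j = cols[j]
            t = col_j[p]
            col_j[p], col_j[k] = col_j[k], t
            daxpy(t, col_k, col_j, k + 1, n)
    return piv


def dgesl(cols, piv, b):
    """Solve A x = b using the factors from dgefa (modeled on Linpack DGESL)."""
    n = len(cols)
    x = list(b)
    for k in range(n - 1):                 # forward sweep with L and the row swaps
        p = piv[k]
        t = x[p]
        x[p], x[k] = x[k], t
        daxpy(t, cols[k], x, k + 1, n)
    for k in range(n - 1, -1, -1):         # back substitution with U
        x[k] /= cols[k][k]
        daxpy(-x[k], cols[k], x, 0, k)
    return x


def linpack_style_run(n, seed=42):
    """CPU-time factor + solve for a random n x n system whose solution is all ones."""
    rng = random.Random(seed)
    cols = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]
    b = [sum(cols[j][i] for j in range(n)) for i in range(n)]   # b = A @ ones
    t0 = time.process_time()
    x = dgesl(cols, dgefa(cols), b)
    cpu = time.process_time() - t0
    return cpu, max(abs(xi - 1.0) for xi in x)


if __name__ == "__main__":
    for n in (100, 200):
        cpu, err = linpack_style_run(n)
        print(f"N={n}: {cpu:.2f} s CPU, max error {err:.1e}")
```

Because nearly all of the work happens in the DAXPY-like sweeps down each column, linking a tuned BLAS speeds up the whole solve without touching the driver; this is the kind of effect the -latlas and -lgoto rows are probing.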
Case 1 (Linpack):
machine compiler CPU time (s) improvement over best beatrice
N = 300 1000 2000
beatrice f77 -O3 -64 0.23 11.06 104.7 1.00
f77 -O3 -64 -lblas 0.23 10.90 105.4 0.99
panoramix g77 -O 0.4 17.4 159.2
g77 -O -fETC* 0.28 15.8 127.1
ifc -all**** 0.27 12.7 115.0
ifc -all -tpp6 0.26 12.7 114.8 0.91
toutatis g77 -O3 0.44 22.8 190.6
g77 -O2 -fETC* 0.28 15.6 127 0.82
pgf77 -fast -O2 0.29 15.3 127.6
g77 -O 0.38 20.8 170
gemini g77 -O3 0.24 15.4 125.5
g77 -O2 -fETC* 0.21 12.8 104.7
g77 -O -fETC* 0.21 12.8 104.0
pgf77 -fast -O2 0.16 10.4 81 1.30
pgf77 -fast -lblas 0.2 12.5 100.7
sanglier g77 -O 0.20 10.0 82.3
g77 -O -atlas 0.22 10.3 83
g77 -O -fETC 0.17 7.8 63.4 1.67
ifc -O3 -atlas 0.22 10.4 83.9
ifc -O3 -vec 0.17 7.8 63.5 1.67
dualin g77 -O3 0.15 6.6 52.9 2.00
idefix f77 -fast 0.3 19.8 160.4 0.65
dogmatix f77 -fast 0.3 16.9 137.4 0.76
valkyrie f77 -fast 0.14 14.5 124.2 0.85
geowall f77 -fast 0.28 14.7 119.46 0.88
big sun f77 -fast 0.1 3.9 82.4 1.30
new sanglier
ifc -O3 -axW -tpp7 -vec \
-unroll -mp 0.07 4.3 38.8
ifc -O3 -axW -tpp7 -vec \
-unroll -mp -ipo 0.07 4.3 38.8
g77.3.1 -O3 0.07 4.2 38.5
scott_amd
g77.2.96.1 -O3 0.15 6.5 50.2
ifc -O3 0.14 6.0 44.0 2.38
ifc -O3 -axW -vec \
-ipo -mp 0.16 7.0 52.7
01/2003
starscream
g77.3.2 -O3 0.06 3.3 28.1
g77.3.2 -O3 -atlas 0.05 3.5 27.0
ifc7 -O3 (-mp) 0.06 3.9 30.8
ifc7 -O3 -atlas (-mp) 0.06 3.9 30.9
optimusprime
g77.3.2 -O3 0.13 6.2 50.3
g77.3.2 -O3 -atlas 0.12 5.0 40.1
jetfire
g77.3.2 -O3 0.11 5.0 39.9
g77.3.2 -O3 -atlas 0.12 4.4 33.9
helmholtz
ifc -O3 -axW -vec \
-ipo -mp 0.05 3.1 23.3
hotamd g77.3.3.2 -O 0.36 2.8 26.7
g77.3.3.2 -atlas 0.48 2.6 19.3 (atlas3.5.6: HAMMER64SSE2_2)
g77.3.3.2 -goto 0.23 3.3 24.9
04/2003 (all machines in this batch tested without X running):
starscream
g77.3.2 -O3 0.06 3.87 28.1
g77.3.2 -O3 -atlas 0.05 3.55 27.6
ifc7 -O3 (-mp)
ifc7 -O3 -atlas (-mp)
trans01 g77.3.2 -O3 -fetc 0.06 4.06 27.6
g77.3.2 -O3 -atlas 0.06 4.08 27.8 ?
ifc -O3 (-mp) 0.06 4.06 27.6
ifc -O3 -atlas (-mp) 0.05 3.75 26.8
trans02 g77.3.2 -O3 -fetc 0.08 3.68 29.5
g77.3.2 -O3 -atlas 0.06 3.73 29.6 ?
ifc -O3 (-mp) 0.06 3.76 29.6
ifc -O3 -atlas (-mp) 0.05 3.53 29.4 ?
trans4 g77.2.96 0.04 1.95 15.21
g77.2.96 -atlas 0.04 1.90 15.8
04/2004:
suske (X) g77.3.2 -O3 0.03 1.70 14.1
g77.3.2 -O3 -latlas 0.04 1.24 10.1
g77.3.2 -O3 -lgoto 0.03 1.68 14.0
compute-0-0 (no X)
pgf77 -O3 0.06 1.96 14.7
pgf77 -O3 -lgoto 0.04 1.94 14.7
pgf77 -O3 -lblas 0.04 1.94 18.9 (why?)
08/2005:
sanglier-3 (with X)
pgf77 -fastsse -O3 -latlas 0.02 1.36 11.5 (Linux_P4SSE2)
g77 -O3 -latlas 0.03 1.49 11.6
12/2005
kawamori (no X)
pgf77 -O3 -fastsse -lblas 0.025 1.71 13.1 (PGI supplied blas)
pgf77 -O3 -fastsse -lacml 0.025 1.72 12.8 (PGI supplied acml)
pgf77 -O3 0.025 1.56 12.43
pgf77 -O3 -latlas 0.030 1.24 9.97 (Linux_HAMMER64SSE2)
pgf77 -O3 -fastsse -latlas 0.030 1.25 9.90
gfortran -O3 -lacml 0.025 1.58 12.58 (acml from AMD)
gfortran -O3 0.025 1.54 12.29
gfortran -O3 -lgoto 0.025 1.39 10.78
gfortran -O3 -latlas 0.025 1.56 9.96
pathscale -O3 0.025 1.59 12.33
pathscale -Ofast 0.025 1.58 12.29
pathscale -Ofast -lacml 0.025 1.59 12.69 (AMD supplied)
pathscale -O3 -lgoto 0.019 1.39 10.80
pathscale -O3 -latlas 0.030 1.25 9.89
08/2007
miyazaki (single dual core Opteron 1.8GHz 8 Gbyte DDRAM)
gfortran -O3 0.03 2.10 16.6
gfortran -O3 -atlas 0.04 1.49 13.03
gfortran -O3 -goto 0.02 2.0 15.4
pgf95 -O3 0.03 2.14 16.9
pgf95 -O3 -atlas 0.04 1.51 13.0
pgf95 -O3 -goto 0.02 1.87 14.4
nyx240 BEST: 14.4
pgi -O3 0.034 2.59 22.35
pgi -O3 -atlas 0.048 1.72 14.4
pgi -O3 -goto 0.025 1.98 16.1
nyxxeon BEST: 9.0
pgi -O3 0.085 0.72 9.41
pgi -O3 -atlas 0.015 0.71 9.05
pgi -O3 -acml 0.013 0.85 9.86
pgi -O3 -goto 0.058 0.65 9.05
ifort -O3 0.089 0.74 9.28
ifort -O3 -fast 0.063 0.68 9.15
ifort -O3 -atlas 0.098 0.71 9.01
nyx2218 BEST: 7.0
pgi -O3
pgi -O3 -acml 0.018 1.40 10.92
pgi -O3 -atlas 0.020 0.86 7.00
pgi -O3 -goto 0.013 1.16 8.92
nyx2220 BEST: 7.0
pgi -O3 0.016 1.32 10.71
pgi -O3 -acml 0.017 1.27 10.31
pgi -O3 -atlas 0.017 0.82 7.06 (!)
pgi -O3 -goto 0.012 1.19 9.62
06/2008
legato: BEST 6.71
ifc -O3 0.017 1.24 9.80
ifc -O3 -acml 0.017 1.22 9.76
ifc -O3 -goto 0.012 1.04 8.00
ifc -O3 -atlas 0.021 0.86 6.78
pgi -O3 -fastsse 0.016 1.27 10.03
pgi -O3 .. -acml 0.018 1.24 9.80
pgi -O3 -atlas1 0.021 0.83 6.68 using default atlas (rpm)
pgi -O3 -atlas2 0.016 1.04 8.04 using 'tuned' atlas (/opt/packages)
gfortran -O3 0.017 1.23 9.91
gfortran -O3 -atlas1 0.021 0.86 6.71 <
gfortran -O3 -atlas2 0.024 0.84 6.71
gfortran -O3 -goto 0.013 1.00 7.74
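Case 2 factors a matrix held in profile (skyline) storage; the Fortran routine PROFGG used for most of these runs is not reproduced in this note. As a minimal Python sketch of the technique, assuming a symmetric matrix (a nonsymmetric profile solver would also have to store the upper profile) and a row-wise lower-profile layout, an LDL^T factorization and solve could look like this. The names `first`, `rows`, `profile_ldlt` and `profile_solve` are hypothetical.

```python
def profile_ldlt(first, rows):
    """In-place LDL^T factorization of a symmetric matrix in profile storage.

    rows[i] holds A[i][first[i]], ..., A[i][i] (lower triangle, diagonal last).
    On return rows[i][:-1] holds row i of the unit lower triangular L and
    rows[i][-1] holds D[i].
    """
    for i, ri in enumerate(rows):
        fi = first[i]
        # pass 1: g_ij = a_ij - sum_k g_ik * l_jk over the overlap of both profiles
        for j in range(fi, i):
            rj, fj = rows[j], first[j]
            s = ri[j - fi]
            for k in range(max(fi, fj), j):
                s -= ri[k - fi] * rj[k - fj]
            ri[j - fi] = s
        # pass 2: l_ij = g_ij / d_j and d_i = a_ii - sum_j l_ij * g_ij
        d = ri[i - fi]
        for j in range(fi, i):
            g = ri[j - fi]
            l_ij = g / rows[j][-1]
            d -= l_ij * g
            ri[j - fi] = l_ij
        if d == 0.0:
            raise ZeroDivisionError(f"zero pivot in row {i}")
        ri[i - fi] = d


def profile_solve(first, rows, b):
    """Solve A x = b with the factors left in rows by profile_ldlt."""
    x = list(b)
    for i, ri in enumerate(rows):              # L y = b
        fi = first[i]
        for k in range(fi, i):
            x[i] -= ri[k - fi] * x[k]
    for i, ri in enumerate(rows):              # z = D^-1 y
        x[i] /= ri[-1]
    for i in range(len(rows) - 1, -1, -1):     # L^T x = z
        fi, ri = first[i], rows[i]
        for k in range(fi, i):
            x[k] -= ri[k - fi] * x[i]
    return x


if __name__ == "__main__":
    # tridiag(1, 4, 1) in profile form: row 2 starts at column 1
    first = [0, 0, 1]
    rows = [[4.0], [1.0, 4.0], [1.0, 4.0]]
    profile_ldlt(first, rows)
    print(profile_solve(first, rows, [5.0, 6.0, 5.0]))   # exact solution: [1, 1, 1]
```

Only the entries inside each row's profile are stored, and fill-in never leaves the profile, which is why the inner loop can start at max(first[i], first[j]). How tightly the profile hugs the diagonal (i.e. the node numbering) therefore controls both the memory use and the run time of this kind of solver.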
Case 2 (Profile):
machine compiler CPU time (s) improvement over beatrice
beatrice f77 -O3 -64 -lblas 4.92 1.00
panoramix pgf77 -fast -O3 -lblas 5.13
pgf77 -fast -O3 -latlas** 4.73 1.04x
g77 -O 5.13
ifc -O3 4.23
ifc -all -tpp6 4.20
toutatis pgf77 -fast -O -latlas** 4.22
pgf77 -fast -O3 -lblas 5.13
g77 -O 5.07 1.17x
g77 -O -latlas** 4.30
gemini pgf77 -fast -O3 -lblas 3.68
pgf77 -fast -O3 -latlas** 2.90
g77 -O 2.87
g77 -O3 -fETC* 3.83
g77 -O2 -fETC* 3.79
g77 -O -latlas** 2.88 1.71x
sanglier g77 -O -latlas*** 2.87
ifc -all 2.78 1.77x
ifc -all -latlas 2.89
dualin g77 -O3 2.31 2.13x
spe04 xlf -O4 -lblas 12.0 0.41x
xlf -O4 -lessl 12.8
spe17 xlf -lblas 2.7
xlf -O4 -lblas 2.2
xlf -O4 -lessl 1.9 2.60x
Using profile solver DSKFS from ESSL instead of PROFGG:
spe17 xlf -O4 -lessl 0.78 6.30x
virgilio f77 -O3 -G0 -static 3.59 1.37x
dogmatix f77 -O3 8.23
f77 -fast 5.43 0.91x
idefix f77 -fast 6.06 0.81x
valkyrie f77 -fast 3.28 1.50x
geowall f77 -fast 5.00 1.02x
big sun f77 -fast 2.55 1.92x
new sanglier
ifc -O3 -ipo -axW -vec 1.00 (floating point inaccurate: 1e-5)
ifc -O3 -ipo -axW -vec -mp1 1.08 (floating point inaccurate: 1e-5)
ifc -O3 -mp 1.35 (accurate to 7 digits)
ifc -O3 -ipo -axW -vec -mp 1.28 (accurate to 7 digits)
ifc -O3 -ipo -axW -vec \
-unroll -tpp7 -mp 1.02 (accurate to 7 digits)
ifc -O3etc -latlasP4SSE2 1.00 (accurate to 7 digits)
g77.2.26.1 -O3 -fetc 1.21 (accurate to 7 digits)
g77.2.26.1 -O3 1.12 (accurate to 7 digits)
g77.2.96.1 -O3 -latlasP4SSE2 1.12 (accurate to 7 digits)
g77.3.1 -O3 1.05 (accurate to 7 digits) 4.68
g77.3.1 -O3 -latlasP4SSE2 1.06 (accurate to 7 digits)
scott_amd
f77.2.96.1 -O3 -fetc 2.49 (accurate to 7 digits)
ifc -axW -vec -ipo -O3 -mp 2.40 (accurate to 7 digits) 2.05
ifc -axW -vec -ipo -O3 1.91 (1e-5 inaccuracy)
01/2003
starscream
f77.3.2 -fetc 1.04
f77.3.2 -O3 -atlas 0.92
optimusprime
f77.3.2 -fetc 2.67
f77.3.2 -O3 -atlas 2.17
jetfire
f77.3.2 -fetc 2.1
f77.3.2 -O3 -atlas 1.7
04/2003
helmholtz
ifc -axW -vec -ipo -O3 -mp 1.17 (accurate to 7 digits)
ifc -axW -vec -ipo -O3 0.85 (1e-5 inaccuracy)
hotamd g77.3.3.2 -fetc 0.82
g77.3.3.2 -fetc -atlas 0.63
04/2003 (all machines in this batch tested without X running):
starscream
g77.3.2 -fetc 0.97
g77.3.2 -fetc -atlas 0.86
ifc7 -O3 -vec.. 0.88 (imprecise)
ifc7 -mp -O3 1.14
ifc7 -mp -O3 -atlas 0.90
trans01 g77.3.2 -fetc 1.00
g77.3.2 -fetc -atlas 0.86
ifc7 -O3 -vec.. 0.85 (imprecise)
ifc7 -mp -O3 1.11
ifc7 -mp -O3 -atlas 0.86
trans02 g77.3.2 -fetc 0.99
g77.3.2 -fetc -atlas 0.94
ifc7 -O3 -vec.. 0.89 (imprecise)
ifc7 -mp -O3 1.16
ifc7 -mp -O3 -atlas 0.91
trans4 g77.3.1 -goto 0.70
g77.3.1 -atlas 0.85
g77.3.1 0.82
04/2004
suske g77.3.2.3 -fetc 0.64
g77.3.2.3 -O3 -latlas 0.48 10.3x
g77.3.2.3 -O3 -lgoto 0.44 11.2x
compute-0-0 (no X)
pgf77 -fastsse 1.01
pgf77 -O3 -latlas 0.81
pgf77 -O3 -lgoto 0.63
08/2005
sanglier-3 (with X) BEST: 0.49
gcc -O3 0.60
gcc -O3 -fetc 0.65
gcc -O3 -atlas 0.54
pgf77 -fastsse -O3 -atlas 0.49
12/2005
kawamori (no X) BEST: 0.43
gfortran -O3 -latlas 0.50
gfortran -O3 -lacml 0.65 (AMD provided)
gfortran -O3 -lgoto 0.43
pgf77 -fastsse -O3 -atlas 0.52
pgf77 -fastsse -O3 -lblas 0.69
pgf77 -fastsse -O3 -lacml 0.68 (PGI provided)
pgf77 -fastsse -O3 -lgoto 0.43
pathscale -Ofast 0.64
pathscale -O2 -lacml 0.64 (AMD provided)
pathscale -O2 -latlas 0.51
pathscale -O2 -lgoto 0.44
08/2007
nyx240
pgi -O3 0.89
pgi -O3 -acml 0.89
pgi -O3 -atlas 0.68
nyxxeon BEST: 0.22
pgi -O3 0.34
pgi -O3 -acml 0.39
pgi -O3 -atlas 0.29
pgi -O3 -goto 0.22
ifort -O3 -fast 0.28
ifort -O3 -fast -acml 0.42
ifort -O3 -fast -atlas 0.30
ifort -O3 -fast -goto 0.23
nyx2218 BEST: 0.31
pgi -O3 0.39
pgi -O3 -acml 0.40
pgi -O3 -atlas 0.46
pgi -O3 -goto 0.31
nyx2220 BEST: 0.29
pgi -O3 0.35
pgi -O3 -acml 0.37
pgi -O3 -atlas 0.42
pgi -O3 -goto 0.29
06/2008
legato
ifort -O3 0.45
ifort -O3 -atlas 0.34
ifort -O3 -goto 0.31
pgi -O3 0.38
pgi -O3 -acml 0.38
pgi -O3 -atlas 0.33
pgi -O3 -goto 0.30
gfortran -O3 0.46
gfortran -atlas 0.35
gfortran -goto 0.30
11/2010
flux
ifort -O3 0.30
ifort -fast 0.29
ifort -fast -goto 0.23
ifort -O3 -mkl 0.22
ifort -fast -mkl 0.20
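For Case 3 (SEPRAN), the integrated method relies on CG-style iterative solvers. The note does not say which variant or preconditioner SEPRAN uses, so the following is only a minimal sketch of plain, unpreconditioned conjugate gradients for a symmetric positive definite matrix, written matrix-free; the 1D Laplacian test operator and all function names are illustrative assumptions. Nonsymmetric systems (e.g. with convection terms) would need a variant such as BiCGSTAB or GMRES instead.

```python
import math


def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(matvec, b, tol=1e-10, max_iter=None):
    """Unpreconditioned CG for a symmetric positive definite operator.

    matvec(v) must return A @ v as a list. Iteration stops once
    ||r|| <= tol * ||b|| or after max_iter steps. Returns (x, iterations).
    """
    n = len(b)
    if max_iter is None:
        max_iter = 10 * n
    x = [0.0] * n
    r = list(b)                    # residual b - A x for the zero start vector
    p = list(r)                    # first search direction
    rr = dot(r, r)
    stop = tol * math.sqrt(dot(b, b))
    iterations = 0
    while math.sqrt(rr) > stop and iterations < max_iter:
        ap = matvec(p)
        alpha = rr / dot(p, ap)    # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rr_new = dot(r, r)
        beta = rr_new / rr         # keeps the new direction A-conjugate to the previous ones
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
        iterations += 1
    return x, iterations


def laplacian_1d(v):
    """Matrix-free product with the SPD tridiagonal matrix tridiag(-1, 2, -1)."""
    n = len(v)
    return [2.0 * v[i]
            - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]


if __name__ == "__main__":
    n = 50
    b = laplacian_1d([1.0] * n)    # exact solution is all ones
    x, its = conjugate_gradient(laplacian_1d, b)
    print(its, "iterations, max error", max(abs(xi - 1.0) for xi in x))
```

Each iteration costs one matrix-vector product plus a few vector updates, so the work per step scales with the number of nonzeros rather than with the matrix profile that governs the cost of the direct (penalty function) solve.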
Case 3 (SEPRAN) notes:
900_4: direct solution technique, penalty function, quadratic triangles
901_7: iterative solution technique, integrated method, quadratic triangles
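CPU time (s) for 900_4 (20x20, 60x60) and 901_7 (20x20, 40x40); a trailing factor is the improvement over beatrice for 901_7 at 40x40
machine compiler 900_4 20x20 900_4 60x60 901_7 20x20 901_7 40x40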
beatrice f77 -O3 -64 -lblas 4.38 84.79 10.15 90.3
panoramix g77 -O 4.07 110.41 13.68 111.4
toutatis g77 -O 3.62 12.91
gemini g77 -O 2.14 89.57 8.89 70.8 1.28
sanglier g77 -O -latlas** 1.76 42.4 7.43 56.3
ifc -all 1.07 34.6 4.64 39.5 2.29
new_sanglier
ifc -O1 -mp -zero -save \
-axW -tpp7 0.84 16.6 2.25 16.8
optimusprime
gcc3.2 -O3 -fetc -atlas 0.82 29 3.48 27
jetfire
gcc3.2 -O3 -fetc -atlas 0.81 24 2.99 22.0
Note-1: g77 on the Athlon with Redhat 7.1 behaves erratically for this finite
element code. The code is generally slower, and its performance seems quite
sensitive to where it is compiled and which flags are used. Perhaps this is
related to the gcc 2.96 problems? The Intel compiler is faster and appears a
lot more stable.
Note-2: Aggressive optimization with the Intel compiler (-O3) without -mp
creates problems. It turns out that -lsvml needs to be added at link time to
provide the optimized math routines.
Note-3: As witnessed in this and the previous benchmarks, ifc really takes
some shortcuts with floating point precision. Sometimes -O1 -mp1 looks OK,
but to get ANSI accuracy you should really use -O1 -mp and take the
performance hit.
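Several entries above are labeled "accurate to 7 digits" or flagged with a 1e-5 inaccuracy. The note does not describe how this was measured; a minimal sketch, assuming accuracy is judged by the relative difference between a result and a trusted reference run (for example an -mp build), is shown below. The function names and the 7-digit default are illustrative.

```python
import math


def agreeing_digits(reference, value):
    """Approximate number of significant decimal digits shared with reference."""
    if reference == 0.0:
        raise ValueError("reference must be nonzero for a relative comparison")
    if value == reference:
        return math.inf
    return -math.log10(abs(value - reference) / abs(reference))


def accurate_enough(reference, value, digits=7):
    """True if value agrees with reference to at least `digits` significant digits."""
    return agreeing_digits(reference, value) >= digits


if __name__ == "__main__":
    ref = 1.2345678
    for candidate in (1.2345679, 1.23458):
        print(candidate, round(agreeing_digits(ref, candidate), 1),
              accurate_enough(ref, candidate))
```

In this sketch a relative difference of 1e-5 corresponds to about 5 agreeing digits, which would fail a 7-digit check.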
01/2003
starscream
gcc3.2 -O3 -fetc -atlas 1.3 20.6 2.58 16.3 (install sepran in 7:42)
07/2003
hotamd gcc -O -atlas 0.71 14.7 2.28 16.9 (install sepran in 5:30)
04/2003 (all machines in this batch tested without X running):
starscream gcc3.2 -O3 -fetc -atlas 1.27 19.6 2.48 16.0
gcc3.2 -O3 -fetc 1.25 20.4 2.62 17.4
trans01 gcc3.2 -O3 -fetc -atlas 1.12 21.3 2.62 18.0 (install sepran in 7:35)
trans02 gcc3.2 -O3 -fetc -atlas 1.29 19.3 3.05 20.2 (install sepran in 6:27)
gcc3.2 -O3 -fetc 1.17 22.2 2.97 21.2
ifc7 -O3 -mp -atlas 1.29 19.0 2.49 16.1
ifc7 -O3 -mp 1.39 32.7 2.76 20.9
trans4 gcc3.1 -O3 -fetc -atlas 1.28 18.57 2.86 18.36 (install sepran in 5:43)
04/2004
suske gcc3.3 -O3 -atlas 0.50 10.3 1.35 10.6
08/2005
sanglier-3 (with X)
pgf90 -fastsse -O3 -atlas 0.72 11.72 1.39 9.22 (install sepran in 9:48)
gcc3.2.3 -O3 -atlas 0.80 12.05 1.63 10.81
pgf90 -fastsse -O3 -goto 0.63 10.13 1.22 8.54 (goto_p4_512)
12/2005
kawamori (no X)
pgf90 -fastsse -O3 -atlas 0.70 11.94 1.47 10.73
pgf90 -fastsse -O3 -lgoto 0.68 10.22 1.42 10.25
pathf90 -O3 -OPT:Ofast 0.74 11.00 1.36 9.53
pathf90 -O3 -latlas 0.56 10.45 1.35 10.02
pathf90 -O3 -lgoto 0.48 8.69 1.22 9.22 9.8x
08/2007
miyazaki
pgf90 -O3 -fastsse 0.70 11.70 1.59 11.91
pgf90 -O3 -fastsse -atlas 0.76 12.43 1.64 12.35
pgf90 -O3 -fastsse -goto 0.65 10.29 1.50 11.33
nyx240
pgf90 -O3 0.93 15.22 2.00 15.35
pgf90 -O3 -atlas 0.99 16.11 2.05 15.14
nyxxeon
ifort -O3 0.24 5.07 0.58 5.10
ifort -O3 -atlas 0.24 5.48 0.57 5.12
ifort -O3 -goto 0.24 4.88 0.56 4.93 < (18.3x)
pgf90 -O3 0.37 6.14 0.77 6.18
pgf90 -O3 -atlas 0.41 6.68 0.80 6.33
pgf90 -O3 -goto 0.38 5.97 0.75 6.04
06/2008
legato
gnu -O3 -goto 0.39 6.65 1.01 7.53
gnu -O3 -atlas1 0.44 7.93 1.08 8.12
pgi -O3 -fastsse -goto 0.39 6.49 0.91 6.90
sepran0108
gnu -O3 -goto 0.43 6.91 1.06 7.5
pgi -O3 -fastsse -goto 0.46 7.07 0.99 7.17
Case 4 (VASP) notes:
MgO: small crystal (small problem)
Pv: large perovskite problem
machine compiler MgO CPU Pv CPU
(s) (frac) (s) (frac) [frac = beatrice time / machine time]
beatrice f77 -O3 -lblas 24.0 1 503 1
panoramix pgf77 -fast -lblas 36.3 0.66 654 0.77
gemini pgf77 -fast -lblas 21.2 1.13 587 0.86
pgf77 -fast -latlas 20.6 1.18 470 1.07
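The (frac) columns express speed relative to beatrice, i.e. beatrice's CPU time divided by the machine's CPU time, so values above 1 mean faster than beatrice. A minimal sketch of that arithmetic, using MgO and Pv timings taken from the table above (the function and variable names are illustrative):

```python
def speedup(reference_seconds, machine_seconds):
    """Relative speed versus the reference machine (> 1 means faster)."""
    return reference_seconds / machine_seconds


# CPU times (s) for MgO and Pv, taken from the Case 4 table
beatrice = {"MgO": 24.0, "Pv": 503.0}
runs = {
    "panoramix pgf77 -fast -lblas": {"MgO": 36.3, "Pv": 654.0},
    "gemini pgf77 -fast -lblas": {"MgO": 21.2, "Pv": 587.0},
}

# round to two decimals, as in the (frac) columns
fractions = {
    name: {case: round(speedup(beatrice[case], t), 2) for case, t in times.items()}
    for name, times in runs.items()
}

if __name__ == "__main__":
    for name, fr in fractions.items():
        print(f"{name:30s} MgO {fr['MgO']:.2f}  Pv {fr['Pv']:.2f}")
```

The same ratio against the best beatrice time gives the "improvement over beatrice" factors in the other tables; for example 4.92 s / 2.31 s is about 2.13 for dualin in Case 2.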
machine compiler CPU time (s) improvement over beatrice
beatrice f77 -O3 115.6
panoramix g77 -O3 -fetc 113.7 1.02
scott_amd g77.2.96.1 -O2 48
g77.2.96.1 -O2 -fetc 45
ifc -O2 42 2.8
new_sanglier ifc -O2 24
ifc -O2 -mp 31.0
ifc -O2 -mp -axW \
-unroll 31.4
gcc3.1 -O2 -fetc 27.4 4.2
01/2003
optimusprime gcc3.2 -O3 -fetc 45.3
jetfire gcc3.2 -O3 -fetc 39.2
04/2003
helmholtz ifc -O2 -mp -axW \
-unroll 25.6
01/2003
starscream gcc3.2 -O3 -fetc 23.2 5.0
08/2003
hotamd gcc3.2 -O3 25.0 (double precision)
04/2003: all machines without X
starscream gcc3.2 -O3 -fetc 23.05
ifc7 -O3 -mp 27.
trans01 gcc3.2 -O3 -fetc 22.1
ifc7 -O3 -mp 25.6
trans02 gcc3.2 -O3 -fetc 23.0
ifc7 -O3 -mp 26.7
suske gcc3.2 -O3 23.3 (double precision)
gcc3.2 -O3 17.4 (single precision) 6.6
sanglier-3 (X)
pgf77 -fastsse -O3 11.6
g77.3.2.3 -O3 -fetc 13.8
kawamori gfortran -fetc 17.56 (double precision)
pathf90 -O3 21.18 (double precision)
gfortran -fetc 17.90 (single precision)
pgf90 -O3 -fastsse 13.79 (single precision)
pathf90 -O3 11.91 (single precision) 9.7