Linux PC single CPU benchmark comparison.

Peter van Keken, March 2001-June 2008.

1. Introduction

This note provides a short benchmark comparison of a number of Pentium, Athlon and Xeon processors in Linux PCs, as well as some older compute servers such as the SGI Origin 2000 and IBM SP2/SP3. Rather than relying on vendor-provided estimates of clock speed, I've used critical parts of production codes that we use in geophysical modeling.

In summary, the Athlon and Pentium processors appeared to be quite comparable at similar (scaled) clock speeds through the beginning of 2002. Since then the Pentium 4 chips have gained considerable speed over similar Athlons. Large gains can be made by carefully tuning compiler optimization and using optimized BLAS. Because of the smaller cache on the faster Pentiums and Athlons (256k L2 instead of 512k L2), performance may not scale with CPU speed, although this drawback can potentially be mitigated by carefully tuned BLAS.

The results on the IBM (case 2 only) are based on the portable (single CPU) version as well as the optimized skyline solver in ESSL. This makes a rather large difference...

2. Tested configurations.

2a. processors/platforms/OS:

beatrice: SGI Origin 2000 w/ MIPS R10000 @ 195 MHz 4 Mbyte L2 cache. Irix 6.5.
panoramix: Pentium III @ 450 MHz 512 kbyte L2 cache, 100 MHz bus. Redhat 6.2.
toutatis: Pentium III @ 533 MHz 512 kbyte L2 cache, 133 MHz bus. Redhat 6.2.
gemini: AMD Athlon @ 800 MHz 512 kbyte L2 cache, SuSE 6.3
sanglier: AMD Athlon @ 900 MHz 256 kbyte L2 cache, RedHat 7.0 (+DDRAM)
dualin: AMD Athlon @ 1600 MHz 256 kbyte L2 cache, Redhat 7.1 (+DDRAM)
spe01: IBM SP2 166 MHz Power2 processor
spe17: IBM SP3 375 MHz Power3 processor
virgilio: SGI O2 MIPS R12000 @ 300 MHz, 1 Mbyte L2 cache, Irix 6.5
dogmatix: Sun Blade 100 UltraSparc IIe ? MHz, L2: 256 Kbyte Solaris 8
idefix: Sun Ultra 10 UltraSparc IIi 300 MHz L2: 1 Mbyte Solaris 8
valkyrie: Sun Ultra 10 UltraSparc IIi 440 MHz L2: 2 Mbyte Solaris 8
geowall: Sun Netra T1 200 UltraSparc-IIe 500 MHz, L2: 256 Kbyte Solaris 8
big sun: Sun Fire 280R, UltraSparc-III 900 MHz, L2: 1 Gbyte+ Solaris 8
new sanglier: Intel P4 2.0GHz L2: 512 Kbyte RedHat 7.3 1 Gbyte DDRAM
scott_amd: AMD Athlon 2000+ (1667 MHz) L2: 256 Kbyte
jetfire: AMD Athlon XP 2200+ (1800 MHz) L2: 256 Kbyte, 1 Gbyte DDRAM, Asus VIA A7V33, Redhat 8.0, gcc 3.2
sanglier-3: Pentium 4 3.2 GHz L2: 2048 Kbyte, 2 Gbyte DDR2, RHE3.5
optimusprime: 2x AMD Athlon MP 2200+ (1800 MHz) L2: 256 Kbyte, 1 Gbyte DDRAM, Asus A7M266D (AMD 760MPX), Redhat 8.0, gcc 3.2
helmholtz: 2x Intel Xeon 2.4 GHz L2: 512K, 2 Gbyte DDRAM, Redhat 7.3
starscream: Pentium 4 2.4 GHz L2: 512 Kbyte, 1 Gbyte DDRAM, Asus P4PE (Intel 82845PE), Redhat 8.0, gcc 3.2
trans01: Pentium 4 2.53 GHz L2: 512 Kbyte, 1 Gbyte DDRAM, Redhat 8.0 board?
trans02: Pentium 4 2.4 GHz L2: 512 Kbyte, 2 Gbyte DDRAM, Redhat 7.3 board?
trans4: Pentium 4 2.53 GHz L2: 512 Kbyte 2 Gbyte DDRAM Redhat 7.3 Intel 850
hotamd: AMD Opteron 240 1.4 GHz L2: 1 Mbyte, 2 Gbyte RAM Redhat
suske: AMD64 Opteron 1.8 GHz L2: 1 Mbyte, 4 Gbyte DDRAM, Redhat professional 9.0 (2.4.21-9.EL) Tyan S2880
kawamori: Dual core AMD Opteron (2x1.8 GHz) L2: 1 Mbyte, 2 Gbyte DDRAM, Fedora Core 3.
nyxxeon: Dual dual core Intel Xeon 5610 (4x3.0 GHz) L2: 4 Mbyte, 8 Gbyte DDRAM
nyx240: Dual single core Opteron240 (2x1.4 GHz) L2: 1 Mbyte, 2 Gbyte DDRAM
nyx2218: Dual dual core Opteron2218 (4x2.6 GHz) L2: 1 Mbyte, 8 Gbyte DDRAM
nyx2220: Dual dual core Opteron2220 (4x2.8 GHz) L2: 1 Mbyte, 8 Gbyte DDRAM
legato: Dual dual core Opteron2200 (4x2.8 GHz) L2: 1 Mbyte, 16 Gbyte DDRAM
flux: Dual six core Intel Nehalem (Xeon X5650 @ 2.67 GHz; cache 12Mbyte; 50 Gbyte).

2b. Compilers and codes

On the Linux machines the tests were performed with code compiled by the OS-provided GNU compilers, the Portland Group compilers (www.pgroup.com), the free but unsupported Intel compilers (www.intel.com) and the EKO Pathscale compilers (www.pathscale.com). For SGI, IBM, and Sun I used the local optimizing compilers. Four specific cases were investigated. Two are based on long-standing, small problems that give a good idea of the CPU performance that can be expected for finite element modeling. Two others are simple tests based on full production codes for geophysical fluid dynamics and solid state physics.

Case 1: Linpack: Linpack-style benchmark. Solution of an NxN full matrix using Linpack/BLAS. The Linpack routines are not optimized, so this really gives you the performance that you'd get when implementing algebraic routines using (optimized) BLAS. Perhaps not the most elegant, but fairly typical of most user codes.
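
As a rough illustration of the kind of work Case 1 times, here is a minimal Python sketch of a dense solve by Gaussian elimination with partial pivoting. It is not the benchmark code: the names lu_solve and time_solve are made up for this example, and I am assuming the usual Linpack-style factor-and-solve on a random full matrix with a known solution.

```python
import random
import time


def lu_solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    `a` is a list of row lists and `b` a list; both are copied, not modified.
    """
    n = len(a)
    a = [row[:] for row in a]
    b = b[:]
    for k in range(n - 1):
        # Partial pivoting: pick the row with the largest entry in column k.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ZeroDivisionError("matrix is singular")
        if p != k:
            a[k], a[p] = a[p], a[k]
            b[k], b[p] = b[p], b[k]
        # Eliminate column k below the diagonal (one axpy-like update per row).
        row_k = a[k]
        for i in range(k + 1, n):
            m = a[i][k] / row_k[k]
            if m != 0.0:
                row_i = a[i]
                for j in range(k + 1, n):
                    row_i[j] -= m * row_k[j]
                b[i] -= m * b[k]
    # Back substitution on the upper triangle.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = b[i] - sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / a[i][i]
    return x


def time_solve(n, seed=0):
    """Return (cpu seconds, max error) for a random n x n system whose solution is all ones."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [sum(row) for row in a]  # right-hand side chosen so that x = (1, ..., 1)
    t0 = time.process_time()
    x = lu_solve(a, b)
    elapsed = time.process_time() - t0
    return elapsed, max(abs(xi - 1.0) for xi in x)
```

Calling time_solve(100) returns the CPU time and the largest deviation from the known solution. Pure Python is far slower than the compiled Fortran timed in this note, so the sketch only shows the operation being measured, not comparable timings.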

Case 2: Profile: decomposition of a matrix-vector system in profile storage. Workhorse for 2D finite element codes that solve the Navier-Stokes equation with the penalty function method.
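
To make "profile storage" concrete, the sketch below stores each column of a symmetric matrix from its first nonzero row down to the diagonal and factors it in place. This is only an illustration under assumptions of mine: I assume a symmetric positive definite matrix and a Cholesky-type factorization A = U^T U, and the function names are invented for the example. The routines actually timed in Case 2 may differ.

```python
import math


def to_profile(a):
    """Convert a symmetric matrix (list of rows) to profile storage.

    Column j keeps the entries from its first nonzero row first[j] down to
    the diagonal; zeros above first[j] are not stored.
    """
    first, cols = [], []
    for j in range(len(a)):
        f = next((i for i in range(j) if a[i][j] != 0.0), j)
        first.append(f)
        cols.append([a[i][j] for i in range(f, j + 1)])
    return first, cols


def profile_cholesky(first, cols):
    """Factor A = U^T U in place; U has the same profile as A."""
    for j in range(len(cols)):
        fj = first[j]
        for i in range(fj, j + 1):
            fi = first[i]
            s = cols[j][i - fj]
            # Dot product over the rows where columns i and j overlap.
            for k in range(max(fi, fj), i):
                s -= cols[i][k - fi] * cols[j][k - fj]
            if i < j:
                cols[j][i - fj] = s / cols[i][-1]
            else:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                cols[j][i - fj] = math.sqrt(s)


def profile_solve(first, cols, b):
    """Solve A x = b given the factor produced by profile_cholesky."""
    n = len(cols)
    y = b[:]
    # Forward substitution with U^T.
    for j in range(n):
        fj = first[j]
        for k in range(fj, j):
            y[j] -= cols[j][k - fj] * y[k]
        y[j] /= cols[j][-1]
    # Back substitution with U, one column at a time.
    for j in range(n - 1, -1, -1):
        fj = first[j]
        y[j] /= cols[j][-1]
        for k in range(fj, j):
            y[k] -= cols[j][k - fj] * y[j]
    return y
```

The point of the layout is that the factor never fills in outside the profile, so both storage and work scale with the profile size rather than with N squared. A typical call sequence is first, cols = to_profile(a), then profile_cholesky(first, cols), then profile_solve(first, cols, b).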

Case 3: SEPRAN: 2D thermal convection using a finite element approach. Both the penalty function method (direct solution) and integrated methods (with CG-style iterative techniques) are tested.

Case 4: VASP: computation of vibrational spectrum of MgO and perovskite.

Compilers used:
SGI: f77 = MIPS f77 compiler;
PC: g77 = gnu fortran (2.96?); pgf77 = Portland Group Fortran compiler; ifc = Intel compilers
IBM: xlf
Sun: f77

We use optimized BLAS (when indicated) from the ATLAS project (-latlas) or the libraries developed by Kazushige Goto (-lgoto). On the 64-bit Opteron we also tried the AMD Core Math Library (ACML), which is suggested to provide optimized BLAS.

3. Results.

Notes:
* -fETC means the combination of -ffast-math -fexpensive-optimizations -funroll-all-loops
** -latlas means use of optimized BLAS routines from Atlas project (www.netlib.org/atlas)
*** same, but now compiled with the 3Dnow technology
**** -all for Intel is: -O3 -unroll -vec -pad -axM -ipo
-latlasP4SSE2: precompiled blas from the Atlas project (www.netlib.org/atlas) for the Pentium 4.
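
The "improvement" figures in the tables that follow appear to be the reference time on beatrice divided by the time on the machine in question, so values above 1 mean faster than beatrice. That reading is inferred from the listed numbers rather than stated anywhere, and the helper names below are made up for this small Python sketch.

```python
def improvement(reference_seconds, machine_seconds):
    """Speed-up relative to the reference machine (values above 1 are faster)."""
    if machine_seconds <= 0:
        raise ValueError("CPU time must be positive")
    return reference_seconds / machine_seconds


# CPU times (s) for Case 2, copied from section 3b of this note.
BEATRICE_CASE2 = 4.92
CASE2_TIMES = {
    "sanglier, ifc -all": 2.78,
    "dualin, g77 -O3": 2.31,
    "spe17, xlf -O4 -lessl": 1.9,
}


def improvement_table(reference_seconds, times):
    """Return {label: improvement factor rounded to two decimals}."""
    return {
        label: round(improvement(reference_seconds, seconds), 2)
        for label, seconds in times.items()
    }
```

For example, improvement_table(BEATRICE_CASE2, CASE2_TIMES) gives about 1.77 for sanglier and 2.13 for dualin, matching the listed factors. A few listed factors differ from this ratio in the last digit, so treat them as approximate.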

3a. Case 1 Linpack. NxN matrix with N=300, 1000, 2000. User supplied blas unless indicated.
machine      compiler                cpu time (s)               improvement over best beatrice
                                 N= 300   1000   2000  
beatrice  f77 -O3 -64              0.23  11.06  104.7           1.00
          f77 -O3 -64 -lblas       0.23  10.90  105.4           0.99

panoramix g77 -O                   0.4   17.4   159.2       
          g77 -O -fETC*            0.28  15.8   127.1        
          ifc -all****             0.27  12.7   115.0 
          ifc -all -tpp6           0.26  12.7   114.8           0.91

toutatis  g77 -O3                  0.44  22.8   190.6                
          g77 -O2 -fETC*           0.28  15.6   127             0.82
          pgf77 -fast -O2          0.29  15.3   127.6          
          g77 -O                   0.38  20.8   170             

gemini    g77 -O3                  0.24  15.4   125.5            
          g77 -O2 -fETC*           0.21  12.8   104.7             
          g77 -O -fETC*            0.21  12.8   104.0           
          pgf77 -fast -O2          0.16  10.4    81             1.30    
          pgf77 -fast -lblas       0.2   12.5   100.7                

sanglier  g77 -O                   0.20  10.0   82.3
          g77 -O -atlas            0.22  10.3   83
          g77 -O -fETC             0.17   7.8   63.4            1.67
          ifc -O3 -atlas           0.22  10.4   83.9
          ifc -O3 -vec             0.17   7.8   63.5            1.67

dualin    g77 -O3                  0.15   6.6    52.9           2.00     

idefix    f77 -fast                0.3   19.8   160.4           0.65    

dogmatix  f77 -fast                0.3   16.9   137.4           0.76 

valkyrie  f77 -fast                0.14  14.5   124.2           0.85 

geowall   f77 -fast                0.28  14.7   119.46          0.88     

big sun   f77 -fast                0.1    3.9    82.4           1.30

new sanglier
	  ifc -O3 -axW -tpp7 -vec \
	      -unroll -mp          0.07   4.3    38.8
	  ifc -O3 -axW -tpp7 -vec \
	      -unroll -mp -ipo     0.07   4.3    38.8
	  g77.3.1 -O3              0.07   4.2    38.5

scott_amd
          g77.2.96.1 -O3           0.15   6.5    50.2
	  ifc -O3                  0.14   6.0    44.0           2.38
	  ifc -O3 -axW -vec \      0.16   7.0    52.7
	      -ipo -mp
01/2003
starscream 
          g77.3.2 -O3              0.06   3.3    28.1
          g77.3.2 -O3 -atlas       0.05   3.5    27.0
          ifc7 -O3  (-mp)          0.06   3.9    30.8               
          ifc7 -O3 -atlas (-mp)    0.06   3.9    30.9

optimusprime
          g77.3.2 -O3              0.13   6.2    50.3
          g77.3.2 -O3 -atlas       0.12   5.0    40.1

jetfire
	  g77.3.2  -O3             0.11   5.0    39.9
          g77.3.2 -O3 -atlas       0.12   4.4    33.9

helmholtz
          ifc -O3 -axW -vec \
               -ipo -mp            0.05   3.1    23.3

hotamd    g77.3.3.2 -O             0.36   2.8    26.7
          g77.3.3.2 -atlas         0.48   2.6    19.3  (atlas3.5.6: HAMMER64SSE2_2)
          g77.3.3.2 -goto          0.23   3.3    24.9


All machines without X 04/2003:
                                     N=  300   1000    2000
starscream
          g77.3.2 -O3                   0.06   3.87    28.1
          g77.3.2 -O3 -atlas            0.05   3.55    27.6
          ifc7 -O3 (-mp)
          ifc7 -O3 -atlas (-mp)
trans01
          g77.3.2 -O3 -fetc             0.06   4.06    27.6
          g77.3.2 -O3 -atlas            0.06   4.08    27.8  ?
          ifc -O3 (-mp)                 0.06   4.06    27.6
          ifc -O3 -atlas (-mp)          0.05   3.75    26.8
trans02
          g77.3.2 -O3 -fetc             0.08   3.68    29.5
          g77.3.2 -O3 -atlas            0.06   3.73    29.6  ?
          ifc -O3 (-mp)                 0.06   3.76    29.6
          ifc -O3 -atlas (-mp)          0.05   3.53    29.4  ?
trans4
          g77.2.96                      0.04   1.95   15.21
          g77.2.96 -atlas               0.04   1.90    15.8
04/2004:
suske (X)
          g77.3.2 -O3                   0.03   1.70    14.1
          g77.3.2 -O3 -latlas           0.04   1.24    10.1
          g77.3.2 -O3 -lgoto            0.03   1.68    14.0
compute-0-0 (no X)
          pgf77 -O3                     0.06   1.96    14.7
          pgf77 -O3 -lgoto              0.04   1.94    14.7
          pgf77 -O3 -lblas              0.04   1.94    18.9  (why?)
08/2005:
sanglier-3 (with X)
          pgf77 -fastsse -O3 -latlas    0.02   1.36    11.5  (Linux_P4SSE2)
          g77 -O3 -latlas               0.03   1.49    11.6
12/2005
kawamori (no X)
          pgf77 -O3 -fastsse -lblas    0.025   1.71    13.1  (PGI supplied blas)
          pgf77 -O3 -fastsse -lacml    0.025   1.72    12.8  (PGI supplied acml)
          pgf77 -O3                    0.025   1.56   12.43
          pgf77 -O3 -latlas            0.030   1.24    9.97  (Linux_HAMMER64SSE2)
          pgf77 -O3 -fastsse -latlas   0.030   1.25    9.90
          gfortran -O3 -lacml          0.025   1.58   12.58  (acml from AMD)
          gfortran -O3                 0.025   1.54   12.29
          gfortran -O3 -lgoto          0.025   1.39   10.78
          gfortran -O3 -latlas         0.025   1.56    9.96
          pathscale -O3                0.025   1.59   12.33
          pathscale -Ofast             0.025   1.58   12.29
          pathscale -Ofast -lacml      0.025   1.59   12.69  (AMD supplied)
          pathscale -O3 -lgoto         0.019   1.39   10.80
          pathscale -O3 -latlas        0.030   1.25    9.89
08/2007
miyazaki (single dual core Opteron 1.8GHz 8 Gbyte DDRAM)
          gfortran -O3                  0.03   2.10    16.6
          gfortran -O3 -atlas           0.04   1.49   13.03
          gfortran -O3 -goto            0.02    2.0    15.4
          pgf95 -O3                     0.03   2.14    16.9
          pgf95 -O3 -atlas              0.04   1.51    13.0
          pgf95 -O3 -goto               0.02   1.87    14.4
nyx240 BEST: 14.4
          pgi -O3                      0.034   2.59   22.35
          pgi -O3 -atlas               0.048   1.72    14.4
          pgi -O3 -goto                0.025   1.98    16.1
nyxxeon BEST: 9.0
          pgi -O3                      0.085   0.72    9.41
          pgi -O3 -atlas               0.015   0.71    9.05
          pgi -O3 -acml                0.013   0.85    9.86
          pgi -O3 -goto                0.058   0.65    9.05
          ifort -O3                    0.089   0.74    9.28
          ifort -O3 -fast              0.063   0.68    9.15
          ifort -O3 -atlas             0.098   0.71    9.01
nyx2218 BEST: 7.0
          pgi -O3
          pgi -O3 -acml                0.018   1.40   10.92
          pgi -O3 -atlas               0.020   0.86    7.00
          pgi -O3 -goto                0.013   1.16    8.92
nyx2220 BEST: 7.0
          pgi -O3                      0.016   1.32   10.71
          pgi -O3 -acml                0.017   1.27   10.31
          pgi -O3 -atlas               0.017   0.82    7.06  (!)
          pgi -O3 -goto                0.012   1.19    9.62
06/2008
legato: BEST 6.71
          ifc -O3                      0.017   1.24    9.80
          ifc -O3 -acml                0.017   1.22    9.76
          ifc -O3 -goto                0.012   1.04    8.00
          ifc -O3 -atlas               0.021   0.86    6.78
          pgi -O3 -fastsse             0.016   1.27   10.03
          pgi -O3 .. -acml             0.018   1.24    9.80
          pgi -O3 -atlas1              0.021   0.83    6.68  using default atlas (rpm)
          pgi -O3 -atlas2              0.016   1.04    8.04  using 'tuned' atlas (/opt/packages)
          gfortran -O3                 0.017   1.23    9.91
          gfortran -O3 -atlas1         0.021   0.86    6.71  <
          gfortran -O3 -atlas2         0.024   0.84    6.71
          gfortran -O3 -goto           0.013   1.00    7.74

3b. Case 2 Profile.

machine      compiler                cpu time (s)  improvement over beatrice

beatrice  f77 -O3 -64 -lblas           4.92          1.00
          
panoramix pgf77 -fast -O3 -lblas       5.13          
          pgf77 -fast -O3 -latlas**    4.73          1.04x
          g77 -O                       5.13          
          ifc -O3                      4.23
          ifc -all -tpp6               4.20

toutatis  pgf77 -fast -O -latlas**     4.22          
          pgf77 -fast -O3 -lblas       5.13          
          g77 -O                       5.07          1.17x
          g77 -O -latlas**             4.30         

gemini    pgf77 -fast -O3 -lblas       3.68          
          pgf77 -fast -O3 -latlas**    2.90         
          g77 -O                       2.87         
          g77 -O3 -fETC*               3.83        
          g77 -O2 -fETC*               3.79        
          g77 -O -latlas**             2.88          1.71x

sanglier  g77 -O -latlas***            2.87          
          ifc -all                     2.78          1.77x
          ifc -all -latlas             2.89

dualin    g77 -O3                      2.31          2.13x


spe04     xlf -O4 -lblas              12.0           0.41x
          xlf -O4 -lessl              12.8

spe17     xlf -lblas                   2.7           
          xlf -O4 -lblas               2.2          
          xlf -O4 -lessl               1.9           2.60x

Using profile solver DSKFS from ESSL instead of PROFGG:
spe17     xlf -O4 -lessl               0.78          6.30x

virgilio  f77 -O3 -G0 -static          3.59          1.37x

dogmatix  f77 -O3                      8.23         
          f77 -fast                    5.43          0.91x
 
idefix    f77 -fast                    6.06          0.81x

valkyrie  f77 -fast                    3.28          1.50x

geowall   f77 -fast                    5.00          1.02x

big sun   f77 -fast                    2.55          1.92x

new sanglier
          ifc -O3 -ipo -axW -vec       1.00   (floating point inaccurate: 1e-5)
          ifc -O3 -ipo -axW -vec -mp1  1.08   (floating point inaccurate: 1e-5)
	  ifc -O3 -mp                  1.35   (accurate to 7 digits)
	  ifc -O3 -ipo -axW -vec -mp   1.28   (accurate to 7 digits)
	  ifc -O3 -ipo -axW -vec \
	      -unroll -tpp7 -mp        1.02   (accurate to 7 digits)
          ifc -O3etc -latlasP4SSE2     1.00   (accurate to 7 digits)
          g77.2.96.1 -O3 -fetc         1.21   (accurate to 7 digits)
          g77.2.96.1 -O3               1.12   (accurate to 7 digits)
	  g77.2.96.1 -O3 -latlasP4SSE2 1.12   (accurate to 7 digits)
          g77.3.1 -O3                  1.05   (accurate to 7 digits)  4.68
	  g77.3.1 -O3 -latlasP4SSE2    1.06   (accurate to 7 digits)

scott_amd
          f77.2.96.1 -O3 -fetc         2.49   (accurate to 7 digits)
	  ifc -axW -vec -ipo -O3 -mp   2.40   (accurate to 7 digits)  2.05
          ifc -axW -vec -ipo -O3       1.91   (1e-5 inaccuracy)
01/2003
starscream
          f77.3.2 -fetc                1.04
          f77.3.2 -O3 -atlas           0.92
optimusprime 
          f77.3.2 -fetc                2.67
          f77.3.2 -O3 -atlas           2.17
jetfire 
          f77.3.2 -fetc                2.1
          f77.3.2 -O3 -atlas           1.7
04/2003
helmholtz  
	  ifc -axW -vec -ipo -O3 -mp   1.17   (accurate to 7 digits)  
          ifc -axW -vec -ipo -O3       0.85   (1e-5 inaccuracy)

hotamd    g77.3.3.2 -fetc              0.82
          g77.3.3.2 -fetc -atlas       0.63

All machines without X 04/2003:
starscream
          g77.3.2 -fetc                0.97
          g77.3.2 -fetc -atlas         0.86
          ifc7 -O3 -vec..              0.88   (imprecise)
          ifc7 -mp -O3                 1.14
          ifc7 -mp -O3 -atlas          0.90
trans01
          g77.3.2 -fetc                1.00
          g77.3.2 -fetc -atlas         0.86
          ifc7 -O3 -vec..              0.85   (imprecise)
          ifc7 -mp -O3                 1.11
          ifc7 -mp -O3 -atlas          0.86
trans02
          g77.3.2 -fetc                0.99
          g77.3.2 -fetc -atlas         0.94
          ifc7 -O3 -vec..              0.89   (imprecise)
          ifc7 -mp -O3                 1.16
          ifc7 -mp -O3 -atlas          0.91
trans4
          g77.3.1 -goto                0.70
          g77.3.1 -atlas               0.85
          g77.3.1                      0.82
04/2004
suske
          g77.3.2.3 -fetc              0.64
          g77.3.2.3 -O3 -latlas        0.48          10.3x
          g77.3.2.3 -O3 -lgoto         0.44          11.2x
compute-0-0 (no X)
          pgf77 -fastsse               1.01
          pgf77 -O3 -latlas            0.81
          pgf77 -O3 -lgoto             0.63
08/2005
sanglier-3 (with X) BEST: 0.49
          gcc -O3                      0.60
          gcc -O3 -fetc                0.65
          gcc -O3 -atlas               0.54
          pgf77 -fastsse -O3 -atlas    0.49
12/2005
kawamori (no X) BEST: 0.43
          gfortran -O3 -latlas         0.50
          gfortran -O3 -lacml          0.65   (AMD provided)
          gfortran -O3 -lgoto          0.43
          pgf77 -fastsse -O3 -atlas    0.52
          pgf77 -fastsse -O3 -lblas    0.69
          pgf77 -fastsse -O3 -lacml    0.68   (PGI provided)
          pgf77 -fastsse -O3 -lgoto    0.43
          pathscale -Ofast             0.64
          pathscale -O2 -lacml         0.64   (AMD provided)
          pathscale -O2 -latlas        0.51
          pathscale -O2 -lgoto         0.44
08/2007
nyx240
          pgi -O3                      0.89
          pgi -O3 -acml                0.89
          pgi -O3 -atlas               0.68
nyxxeon BEST: 0.22
          pgi -O3                      0.34
          pgi -O3 -acml                0.39
          pgi -O3 -atlas               0.29
          pgi -O3 -goto                0.22
          ifort -O3 -fast              0.28
          ifort -O3 -fast -acml        0.42
          ifort -O3 -fast -atlas       0.30
          ifort -O3 -fast -goto        0.23
nyx2218 BEST: 0.31
          pgi -O3                      0.39
          pgi -O3 -acml                0.40
          pgi -O3 -atlas               0.46
          pgi -O3 -goto                0.31
nyx2220 BEST: 0.29
          pgi -O3                      0.35
          pgi -O3 -acml                0.37
          pgi -O3 -atlas               0.42
          pgi -O3 -goto                0.29
06/2008
legato
          ifort -O3                    0.45
          ifort -O3 -atlas             0.34
          ifort -O3 -goto              0.31
          pgi -O3                      0.38
          pgi -O3 -acml                0.38
          pgi -O3 -atlas               0.33
          pgi -O3 -goto                0.30
          gfortran -O3                 0.46
          gfortran -atlas              0.35
          gfortran -goto               0.30
11/2010
flux
          ifort -O3                    0.30
          ifort -fast                  0.29
          ifort -fast -goto            0.23
          ifort -O3 -mkl               0.22
          ifort -fast -mkl             0.20

3c. Sepran.

Notes:
900_4: direct solution technique, penalty function, quadratic triangles
901_7: iterative solution technique, integrated method, quadratic triangles

                                  900_4_CPU (s)   901_7_CPU (s)
machine   compiler                20x20  60x60    20x20  40x40
beatrice  f77 -O3 -64 -lblas      4.38   84.79    10.15   90.3
panoramix g77 -O                  4.07  110.41    13.68  111.4
toutatis  g77 -O                  3.62            12.91
gemini    g77 -O                  2.14   89.57     8.89   70.8    1.28
sanglier  g77 -O -latlas**        1.76   42.4      7.43   56.3
          ifc -all                1.07   34.6      4.64   39.5    2.29
new_sanglier   
	  ifc -O1 -mp -zero -save \
            -axW -tpp7            0.84   16.6      2.25   16.8
optimusprime
          gcc3.2 -O3 -fetc -atlas 0.82   29        3.48   27
jetfire
          gcc3.2 -O3 -fetc -atlas 0.81   24        2.99   22.0        

Note-1: g77 on the Athlon with Redhat 7.1 is funky for this finite element
   code. Code appears generally slower and seems to be quite sensitive to 
   where the code is compiled and what flags are used. Perhaps related to the 
   gcc 2.96 problems?  Intel compiler is faster and appears a lot more stable. 
Note-2: Aggressive optimization with the Intel compiler (-O3) without -mp
   creates problems. Turns out -lsvml needs to be added at load time to provide
   optimized math routines....
Note-3: As is witnessed in this and the previous benchmarks, ifc really takes
   some shortcuts with the floating point precision. Sometimes -O1 -mp1 
   looks ok, but in order to get ANSI accuracy you should really use -O1 -mp 
   and take the performance hit. 
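
The note does not say how "accurate to 7 digits" was determined. One plausible way, sketched below in Python with made-up helper names, is to compare a result from an aggressively optimized build against a trusted reference value (for example from a conservative -mp build) and count the digits on which they agree.

```python
import math


def relative_error(value, reference):
    """Relative difference between a computed value and a trusted reference."""
    if reference == 0.0:
        return abs(value)
    return abs(value - reference) / abs(reference)


def matching_digits(value, reference):
    """Approximate number of significant decimal digits on which the two agree."""
    err = relative_error(value, reference)
    if err == 0.0:
        return math.inf
    return -math.log10(err)


def acceptable(value, reference, digits=7):
    """True when the result agrees with the reference to at least `digits` digits."""
    return matching_digits(value, reference) >= digits
```

With these helpers a result that is off by a relative 1e-5 agrees to only about 5 digits and fails the default 7-digit check, which is the kind of difference flagged in the Case 2 table.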

01/2003
starscream
          gcc3.2 -O3 -fetc -atlas 1.3    20.6      2.58   16.3  (install sepran in 7:42)

07/2003
hotamd    gcc -O -atlas           0.71   14.7      2.28   16.9  (install sepran in 5:30)

All machines without X 04/2003:
                                       900_4_CPU (s)    901_7_CPU (s)
machine   compiler                     20x20   60x60    20x20   40x40
starscr
          gcc3.2 -O3 -fetc -atlas       1.27    19.6     2.48    16.0
          gcc3.2 -O3 -fetc              1.25    20.4     2.62    17.4
trans01
          gcc3.2 -O3 -fetc -atlas       1.12    21.3     2.62    18.0  (install sepran in 7:35)
trans02
          gcc3.2 -O3 -fetc -atlas       1.29    19.3     3.05    20.2  (install sepran in 6:27)
          gcc3.2 -O3 -fetc              1.17    22.2     2.97    21.2
          ifc7 -O3 -mp -atlas           1.29    19.0     2.49    16.1
          ifc7 -O3 -mp                  1.39    32.7     2.76    20.9
trans4
          gcc3.1 -O3 -fetc -atlas       1.28   18.57     2.86   18.36  (install sepran in 5:43)
04/2004
suske
          gcc3.3 -O3 -atlas             0.50    10.3     1.35    10.6
08/2005
sanglier-3 (with X)
          pgf90 -fastsse -O3 -atlas     0.72   11.72     1.39    9.22  (install sepran in 9:48)
          gcc3.2.3 -O3 -atlas           0.80   12.05     1.63   10.81
          pgf90 -fastsse -O3 -goto      0.63   10.13     1.22    8.54  (goto_p4_512)
12/2005
kawamori (no X)
          pgf90 -fastsse -O3 -atlas     0.70   11.94     1.47   10.73
          pgf90 -fastsse -O3 -lgoto     0.68   10.22     1.42   10.25
          pathf90 -O3 -OPT:Ofast        0.74   11.00     1.36    9.53
          pathf90 -O3 -latlas           0.56   10.45     1.35   10.02
          pathf90 -O3 -lgoto            0.48    8.69     1.22    9.22  9.8x
08/2007
miyazaki
          pgf90 -O3 -fastsse            0.70   11.70     1.59   11.91
          pgf90 -O3 -fastsse -atlas     0.76   12.43     1.64   12.35
          pgf90 -O3 -fastsse -goto      0.65   10.29     1.50   11.33
nyx240
          pgf90 -O3                     0.93   15.22     2.00   15.35
          pgf90 -O3 -atlas              0.99   16.11     2.05   15.14
nyxxeon
          ifort -O3                     0.24    5.07     0.58    5.10
          ifort -O3 -atlas              0.24    5.48     0.57    5.12
          ifort -O3 -goto               0.24    4.88     0.56    4.93  < (18.3x)
          pgf90 -O3                     0.37    6.14     0.77    6.18
          pgf90 -O3 -atlas              0.41    6.68     0.80    6.33
          pgf90 -O3 -goto               0.38    5.97     0.75    6.04
06/2008
legato
          gnu -O3 -goto                 0.39    6.65     1.01    7.53
          gnu -O3 -atlas1               0.44    7.93     1.08    8.12
          pgi -O3 -fastsse -goto        0.39    6.49     0.91    6.90
sepran0108
          gnu -O3 -goto                 0.43    6.91     1.06     7.5
          pgi -O3 -fastsse -goto        0.46    7.07     0.99    7.17

3d. VASP.

Notes:
MGO: Small crystal (small problem)
Pv: Large perovskite problem

machine   compiler             MgO_CPU       Pv_CPU 
                               (s) (frac) (s)   (frac)
beatrice  f77 -O3 -lblas      24.0   1     503    1
panoramix pgf77 -fast -lblas  36.3   0.66  654    0.77
gemini    pgf77 -fast -lblas  21.2   1.13  587    0.86
          pgf77 -fast -latlas 20.6   1.18  470    1.07

3e. Conman.

Conman convection code by Scott King (scott@eas.purdue.edu). Blankenbach Benchmark 1.
machine          compiler               cpu
beatrice         f77 -O3                115.6
panoramix        g77 -O3 -fetc          113.7    1.02
scott_amd        g77.2.96.1 -O2          48
                 g77.2.96.1 -O2 -fetc    45
		 ifc -O2                 42      2.8
new_sanglier     ifc -O2                 24
                 ifc -O2 -mp             31.0
		 ifc -O2 -mp -axW \
		    -unroll              31.4    
		 gcc3.1 -O2 -fetc        27.4    4.2
01/2003
optimusprime     gcc3.2 -O3 -fetc        45.3
jetfire          gcc3.2 -O3 -fetc        39.2
04/2003
helmholtz        ifc -O2 -mp -axW \
                    -unroll              25.6
01/2003
starscream       gcc3.2 -O3 -fetc        23.2    5.0

08/2003
hotamd           gcc3.2 -O3              25.0  (double precision)


04/2003: all machines without X
starscream       gcc3.2 -O3 -fetc        23.05
                 ifc7 -O3 -mp            27.
trans01          gcc3.2 -O3 -fetc        22.1
                 ifc7 -O3 -mp            25.6
trans02          gcc3.2 -O3 -fetc        23.0
                 ifc7 -O3 -mp            26.7
suske            gcc3.2 -O3              23.3   (double precision)
                 gcc3.2 -O3              17.4   (single precision)  6.6
sanglier-3 (X)   pgf77 -fastsse -O3      11.6
                 g77.3.2.3 -O3 -fetc     13.8
kawamori         gfortran -fetc          17.56  (double precision)
                 pathf90 -O3             21.18  (double precision)
                 gfortran -fetc          17.90  (single precision)
                 pgf90 -O3 -fastsse      13.79  (single precision)
                 pathf90 -O3             11.91  (single precision)  9.7

4. Acknowledgments.

I'd like to thank Hugh Aller, Gerd Steinle-Neumann, Tom Hacker, Abhijit Bose, Andy Caird, and Brock Palen for providing access to their machines for testing and technical support, and Boris Kiefer for setting up the VASP benchmark and Scott King for providing the ConMan examples.