CUDA vs Quad Core Performance Test

December 21, 2010

Fixed. I had some spare cycles during lunch break.

This update supersedes all previous versions, which have been removed (link expires 12/8).

Changes

Updated HT detection code to eliminate false positives on Vista and newer operating systems, and to reduce false positives on pre-Vista operating systems
Reduced settling period to eight seconds between runs and added a two-second spool-up period for the parallel runs

December 22, 2010

The X3470 (ES) that I have is identical to the i7 875K, but with ECC and VT-d plus some other stuff.

Had hoped that the unlocked version would be fully unlocked, but it's not really: I still only get a 9-22x multiplier.

But it easily does 4GHz; with Turbo enabled, 3.8GHz (4.4GHz).

Makes the desktop very snappy, even compared to the i3 @ 4.4GHz.

December 22, 2010

X3470 (ES) @ default 2.93GHz

CPU ID: Intel64 Family 6 Model 30 Stepping 5, Cores: 4, Logical Processors: 8
Matrix Size: 1536x1536

Processing time for CPU (sequential): 33578 ms

Processing time for CPU (parallel HT on): 9511 ms

Processing time for CPU (parallel HT off): 14210 ms

Parallel speed-up: 3.530438, Efficiency: 88.260961% (HT on)

Parallel speed-up: 2.362984, Efficiency: 59.074595% (HT off)

That's better :)

Hmmm, turning HT off makes quite a difference.

Looks like HT can do some actual work.

December 23, 2010

I did major refactoring of the code to see how far I could push the 'envelope' in a managed language like C#; the results are not disappointing. :)

Changes

Now accepts command line options for multiplication algorithm and matrix size.
Totally refactored matrix multiplication code. Program now implements two algorithms -- a naive version using a straightforward i,j,k loop and Strassen's algorithm.
HT off simulation should work more consistently now.
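
For anyone following along, here is a rough Python sketch of the two algorithms named above. It is not the program's C# code (that isn't shown in this thread); the function names and the `cutoff` parameter are made up for illustration, and the fallback to the naive loop on small or odd-sized blocks is just one common way to end the recursion.

```python
def naive_multiply(a, b):
    """Multiply two matrices (lists of rows) with the plain i, j, k loop."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            total = 0.0
            for k in range(m):
                total += a[i][k] * b[k][j]
            c[i][j] = total
    return c


def _add(x, y):
    return [[u + v for u, v in zip(rx, ry)] for rx, ry in zip(x, y)]


def _sub(x, y):
    return [[u - v for u, v in zip(rx, ry)] for rx, ry in zip(x, y)]


def _split(m):
    """Split a square matrix of even size into its four quadrants."""
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])


def strassen_multiply(a, b, cutoff=64):
    """Multiply square matrices using Strassen's seven-product recursion."""
    n = len(a)
    # Small or odd-sized blocks are handed to the naive loop (an assumption
    # of this sketch, not necessarily what the benchmark program does).
    if n <= cutoff or n % 2:
        return naive_multiply(a, b)
    a11, a12, a21, a22 = _split(a)
    b11, b12, b21, b22 = _split(b)
    # Seven recursive products instead of the eight a plain block split needs.
    m1 = strassen_multiply(_add(a11, a22), _add(b11, b22), cutoff)
    m2 = strassen_multiply(_add(a21, a22), b11, cutoff)
    m3 = strassen_multiply(a11, _sub(b12, b22), cutoff)
    m4 = strassen_multiply(a22, _sub(b21, b11), cutoff)
    m5 = strassen_multiply(_add(a11, a12), b22, cutoff)
    m6 = strassen_multiply(_sub(a21, a11), _add(b11, b12), cutoff)
    m7 = strassen_multiply(_sub(a12, a22), _add(b21, b22), cutoff)
    c11 = _add(_sub(_add(m1, m4), m5), m7)
    c12 = _add(m3, m5)
    c21 = _add(m2, m4)
    c22 = _add(_add(_sub(m1, m2), m3), m6)
    # Stitch the quadrants back into one matrix.
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom
```

With a 1536x1536 input, halving stops being possible once the block size reaches 3, so in practice the cutoff decides where the recursion ends.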

For those interested, the latest version of the program can be found here (link expires 12/30). Links to old versions have been removed.

December 24, 2010

Not all cores are created equal. On my i7-950 the second core (CPU 1) is fastest, whereas on both my Core i7-620M and Core 2 Duo E8500 the first core (CPU 0) is fastest.

Changes

Sequential computation is now performed on each available core. The best runtime is used for the speed-up and efficiency calculations.
Added a dry run prior to the timed runs to negate any ramp-up associated with dynamic overclocking.
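
The speed-up and efficiency figures can be reproduced by hand. Below is a minimal Python sketch, assuming (as the numbers posted in this thread suggest) that speed-up is the best sequential time divided by the parallel time, and that efficiency is speed-up divided by the number of physical cores. The function name is hypothetical and not taken from the program.

```python
def speedup_and_efficiency(sequential_ms, parallel_ms, physical_cores):
    """Return (speed-up, efficiency in %) from per-core sequential timings."""
    best_sequential = min(sequential_ms)  # fastest single-core run
    speedup = best_sequential / parallel_ms
    efficiency = 100.0 * speedup / physical_cores
    return speedup, efficiency


# Timings (ms) copied from the Core i5 460M result posted in this thread.
s, e = speedup_and_efficiency([33141.6495, 33882.3437], 15413.6534, 2)
print(f"Parallel speed-up: {s:.6f}, Efficiency: {e:.6f}% (HT on)")
```

Because efficiency is measured against physical cores, a run with HT on can come out above 100%.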

For those interested, the latest version of the program can be found here (link expires 12/31). Links to old versions have been removed.

Merry Christmas and enjoy! :)

December 25, 2010

Core i5 460M

CPU ID: Intel64 Family 6 Model 37 Stepping 5, Cores: 2, Logical Processors: 4
Algorithm: Strassen's, Matrix Size: 1536x1536

Processing time for CPU 0 (sequential): 33141.6495 ms

Processing time for CPU 1 (sequential): 33882.3437 ms

Processing time for CPU * (parallel HT on ): 15413.6534 ms

Processing time for CPU * (parallel HT off): 18159.8282 ms

Parallel speed-up: 2.150149, Efficiency: 107.507444% (HT on)

Parallel speed-up: 1.824998, Efficiency: 91.249898% (HT off)

December 28, 2010

Changes

Added CUDA Toolkit 3.2 integration (finally)!
CUDA matrix multiplications are performed on each available GPU using the CUBLAS optimized library.

For those interested, the latest version of the program can be found here (link expires 1/4). Links to old versions have been removed.

December 29, 2010

Very nice app now :)

Xeon X3470 @ 2.93GHz

CPU ID: Intel64 Family 6 Model 30 Stepping 5, Cores: 4, Logical Processors: 8
CUDA GPU: GeForce GTX 470, Cores: 448

Algorithm: Strassen's, Matrix Size: 1536x1536

Processing time for CPU 0 (sequential): 8445.7032 ms

Processing time for CPU 1 (sequential): 8177.6437 ms

Processing time for CPU 2 (sequential): 8141.7464 ms

Processing time for CPU 3 (sequential): 8193.7191 ms

Processing time for CPU * (parallel HT): 2202.4177 ms

Processing time for CPU * (parallel !HT): 2304.8124 ms

Processing time for GPU 0 (CUBLAS): 25.3861 ms

Parallel speed-up: 3.696731, Efficiency: 92.418282% ( HT)

Parallel speed-up: 3.532499, Efficiency: 88.312463% (!HT)

CUDA GPU speed-up: 320.716707

December 30, 2010

Thanks. All I did was take what Bill started and pursue it down the path of whatever tickled my fancy. Hats off to Bill for starting this thread. :)

Time permitting, I'm looking at SSE SIMD instructions and OpenCL for upcoming versions. This should get us really close to the limits of CPU performance and cater to the AMD Radeon owners out there.

December 30, 2010

Yes, Bill, the creator of all this, does need kudos too :)

He's too busy playing with his Panda box; we are neglected children.

December 30, 2010

I haven't spent too much time with it, but I have built it into a case of sorts, so it actually resembles a tablet now. (Maybe I'll have pictures tomorrow)

However, between driver issues (it's new, so drivers are lacking) and overheating, it needs some work before I can really make use of it. (It needs a fan or heat sink in the enclosed area.)

I also need to find a good touch screen to attach to it, and figure out how to set up a good always-on on-screen keyboard.

I did get OpenMP running on it though. I will post up some numbers later when I get a chance to run more tests.

It's one of the things that runs like it should. (Wi-Fi and Bluetooth work as well.)

I will need to compare it to a Core 2 machine running at 1 GHz.

Between Christmas, birthdays, snow (The world is ending with the snow we are having here.), getting my car fixed multiple times, and driving/sliding all across town in snow/ice every day I've been busier over the break than before it started.

We brought this new sport to Alabama called Yukon Ice Skating.

January 4, 2011

Changes

Totally re-written GPGPU module that uses OpenCL* instead of CUDA
Multi-GPU parallel compute on systems with more than one GPU

[*] I have very limited access to Radeon cards, so it might not work 100% on that architecture. For those brave souls willing to give it a shot, the latest AMD Catalyst Accelerated Parallel Processing (APP) Technology Edition driver is required.

For those interested, the latest version of the program can be found here (link expires 1/10). Links to old versions have been removed.

January 4, 2011

CPU ID: Intel64 Family 6 Model 30 Stepping 5, Cores: 4, Logical Processors: 8
GPGPU: GeForce GTX 470, Compute Units: 14

Algorithm: Strassen's, Matrix Size: 1536x1536

Processing time for CPU 0 (sequential): 7028.4161 ms

Processing time for CPU 1 (sequential): 6803.7261 ms

Processing time for CPU 2 (sequential): 6738.3781 ms

Processing time for CPU 3 (sequential): 6796.9911 ms

Processing time for CPU * (parallel HT): 1837.8873 ms

Processing time for CPU * (parallel !HT): 1917.5827 ms

Processing time for GPU 0 (OpenCL): 51.9128 ms

Parallel speed-up: 3.666372, Efficiency: 91.659294% ( HT)

Parallel speed-up: 3.513996, Efficiency: 87.849902% (!HT)

GPGPU speed-up: 129.801862

I didn't get any of the GPU OpenCL tests; GPU-Z says it's enabled.

But then again, it says PhysX isn't, but it is.

I see the GTX460 has 7+7 compute units and the GTX470 has 14.

January 4, 2011

The GTX460 has 7 compute units. Mine's an SLI system, thus 7 + 7.

The OpenCL results are there - fourth line from the bottom.

Wow, the 470 is 32% faster than the 460 in the OpenCL test. That's impressive. :)

January 4, 2011

Of course it is, I forgot your SLI setup :doh:

I noticed that too.

I'm running the CPU at 3.84GHz (4.3GHz Turbo); this might boost the performance.

The GTX470 is running stock clocks.

I may run some more scenarios, for something to do.

January 4, 2011

CPU default 2.93GHz (133x22, Turbo on)

GPU default 608/837/1215MHz

CPU ID: Intel64 Family 6 Model 30 Stepping 5, Cores: 4, Logical Processors: 8
GPGPU: GeForce GTX 470, Compute Units: 14

Algorithm: Strassen's, Matrix Size: 1536x1536

Processing time for CPU 0 (sequential): 8171.4389 ms

Processing time for CPU 1 (sequential): 8156.402 ms

Processing time for CPU 2 (sequential): 8294.0286 ms

Processing time for CPU 3 (sequential): 8581.3664 ms

Processing time for CPU * (parallel HT): 2269.4997 ms

Processing time for CPU * (parallel !HT): 2317.3234 ms

Processing time for GPU 0 (OpenCL): 55.0925 ms

Parallel speed-up: 3.593921, Efficiency: 89.848018% ( HT)

Parallel speed-up: 3.519751, Efficiency: 87.993782% (!HT)

GPGPU speed-up: 148.049226

CPU overclocked to 3520MHz (160x22, Turbo on)

GPU default 608/837/1215MHz

CPU ID: Intel64 Family 6 Model 30 Stepping 5, Cores: 4, Logical Processors: 8
GPGPU: GeForce GTX 470, Compute Units: 14

Algorithm: Strassen's, Matrix Size: 1536x1536

Processing time for CPU 0 (sequential): 7028.4161 ms

Processing time for CPU 1 (sequential): 6803.7261 ms

Processing time for CPU 2 (sequential): 6738.3781 ms

Processing time for CPU 3 (sequential): 6796.9911 ms

Processing time for CPU * (parallel HT): 1837.8873 ms

Processing time for CPU * (parallel !HT): 1917.5827 ms

Processing time for GPU 0 (OpenCL): 51.9128 ms

Parallel speed-up: 3.666372, Efficiency: 91.659294% ( HT)

Parallel speed-up: 3.513996, Efficiency: 87.849902% (!HT)

GPGPU speed-up: 129.801862

CPU default 2.93GHz

GPU overclocked to 850/975/1700

CPU ID: Intel64 Family 6 Model 30 Stepping 5, Cores: 4, Logical Processors: 8
GPGPU: GeForce GTX 470, Compute Units: 14

Algorithm: Strassen's, Matrix Size: 1536x1536

Processing time for CPU 0 (sequential): 8242.8043 ms

Processing time for CPU 1 (sequential): 8042.2601 ms

Processing time for CPU 2 (sequential): 8056.3077 ms

Processing time for CPU 3 (sequential): 8095.631 ms

Processing time for CPU * (parallel HT): 2187.2142 ms

Processing time for CPU * (parallel !HT): 3161.9652 ms

Processing time for GPU 0 (OpenCL): 42.6647 ms

Parallel speed-up: 3.676942, Efficiency: 91.923554% ( HT)

Parallel speed-up: 2.543437, Efficiency: 63.585931% (!HT)

GPGPU speed-up: 188.499168

CPU overclocked to 3520MHz (160x22, Turbo on)

GPU overclocked to 850/975/1700

CPU ID: Intel64 Family 6 Model 30 Stepping 5, Cores: 4, Logical Processors: 8
GPGPU: GeForce GTX 470, Compute Units: 14

Algorithm: Strassen's, Matrix Size: 1536x1536

Processing time for CPU 0 (sequential): 6649.5392 ms

Processing time for CPU 1 (sequential): 6602.9418 ms

Processing time for CPU 2 (sequential): 6584.4823 ms

Processing time for CPU 3 (sequential): 6614.7873 ms

Processing time for CPU * (parallel HT): 1881.7401 ms

Processing time for CPU * (parallel !HT): 1908.7813 ms

Processing time for GPU 0 (OpenCL): 40.8434 ms

Parallel speed-up: 3.499145, Efficiency: 87.478636% ( HT)

Parallel speed-up: 3.449574, Efficiency: 86.239349% (!HT)

GPGPU speed-up: 161.212884

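So we can conclude that CPU speed makes only a small difference in the OpenCL test: the CPU overclock trims the GPU time by roughly 4 to 6%, while the GPU overclock trims it by roughly 21 to 23%.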
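
To put numbers on that conclusion, here is a small Python sketch that compares the OpenCL times from the four runs above. The timings are copied from those runs; the labels and the helper name `percent_faster` are just illustrative.

```python
# OpenCL times (ms) for GPU 0, copied from the four runs above.
opencl_ms = {
    ("stock CPU", "stock GPU"): 55.0925,
    ("OC CPU", "stock GPU"): 51.9128,
    ("stock CPU", "OC GPU"): 42.6647,
    ("OC CPU", "OC GPU"): 40.8434,
}


def percent_faster(before_ms, after_ms):
    """Percentage reduction in run time going from before_ms to after_ms."""
    return 100.0 * (before_ms - after_ms) / before_ms


# Effect of the CPU overclock, holding the GPU clocks fixed.
for gpu in ("stock GPU", "OC GPU"):
    gain = percent_faster(opencl_ms[("stock CPU", gpu)],
                          opencl_ms[("OC CPU", gpu)])
    print(f"CPU overclock, {gpu}: {gain:.1f}% less time")

# Effect of the GPU overclock, holding the CPU clocks fixed.
for cpu in ("stock CPU", "OC CPU"):
    gain = percent_faster(opencl_ms[(cpu, "stock GPU")],
                          opencl_ms[(cpu, "OC GPU")])
    print(f"GPU overclock, {cpu}: {gain:.1f}% less time")
```

These are single runs, so treat the percentages as ballpark figures rather than precise measurements.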

January 5, 2011

Changes

Vectorized the OpenCL kernel code, which provides a ~35% improvement in GPU performance
Added GPU capability checks to optimize the kernel build
Added GPU constraint checks to detect and prevent resource overrun issues

For those interested, the latest version of the program can be found here (link expires 1/12). Links to old versions have been removed.

January 5, 2011

C:\downloads\mm>matrixmultint
Unhandled Exception: System.IO.FileNotFoundException: Could not load file or assembly 'Cloo, Version=0.8.1.0, Culture=neutral, PublicKeyToken=null' or one of its dependencies. The system cannot find the file specified.
   at matrixmultint.Program.Main(String[] args)

I get a crash when I run it

January 6, 2011

Sorry, my bad. It's been fixed. New version to be posted very shortly.

January 6, 2011

Changes

OpenCL functionality now enumerates and runs on all supported platforms and devices (including CPUs)¹
Since the OpenCL kernel code is vectorized, the corresponding CPU code is SSE optimized and will offer more than twice the performance on the CPU²

[1] To enable OpenCL for CPUs, ATI Stream SDK 2.3 needs to be installed. Install at your own risk since this will override the NVIDIA OpenCL Installable Client Driver (ICD). While I have not run into any issues, I strongly recommend backing up C:/Windows/System32/OpenCL.dll and C:/Windows/SysWOW64/OpenCL.dll prior to installing the Stream SDK.

[2] From the screenshot below, Platform 1 Device 0 is the CPU running vectorized (SSE) code on all eight logical processors. We're looking at a speed-up of 9.128770 over the serial and 2.355655 over the fully parallelized Strassen's implementation!

For those interested, the latest version of the program can be found here (link expires 1/13). Links to old versions have been removed.

January 6, 2011

Not using ATI Stream SDK 2.3

Algorithm: Strassen's, Matrix Size: 1536x1536
OpenCL Platforms: 1, OpenCL Devices: 1

Platform 0 Device 0: GeForce GTX 470, Compute Units: 14

CPU ID: Intel64 Family 6 Model 30 Stepping 5, Cores: 4, Logical Processors: 8

Processing time for CPU 0 (sequential): 6595.9145 ms

Processing time for CPU 1 (sequential): 6569.3299 ms

Processing time for CPU 2 (sequential): 6566.2564 ms

Processing time for CPU 3 (sequential): 6572.6406 ms

Processing time for CPU * (parallel HT): 1829.472 ms

Processing time for CPU * (parallel !HT): 1908.4017 ms

Processing time for Platform 0 Device 0 (OpenCL): 35.427 ms

Parallel speed-up: 3.589154, Efficiency: 89.728845% ( HT)

Parallel speed-up: 3.440710, Efficiency: 86.017745% (!HT)

OpenCL speed-up: 185.346103

Using ATI Stream SDK 2.3

Algorithm: Strassen's, Matrix Size: 1536x1536
OpenCL Platforms: 2, OpenCL Devices: 2

Platform 0 Device 0: GeForce GTX 470, Compute Units: 14

Platform 1 Device 0: Intel® Xeon® CPU X3470 @ 2.93GHz, Compute Units: 8

CPU ID: Intel64 Family 6 Model 30 Stepping 5, Cores: 4, Logical Processors: 8

Processing time for CPU 0 (sequential): 6778.4938 ms

Processing time for CPU 1 (sequential): 6586.0664 ms

Processing time for CPU 2 (sequential): 6570.9909 ms

Processing time for CPU 3 (sequential): 6575.0096 ms

Processing time for CPU * (parallel HT): 1857.3979 ms

Processing time for CPU * (parallel !HT): 1907.4667 ms

Processing time for Platform 0 Device 0 (OpenCL): 34.9568 ms

Processing time for Platform 1 Device 0 (OpenCL): 762.894 ms

Parallel speed-up: 3.537740, Efficiency: 88.443501% ( HT)

Parallel speed-up: 3.444878, Efficiency: 86.121961% (!HT)

OpenCL speed-up: 187.974612

One request, to save copying/pasting and Command Prompting: any chance of a "press any key to exit"?

Then we can run with a double click :)

Other than that, great work

January 6, 2011

Implemented as requested. :)

Same link for the download.

January 6, 2011

Well done.

This is a nice little app to quickly bench CPU and GPU without grabbing huge files.

I like how it stresses all 8 cores.

January 7, 2011

Thought I'd shift this over here

January 8, 2011

Good call.

With Sandy Bridge around the corner, it'll be very interesting to see how well it performs on this test...
