LaptopVideo2Go Forums

CUDA vs Quad Core Performance Test


Bill


Fixed. I had some spare cycles during lunch break.

This update supersedes all previous versions, which have been removed (link expires 12/8).

Changes

  1. Updated HT detection code to eliminate false positives on Vista and newer operating systems, and reduce false positives on pre-Vista operating systems
  2. Reduced the settling period between runs to eight seconds and added a two-second spool-up period for the parallel runs


The X3470 (ES) that I have is identical to the i7 875K but with ECC and VT-d plus some other stuff.

I had hoped the unlocked version would really be unlocked, but it isn't: I still only get a 9-22x multiplier.

But it easily does 4 GHz, or 3.8 GHz (4.4 GHz) with Turbo enabled.

It makes the desktop very snappy, even compared to the i3 @ 4.4 GHz.


X3470 (ES) @ default 2.93Ghz

CPU ID: Intel64 Family 6 Model 30 Stepping 5, Cores: 4, Logical Processors: 8

Matrix Size: 1536x1536

Processing time for CPU (sequential): 33578 ms

Processing time for CPU (parallel HT on): 9511 ms

Processing time for CPU (parallel HT off): 14210 ms

Parallel speed-up: 3.530438, Efficiency: 88.260961% (HT on)

Parallel speed-up: 2.362984, Efficiency: 59.074595% (HT off)

That's better :)

Hmmm, turning HT off makes quite a difference on mine.

Looks like HT can do some actual work
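The speed-up and efficiency lines in the output above appear to follow the usual definitions: speed-up is the sequential time divided by the parallel time, and efficiency is the speed-up divided by the number of physical cores. Here is a minimal Python sketch that recomputes them from the timings in this run. The function names are illustrative and not taken from the program itself.

```python
def speed_up(sequential_ms, parallel_ms):
    """Ratio of sequential to parallel run time."""
    return sequential_ms / parallel_ms


def efficiency(speed, physical_cores):
    """Speed-up as a percentage of the ideal (one unit per physical core)."""
    return 100.0 * speed / physical_cores


# Timings copied from the X3470 run above (milliseconds).
sequential = 33578
parallel = {"HT on": 9511, "HT off": 14210}
cores = 4  # physical cores, not logical processors

for label, parallel_ms in parallel.items():
    s = speed_up(sequential, parallel_ms)
    print(f"Parallel speed-up: {s:.6f}, "
          f"Efficiency: {efficiency(s, cores):.6f}% ({label})")
```

Because the divisor is the physical core count rather than the logical processor count, an HT-on run can report more than 100% efficiency.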


I did major refactoring of the code to see how far I could push the 'envelope' in a managed language like C#; the results are not disappointing. :)

Changes

  1. Now accepts command line options for multiplication algorithm and matrix size.
  2. Totally refactored matrix multiplication code. Program now implements two algorithms -- a naive version using a straightforward i,j,k loop and Strassen's algorithm.
  3. HT off simulation should work more consistently now.
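For readers who have not seen the two algorithms side by side, here is a rough Python sketch of a naive i,j,k multiplication and of Strassen's algorithm for square integer matrices. It is not the program's C# code. The helper names and the `cutoff` parameter (the size below which the recursion falls back to the naive loop) are assumptions made for illustration.

```python
def naive_multiply(a, b):
    """Straightforward i, j, k triple loop for square matrices."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0
            for k in range(n):
                total += a[i][k] * b[k][j]
            c[i][j] = total
    return c


def add(x, y):
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def sub(x, y):
    return [[p - q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def split(m):
    """Split a square matrix with even size into four quadrants."""
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])


def strassen(a, b, cutoff=16):
    """Strassen's algorithm: seven recursive products instead of eight."""
    n = len(a)
    if n <= cutoff or n % 2:
        # Small or odd-sized blocks are handled by the naive loop.
        return naive_multiply(a, b)
    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)
    m1 = strassen(add(a11, a22), add(b11, b22), cutoff)
    m2 = strassen(add(a21, a22), b11, cutoff)
    m3 = strassen(a11, sub(b12, b22), cutoff)
    m4 = strassen(a22, sub(b21, b11), cutoff)
    m5 = strassen(add(a11, a12), b22, cutoff)
    m6 = strassen(sub(a21, a11), add(b11, b12), cutoff)
    m7 = strassen(sub(a12, a22), add(b21, b22), cutoff)
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(add(sub(m1, m2), m3), m6)
    # Stitch the four quadrants back into one matrix.
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(naive_multiply(a, b))
print(strassen(a, b, cutoff=1))
```

Both calls should print the same matrix. On large matrices Strassen's version does fewer multiplications at the cost of extra additions and temporary storage, which is why a cutoff to the naive loop is commonly used for small blocks.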

matrixmultint_i7-950_a1s1536.png

matrixmultint_i7-950_a2s1536.png

For those interested, the latest version of the program can be found here (link expires 12/30). Links to old versions have been removed.


Not all cores are created equal. On my i7-950 the second core (CPU 1) is fastest, whereas on both my Core i7-620M and Core 2 Duo E8500 the first core (CPU 0) is fastest.

Changes

  1. Sequential computation is now performed on each available core. The best runtime is used for the speed-up and efficiency calculations.
  2. Added a dry run prior to the timed runs to negate any ramp-up associated with dynamic overclocking.
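The two changes above amount to a warm-up pass followed by taking the best of several timed runs. Here is a simplified Python sketch of that idea. It does not pin each run to a specific core as the program does, because thread affinity is platform-specific. It simply repeats the workload and keeps the minimum. The names `time_ms` and `best_of_ms` are hypothetical.

```python
import time


def time_ms(work):
    """Run work() once and return the elapsed wall-clock time in ms."""
    start = time.perf_counter()
    work()
    return (time.perf_counter() - start) * 1000.0


def best_of_ms(work, runs):
    """Do one untimed dry run, then return the fastest of `runs` timed runs."""
    work()  # dry run: lets clocks ramp up and caches warm before timing
    return min(time_ms(work) for _ in range(runs))


def workload():
    # Small stand-in workload; the real program multiplies matrices.
    return sum(i * i for i in range(100_000))


print(f"Best sequential time: {best_of_ms(workload, runs=4):.4f} ms")
```

Using the minimum rather than the mean makes the baseline less sensitive to background activity, which matters here because the speed-up and efficiency figures are ratios against that baseline.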

matrixmultint_1.1_i7-950_a1_s1536.png

matrixmultint_1.1_i7-950_a2_s1536.png

For those interested, the latest version of the program can be found here (link expires 12/31). Links to old versions have been removed.

Merry Christmas and enjoy! :)


Core i5 460M

CPU ID: Intel64 Family 6 Model 37 Stepping 5, Cores: 2, Logical Processors: 4

Algorithm: Strassen's, Matrix Size: 1536x1536

Processing time for CPU 0 (sequential): 33141.6495 ms

Processing time for CPU 1 (sequential): 33882.3437 ms

Processing time for CPU * (parallel HT on ): 15413.6534 ms

Processing time for CPU * (parallel HT off): 18159.8282 ms

Parallel speed-up: 2.150149, Efficiency: 107.507444% (HT on)

Parallel speed-up: 1.824998, Efficiency: 91.249898% (HT off)


Changes

  1. Added CUDA Toolkit 3.2 integration (finally)!
  2. CUDA matrix multiplications are performed on each available GPU using the CUBLAS optimized library.

matrixmultint_1.2_i7-950_a1_s1536.png

For those interested, the latest version of the program can be found here (link expires 1/4). Links to old versions have been removed.


Very nice app now :)

Xeon x3470 @ 2.93Ghz

CPU ID: Intel64 Family 6 Model 30 Stepping 5, Cores: 4, Logical Processors: 8

CUDA GPU: GeForce GTX 470, Cores: 448

Algorithm: Strassen's, Matrix Size: 1536x1536

Processing time for CPU 0 (sequential): 8445.7032 ms

Processing time for CPU 1 (sequential): 8177.6437 ms

Processing time for CPU 2 (sequential): 8141.7464 ms

Processing time for CPU 3 (sequential): 8193.7191 ms

Processing time for CPU * (parallel HT): 2202.4177 ms

Processing time for CPU * (parallel !HT): 2304.8124 ms

Processing time for GPU 0 (CUBLAS): 25.3861 ms

Parallel speed-up: 3.696731, Efficiency: 92.418282% ( HT)

Parallel speed-up: 3.532499, Efficiency: 88.312463% (!HT)

CUDA GPU speed-up: 320.716707


Thanks. All I did was take what Bill started and pursue it down the path of whatever tickled my fancy. Hats off to Bill for starting this thread. :)

Time permitting, I'm looking at SSE SIMD instructions and OpenCL for upcoming versions. This should get us really close to the limits of CPU performance and cater to the AMD Radeon owners out there.


Yes, Bill, the creator of all this, does need kudos too :)

He's too busy playing with his Panda box; we are neglected children.


I haven't spent too much time with it, but I have built it into a case of sorts, so it actually resembles a tablet now. (Maybe I'll have pictures tomorrow)

However, between driver issues (it's new, so drivers are lacking) and overheating, it needs some work before I can really make use of it. (It needs a fan or heat sink in the enclosed area.)

I also need to find a good touch screen to attach to it, and I will need to figure out how to set up a good always-on on-screen keyboard.

I did get OpenMP running on it though. I will post up some numbers later when I get a chance to run more tests.

It's one of the things that runs like it should. (Wi-Fi and Bluetooth work as well.)

I will need to compare it to a core 2 machine running at 1 GHz.

Between Christmas, birthdays, snow (The world is ending with the snow we are having here.), getting my car fixed multiple times, and driving/sliding all across town in snow/ice every day I've been busier over the break than before it started.

We brought this new sport to Alabama called Yukon Ice Skating.


Guest ptrein

Changes

  1. Totally re-written GPGPU module that uses OpenCL* instead of CUDA
  2. Multi-GPU parallel compute on systems with more than one GPU

[*] I have very limited access to Radeon cards, so it might not work 100% on that architecture. For those brave souls willing to give it a shot, the latest AMD Catalyst Accelerated Parallel Processing (APP) Technology Edition driver is required.

matrixmultint_1.3_i7-950_a2_s1536.png

For those interested, the latest version of the program can be found here (link expires 1/10). Links to old versions have been removed.


CPU ID: Intel64 Family 6 Model 30 Stepping 5, Cores: 4, Logical Processors: 8

GPGPU: GeForce GTX 470, Compute Units: 14

Algorithm: Strassen's, Matrix Size: 1536x1536

Processing time for CPU 0 (sequential): 7028.4161 ms

Processing time for CPU 1 (sequential): 6803.7261 ms

Processing time for CPU 2 (sequential): 6738.3781 ms

Processing time for CPU 3 (sequential): 6796.9911 ms

Processing time for CPU * (parallel HT): 1837.8873 ms

Processing time for CPU * (parallel !HT): 1917.5827 ms

Processing time for GPU 0 (OpenCL): 51.9128 ms

Parallel speed-up: 3.666372, Efficiency: 91.659294% ( HT)

Parallel speed-up: 3.513996, Efficiency: 87.849902% (!HT)

GPGPU speed-up: 129.801862

I didn't get any of the GPU OpenCL tests, though GPU-Z says OpenCL is enabled.

But then again, it says PhysX isn't enabled, and it is.

I see the GTX460 has 7+7 compute units and the GTX470 has 14.


Guest ptrein

The GTX460 has 7 compute units. Mine's an SLI system, thus 7 + 7.

The OpenCL results are there - fourth line from the bottom.

Wow, the 470 is 32% faster than the 460 in the OpenCL test. That's impressive. :)


Of course it is, I forgot your SLI setup :doh:

I noticed that too.

I'm running the CPU at 3.84 GHz (4.3 GHz Turbo), which might boost the performance.

The GTX470 is running stock clocks.

I may run some more scenarios, for something to do.


CPU default 2.93Ghz (133x22, Turbo on)

GPU default 608/837/1215Mhz

CPU ID: Intel64 Family 6 Model 30 Stepping 5, Cores: 4, Logical Processors: 8

GPGPU: GeForce GTX 470, Compute Units: 14

Algorithm: Strassen's, Matrix Size: 1536x1536

Processing time for CPU 0 (sequential): 8171.4389 ms

Processing time for CPU 1 (sequential): 8156.402 ms

Processing time for CPU 2 (sequential): 8294.0286 ms

Processing time for CPU 3 (sequential): 8581.3664 ms

Processing time for CPU * (parallel HT): 2269.4997 ms

Processing time for CPU * (parallel !HT): 2317.3234 ms

Processing time for GPU 0 (OpenCL): 55.0925 ms

Parallel speed-up: 3.593921, Efficiency: 89.848018% ( HT)

Parallel speed-up: 3.519751, Efficiency: 87.993782% (!HT)

GPGPU speed-up: 148.049226

CPU overclocked to 3520Mhz (160x22, Turbo on)

GPU default 608/837/1215Mhz

CPU ID: Intel64 Family 6 Model 30 Stepping 5, Cores: 4, Logical Processors: 8

GPGPU: GeForce GTX 470, Compute Units: 14

Algorithm: Strassen's, Matrix Size: 1536x1536

Processing time for CPU 0 (sequential): 7028.4161 ms

Processing time for CPU 1 (sequential): 6803.7261 ms

Processing time for CPU 2 (sequential): 6738.3781 ms

Processing time for CPU 3 (sequential): 6796.9911 ms

Processing time for CPU * (parallel HT): 1837.8873 ms

Processing time for CPU * (parallel !HT): 1917.5827 ms

Processing time for GPU 0 (OpenCL): 51.9128 ms

Parallel speed-up: 3.666372, Efficiency: 91.659294% ( HT)

Parallel speed-up: 3.513996, Efficiency: 87.849902% (!HT)

GPGPU speed-up: 129.801862

CPU default 2.93Ghz

GPU Overclocked to 850/975/1700

CPU ID: Intel64 Family 6 Model 30 Stepping 5, Cores: 4, Logical Processors: 8

GPGPU: GeForce GTX 470, Compute Units: 14

Algorithm: Strassen's, Matrix Size: 1536x1536

Processing time for CPU 0 (sequential): 8242.8043 ms

Processing time for CPU 1 (sequential): 8042.2601 ms

Processing time for CPU 2 (sequential): 8056.3077 ms

Processing time for CPU 3 (sequential): 8095.631 ms

Processing time for CPU * (parallel HT): 2187.2142 ms

Processing time for CPU * (parallel !HT): 3161.9652 ms

Processing time for GPU 0 (OpenCL): 42.6647 ms

Parallel speed-up: 3.676942, Efficiency: 91.923554% ( HT)

Parallel speed-up: 2.543437, Efficiency: 63.585931% (!HT)

GPGPU speed-up: 188.499168

CPU Overclocked to 3520Mhz (160x22, Turbo on)

GPU Overclocked to 850/975/1700

CPU ID: Intel64 Family 6 Model 30 Stepping 5, Cores: 4, Logical Processors: 8

GPGPU: GeForce GTX 470, Compute Units: 14

Algorithm: Strassen's, Matrix Size: 1536x1536

Processing time for CPU 0 (sequential): 6649.5392 ms

Processing time for CPU 1 (sequential): 6602.9418 ms

Processing time for CPU 2 (sequential): 6584.4823 ms

Processing time for CPU 3 (sequential): 6614.7873 ms

Processing time for CPU * (parallel HT): 1881.7401 ms

Processing time for CPU * (parallel !HT): 1908.7813 ms

Processing time for GPU 0 (OpenCL): 40.8434 ms

Parallel speed-up: 3.499145, Efficiency: 87.478636% ( HT)

Parallel speed-up: 3.449574, Efficiency: 86.239349% (!HT)

GPGPU speed-up: 161.212884

From these runs I can conclude that CPU speed makes only a small difference in the OpenCL test.


Guest ptrein

Changes

  1. Vectorized the OpenCL kernel code, which provides a ~35% improvement in GPU performance
  2. Added GPU capability checks to optimize the kernel build
  3. Added GPU constraint checks to detect and prevent resource overrun issues

matrixmultint_1.3.5_i7-950_a2_s1536.png

For those interested, the latest version of the program can be found here (link expires 1/12). Links to old versions have been removed.


C:\downloads\mm>matrixmultint

Unhandled Exception: System.IO.FileNotFoundException: Could not load file or assembly 'Cloo, Version=0.8.1.0, Culture=neutral, PublicKeyToken=null' or one of its dependencies. The system cannot find the file specified.

   at matrixmultint.Program.Main(String[] args)

I get a crash when I run it.


Guest ptrein

Changes

  1. OpenCL functionality now enumerates and runs on all supported platforms and devices (including CPUs) [1]
  2. Since the OpenCL kernel code is vectorized, the corresponding CPU code is SSE optimized and will offer more than twice the performance on the CPU [2]

[1] To enable OpenCL for CPUs, ATI Stream SDK 2.3 needs to be installed. Install at your own risk since this will override the NVIDIA OpenCL Installable Client Driver (ICD). While I have not run into any issues, I strongly recommend backing up C:/Windows/System32/OpenCL.dll and C:/Windows/SysWOW64/OpenCL.dll prior to installing the Stream SDK.

[2] From the screenshot below, Platform 1 Device 0 is the CPU running vectorized (SSE) code on all eight logical processors. We're looking at a speed-up of 9.128770 over the serial and 2.355655 over the fully parallelized Strassen's implementation!

matrixmultint_1.3.6_i7-950_a2_s1536.png

For those interested, the latest version of the program can be found here (link expires 1/13). Links to old versions have been removed.


Not using ATI Stream SDK 2.3

Algorithm: Strassen's, Matrix Size: 1536x1536

OpenCL Platforms: 1, OpenCL Devices: 1

Platform 0 Device 0: GeForce GTX 470, Compute Units: 14

CPU ID: Intel64 Family 6 Model 30 Stepping 5, Cores: 4, Logical Processors: 8

Processing time for CPU 0 (sequential): 6595.9145 ms

Processing time for CPU 1 (sequential): 6569.3299 ms

Processing time for CPU 2 (sequential): 6566.2564 ms

Processing time for CPU 3 (sequential): 6572.6406 ms

Processing time for CPU * (parallel HT): 1829.472 ms

Processing time for CPU * (parallel !HT): 1908.4017 ms

Processing time for Platform 0 Device 0 (OpenCL): 35.427 ms

Parallel speed-up: 3.589154, Efficiency: 89.728845% ( HT)

Parallel speed-up: 3.440710, Efficiency: 86.017745% (!HT)

OpenCL speed-up: 185.346103

Using ATI Stream SDK 2.3

Algorithm: Strassen's, Matrix Size: 1536x1536

OpenCL Platforms: 2, OpenCL Devices: 2

Platform 0 Device 0: GeForce GTX 470, Compute Units: 14

Platform 1 Device 0: Intel® Xeon® CPU X3470 @ 2.93GHz, Compute Units: 8

CPU ID: Intel64 Family 6 Model 30 Stepping 5, Cores: 4, Logical Processors: 8

Processing time for CPU 0 (sequential): 6778.4938 ms

Processing time for CPU 1 (sequential): 6586.0664 ms

Processing time for CPU 2 (sequential): 6570.9909 ms

Processing time for CPU 3 (sequential): 6575.0096 ms

Processing time for CPU * (parallel HT): 1857.3979 ms

Processing time for CPU * (parallel !HT): 1907.4667 ms

Processing time for Platform 0 Device 0 (OpenCL): 34.9568 ms

Processing time for Platform 1 Device 0 (OpenCL): 762.894 ms

Parallel speed-up: 3.537740, Efficiency: 88.443501% ( HT)

Parallel speed-up: 3.444878, Efficiency: 86.121961% (!HT)

OpenCL speed-up: 187.974612

One request, to save copying/pasting and opening a Command Prompt: any chance of a "press any key to exit"?

Then we can run with a double click :)

Other than that, great work


Implemented as requested. :)

Same link for the download.

Well done.

This is a nice little app to quickly bench CPU and GPU without grabbing huge files.

I like how it stresses all 8 cores.


Thought I'd shift this over here


