LaptopVideo2Go Forums

CUDA vs Quad Core Performance Test


Bill

Recommended Posts

I just tried changing the clock rate and the program reports the same number as before.

It reads the capabilities from the GPU firmware, and there is a different function to determine the actual clock rate.

(cudaGetDeviceProperties() vs cuDeviceGetAttribute()) The latter should return the actual clock rate.

I am going to try the new function now.

Cool, thanks.

I'm still telling you guys that I'm not trying to optimize this for Core 2's.

No harm, no foul on your part; I was just providing feedback on what I noticed with the CPU results.

I would have to go change something (complicated) in Visual Studio and use the Intel compiler, which I don't have on my Clevo laptop.

While the laptop has a Q6600, this is a standard Visual Studio build, probably with older SSE instruction sets.

I will try adding more SSE to this, though. Maybe I can put in multiple code paths so that if you have SSE4 it gets used; that might speed things up on newer CPUs.

What version of VS do you have on the laptop? VS 2008 and later support SSE4 intrinsic functions (which I believe OpenMP uses) via the /Oi compiler flag.

If I were putting loads of optimizations in this thing, then SSE4 CPUs would be the fastest, as they have floating-point multiply-and-accumulate type instructions. (maybe Core i7 only)

These would do the work in one instruction instead of several. I don't know how many cycles each type of CPU would need for these, though.

I would hope so. We would be taking a step backwards in micro-architecture otherwise. :)

Now my Q9650 might be real fast cause the 12MB cache might be the best for this type of application, or at least the way the code runs on it.

All the AMD CPUs I have seen benched in this thread only have 2 MB cache per 2 cores I think, whereas my Core 2 has 6 MB.

That premise would be true if the program is memory bound...

Also, M$ was lazy and didn't put OpenMP 3.0 support in Visual Studio yet. (2.0 is OLD)

I can probably try compiling the CPU part of the program with GCC or the Intel compiler with OpenMP 3.0.

That would be nice. Version 2.0 was released in 2002. OpenMP 3.0 has tasks and better supports loop and nested parallelism. It would be interesting to see if there are any performance differences between the versions for this particular application.

Maybe the old version of OpenMP has some performance issues. The Itanium super computer that is practically down the street from my apartment even has newer OpenMP.

http://www.asc.edu/html/altix.shtml (we get to use this for class, until Friday anyway)

Do you know what they'll be replacing it with? Hopefully something similar to this! :)



I'm using VS2010 + VS2008 SP1's compilers because it's needed for Parallel Nsight.

Considering these matrices can take up more memory than you have cache, I think that could definitely play a part in the performance. But I don't know how the cache on Intel CPUs works.

We don't cover stuff like that in class.

I think MS will maybe add OpenMP 3.0 in a service pack.

I can probably then compile the C++ part of the code using it. (or Intel's compiler)


True, in instances where the matrix elements cannot fit in the cache it would boil down to locality of reference (where cache design plays a key role); though algorithmic tweaks to how the matrices are multiplied might help improve temporal locality by allowing the program to re-use a block of data more times before it moves out of the cache.

Based on this response on MS Connect, I don't believe OpenMP 3.0 will be coming to VS anytime soon. :)

On a related note, Microsoft has been focusing their efforts on their own multi-processing APIs - the Parallel Patterns Library (for C++) and .NET Parallel Extensions (for their CLR languages). On inspection there are similarities (e.g., tasks) to what's offered in OpenMP 3.0, but as always, it's all in the details.

BTW, thank you for sharing your program with us. :)


device name: GeForce 8600M GT <----- creating CUDA context on this device

device sharedMemPerBlock: 16384

device totalGlobalMem: 251199488

device regsPerBlock: 8192

device warpSize: 32

device memPitch: 2147483647

device maxThreadsPerBlock: 512

device maxThreadsDim[0]: 512

device maxThreadsDim[1]: 512

device maxThreadsDim[2]: 64

device maxGridSize[0]: 65535

device maxGridSize[1]: 65535

device maxGridSize[2]: 1

device totalConstMem: 65536

device major: 1

device minor: 1

device clockRate: 1050000

device textureAlignment: 256

device deviceOverlap: 1

device multiProcessorCount: 4

Total CUDA cores: 32

Processing time for GPU: 140 (ms)

Processing time for CPU 1 thread: 23119 (ms)

Processing time for CPU 2 threads: 17160 (ms)

CPU multithread speedup: 1.347261, efficiency: 67.363052

CPU to GPU time ratio (CUDA Speedup): 122.571426

Note: OC-525/450/1050


Hmmmm, I just installed Win7 x64 on the computer.

Now the max matrix setting is 375; 380+ causes the driver service to stop/restart.

400+ causes the application to crash.

With XP x64 it was 512 (didn't find the exact number; 550 didn't work).

Does Aero eat all that memory?

Win7 also gives a worse score.

A steady 63 ms with the matrix setting at 96, close to twice the XP x64 score, or 4x the best (not overclocked) score.

Used the same driver version, 263.09, for both the WinXP and Win7 x64 tests.


Do you only have like 2 GB system RAM on that machine?

On my machine with 4 GB RAM the 512 multiple setting only used about 850-900MB total system RAM. (it varied)

There is also Intel's new Threaded Building Blocks (TBB) that they are working on.

There is also PThreads, MPI, etc.

It will be interesting to see some competition in this area.

Maybe in a few years I won't use OpenMP anymore, if something else runs better.


1GB on the P4 beast.

Win7 must use RAM more liberally than WinXP x64.

Later on I couldn't even get to 375 without out-of-memory crashes.

Kept an eye on the memory usage; it goes up in steady jumps as the app runs.

If the matrix is kept in system RAM, then that will play a large part in the overall score.

I'll do more testing once the new rig is running

If this darned CPU ever turns up I'll be able to test with 8GB

I'm getting annoyed now; it's all taking very long, and the first thing I ordered is by far the last to arrive.

I've pulled the GTX470 out of the P4.

Been tweaking the case fans on the new case and also changed the fans on the Noctua CPU cooler.

They get the 1900 RPM wind turbines; now a whole heap of air is going to get shifted.


From my experience on the same machine Win7 64 used about 800 MB of RAM on a fresh install when XP 64 used under 300 MB.

With all the drivers and programs it changes a bit but still Win 7 uses a lot more RAM.

Like I said above, if you select the 512 multiple it will try to allocate over 768MB of system RAM.

There are (512*16)*(512*16) items in each matrix.

Each item is 4 bytes (32-bit IEEE floating point), and there are 3 matrices in total: the first, the second, and the result of the multiplication.


  • 2 weeks later...


Just stumbled upon Thrust today. It's basically a C++ template library for CUDA. One really neat feature you might be interested in is that you can switch between CUDA and OpenMP device back-ends with no changes to the source code. Pretty neat stuff indeed...


GTX470

Core i3 3.06Ghz

8GB RAM

Processing time for GPU: 31 (ms)

Processing time for CPU 1 thread: 29391 (ms)

Processing time for CPU 2 threads: 14820 (ms)

CPU multithread speedup: 1.983198, efficiency: 99.159920

CPU to GPU time ratio (CUDA Speedup): 478.064514

Almost perfect scaling for the threads, with the Hyper-Threading helping here.

Processing time for GPU: 47 (ms)

Processing time for CPU 1 thread: 32932 (ms)

Processing time for CPU 4 threads: 15740 (ms)

CPU multithread speedup: 2.092249, efficiency: 52.306229

CPU to GPU time ratio (CUDA Speedup): 334.893616

Running all 4 threads rewards us with very little performance gain; HT is good for making the most of the CPU cores, but the logical cores do very little by themselves.


I too noticed that CPU performance tends to favor the Intel Core micro-architecture (which might indicate that the code/library was optimized for it). For example, here are the single-thread CPU results from three of my systems:

[Core i7 950 / Nehalem / Bloomfield]  Processing time for CPU 1 thread: 24383 (ms)
[Core i7 620M / Nehalem / Arrandale]  Processing time for CPU 1 thread: 41917 (ms)
[Core 2 Duo E8500 / Core / Wolfdale]  Processing time for CPU 1 thread: 14804 (ms)

I'm still perplexed by the discrepancy in CPU run-times (especially the -40% difference from Core to Nehalem; ~ +10% was what I was expecting). To prevent myself from going insane, I decided to craft a straightforward (mostly unoptimized) test program in C# that multiplies two 1536x1536 matrices. The only tweak I performed to the algorithm was to change the order of the loops from i,j,k to i,k,j to improve temporal locality.

Here's what I got. The numbers are closer to what I was expecting (hopefully for AMD owners also).

[Intel Core i7-950 w/ Hyper-Threading disabled]
Matrix Size: 1536x1536, Cores: 4, Logical Processors: 4
Processing time for CPU (sequential): 23748 ms
Processing time for CPU (parallel): 7244 ms
Parallel speed-up: 3.278299, Efficiency: 92.661838%

[Intel Core i7-950]
Matrix Size: 1536x1536, Cores: 4, Logical Processors: 8
Processing time for CPU (sequential): 23952 ms
Processing time for CPU (parallel): 8321 ms
Parallel speed-up: 2.878500, Efficiency: 87.012915%

[Intel Core 2 Duo E8500]
Matrix Size: 1536x1536, Cores: 2, Logical Processors: 2
Processing time for CPU (sequential): 26470 ms
Processing time for CPU (parallel): 14460 ms
Parallel speed-up: 1.830567, Efficiency: 90.744239%

[Intel Core i7-620M]
Matrix Size: 1536x1536, Cores: 2, Logical Processors: 4
Processing time for CPU (sequential): 35717 ms
Processing time for CPU (parallel): 21060 ms
Parallel speed-up: 1.695964, Efficiency: 82.072962%

The test program can be found here (expires 12/23) for those who are interested. Since it uses parallel extensions, Microsoft .NET Framework 4.0 is required (Client Profile or Full doesn't matter). It is interesting to note that enabling Hyper-Threading actually hurt performance in this particular application!

Again, I would like to reiterate that I did this to satisfy my own curiosity. At the end of the day, CUDA performance is head and shoulders above what the best multi-core CPUs can muster (as Bill's program demonstrates), and THAT is a good thing. :)

[MV EDIT] I've attached the file for those that would like to try it after the 23rd :)

matrixmult.zip


Core2Duo 2.16Ghz

Matrix Size: 1536x1536, Cores: 2, Logical Processors: 2

Processing time for CPU (sequential): 57003 ms

Processing time for CPU (parallel): 22761 ms

Parallel speed-up: 2.504415, Efficiency: 120.141045%

Core i3 4.0Ghz

Matrix Size: 1536x1536, Cores: 2, Logical Processors: 4

Processing time for CPU (sequential): 21662 ms

Processing time for CPU (parallel): 14703 ms

Parallel speed-up: 1.473305, Efficiency: 64.250762%

The C2D is very proficient at parallelism; a 4GHz HT i3 is really not all that much faster than a 2.16GHz C2D.


It does look like the Core processors are more efficient than the Nehalem processors for this embarrassingly parallel workload. Here are the results for a dual Xeon L5335 (Core-based) system:

Matrix Size: 1536x1536, Cores: 8, Logical Processors: 8
Processing time for CPU (sequential): 42107 ms
Processing time for CPU (parallel): 5878 ms
Parallel speed-up: 7.163491, Efficiency: 114.720434%


Correction to the efficiency. I'm running a development version of the code and used the wrong variable in the efficiency calculation; I've been mucking with the core and logical processor detection logic (for non-Vista-based systems). The observation on Core versus Nehalem efficiency still stands.

Matrix Size: 1536x1536, Cores: 8, Logical Processors: 8

Processing time for CPU (sequential): 42107 ms

Processing time for CPU (parallel): 5878 ms

Parallel speed-up: 7.163491, Efficiency: 98.331801%


How are you calculating your efficiency?

http://en.wikipedia.org/wiki/Speedup

I am using that equation to calculate it in my programs (just like we do in class)

If you have 8 processors and it is only running 7.16 times as fast on all 8 as it does on one CPU, then your parallel implementation is definitely not 98% efficient.

Today is exciting, I just got my first Cortex A9 computer in, the pandaboard.

http://pandaboard.org/

I will run some OpenMP on it later. (on ubuntu 10.10)

I have taken some pictures and will post them up soon.



I did not use the standard equation; instead of Ep = (T1 / Tp) / p, I used Ep = (T1 - Tp) / (T1 - (T1 / p)) to calculate efficiency. I should have looked it up, my bad. Thanks for pointing it out. I've updated the efficiency equation in the program and have posted the updated version here (the link expires 12/25 due to provider policy); the previous version has been removed.

Updated results based on formula change:

[Intel Core i7-950 w/ Hyper-Threading disabled]

Matrix Size: 1536x1536, Cores: 4, Logical Processors: 4

Processing time for CPU (sequential): 23748 ms

Processing time for CPU (parallel): 7244 ms

Parallel speed-up: 3.278299, Efficiency: 81.957482% (previously 92.661838%)

[Intel Core i7-950]

Matrix Size: 1536x1536, Cores: 4, Logical Processors: 8

Processing time for CPU (sequential): 23952 ms

Processing time for CPU (parallel): 8321 ms

Parallel speed-up: 2.878500, Efficiency: 71.962505% (previously 87.012915%)

[Intel Core 2 Duo E8500]

Matrix Size: 1536x1536, Cores: 2, Logical Processors: 2

Processing time for CPU (sequential): 26470 ms

Processing time for CPU (parallel): 14460 ms

Parallel speed-up: 1.830567, Efficiency: 91.528354% (previously 90.744239%)

[Intel Core i7-620M]

Matrix Size: 1536x1536, Cores: 2, Logical Processors: 4

Processing time for CPU (sequential): 35717 ms

Processing time for CPU (parallel): 21060 ms

Parallel speed-up: 1.695964, Efficiency: 84.798196% (previously 82.072962%)

[Intel Xeon L5335]

Matrix Size: 1536x1536, Cores: 8, Logical Processors: 8

Processing time for CPU (sequential): 42107 ms

Processing time for CPU (parallel): 5878 ms

Parallel speed-up: 7.163491, Efficiency: 89.543637% (previously 98.331801%)

[MV's Core2Duo 2.16Ghz]

Matrix Size: 1536x1536, Cores: 2, Logical Processors: 2

Processing time for CPU (sequential): 57003 ms

Processing time for CPU (parallel): 22761 ms

Parallel speed-up: 2.504415, Efficiency: 125.220772% (previously 120.141045%)

[MV's Core i3 4.0Ghz]

Matrix Size: 1536x1536, Cores: 2, Logical Processors: 4

Processing time for CPU (sequential): 21662 ms

Processing time for CPU (parallel): 14703 ms

Parallel speed-up: 1.473305, Efficiency: 73.665238% (previously 64.250762%)

Sweet! It will be interesting to see how well parallel programs run on the Cortex A9...


Had a Eureka moment. I believe I found one contributing factor to the lower efficiency ratings for Nehalem: Turbo Boost!

The base clock for the i7-950 processor is 133 x 23 = 3059 MHz. With all four cores in use, the clock is boosted to 3059 + 133 x 1 = 3192 MHz. With only one core in use, the clock is boosted to 3059 + 133 x 2 = 3325 MHz.

The base clock for the i7-620M processor is 133 x 20 = 2660 MHz. With both cores in use, the clock is boosted to 2660 + 133 x 3 = 2926 MHz. With only one core in use, the clock is boosted to 2660 + 133 x 5 = 3325 MHz.

Sorry, there's no Turbo Boost for Core i3.

By normalizing the results, the Nehalem numbers look more encouraging.

[Intel Core i7-950 w/ Hyper-Threading disabled]

Matrix Size: 1536x1536, Cores: 4, Logical Processors: 4

Processing time for CPU (sequential): 23748 ms (@ 3325 MHz), 24738 ms (@ 3192 MHz)

Processing time for CPU (parallel): 7244 ms (@ 3192 MHz)

Parallel speed-up: 3.278299, Efficiency: 81.957482%

Parallel speed-up: 3.414964, Efficiency: 85.374103% (normalized)

[Intel Core i7-950]

Matrix Size: 1536x1536, Cores: 4, Logical Processors: 8

Processing time for CPU (sequential): 23952 ms (@ 3325 MHz), 24973 ms (@ 3192 MHz)

Processing time for CPU (parallel): 8321 ms (@ 3192 MHz)

Parallel speed-up: 2.878500, Efficiency: 71.962505%

Parallel speed-up: 3.001201, Efficiency: 75.030044% (normalized)

[Intel Core i7-620M]

Matrix Size: 1536x1536, Cores: 2, Logical Processors: 4

Processing time for CPU (sequential): 35717 ms (@ 3325 MHz), 40588 ms (@ 2926 MHz)

Processing time for CPU (parallel): 21060 ms (@ 2926 MHz)

Parallel speed-up: 1.695964, Efficiency: 84.798196%

Parallel speed-up: 1.927255, Efficiency: 96.362773% (normalized)

Ah, the joys of dynamic overclocking...


I did more research on Hyper-Threading (here's a dated but still really good article on the subject), and found that resource competition is its proverbial Kryptonite. And, guess how many floating-point execution units there are in the Nehalem micro-architecture? One (same as Core).

I changed the matrix type in the test program from double to int, and re-ran the program on my systems. The results speak for themselves.

[Intel Core i7-950 w/ Hyper-Threading disabled]

Matrix Size: 1536x1536, Cores: 4, Logical Processors: 8

Processing time for CPU (sequential): 29638 ms (@ 3325 MHz), 30873 ms (@ 3192 MHz)

Processing time for CPU (parallel): 7912 ms (@ 3192 MHz)

Parallel speed-up: 3.745956, Efficiency: 93.648888%

Parallel speed-up: 3.902047, Efficiency: 97.551188% (normalized)

[Intel Core i7-950]

Matrix Size: 1536x1536, Cores: 4, Logical Processors: 8

Processing time for CPU (sequential): 29577 ms (@ 3325 MHz), 30809 ms (@ 3192 MHz)

Processing time for CPU (parallel): 7748 ms (@ 3192 MHz)

Parallel speed-up: 3.817372, Efficiency: 95.434306%

Parallel speed-up: 3.976381, Efficiency: 99.409525% (normalized)

[Intel Core i7-620M]

Matrix Size: 1536x1536, Cores: 2, Logical Processors: 4

Processing time for CPU (sequential): 36763 ms (@ 3325 MHz), 41776 ms (@ 2926 MHz)

Processing time for CPU (parallel): 19532 ms (@ 2926 MHz)

Parallel speed-up: 1.882193, Efficiency: 94.109666%

Parallel speed-up: 2.138849, Efficiency: 106.942453% (normalized)

[Intel Core 2 Duo E8500]

Matrix Size: 1536x1536, Cores: 2, Logical Processors: 2

Processing time for CPU (sequential): 32556 ms

Processing time for CPU (parallel): 16949 ms

Parallel speed-up: 1.920821, Efficiency: 96.041064%

Hyper-Threading is actually effective for non-floating-point-intensive applications!

For those interested, here's a link to the integer version of the test program (link expires 12/26).


I made a few tweaks to the matrixmultint test program.

Changes

  1. Added CPU Hyper-Threading auto-detection (only for Nehalem processors for now)
  2. If Hyper-Threading support is detected, it will perform an additional round of parallel computation with the threads bound to the physical processors to simulate non-HT operation
  3. Sequential computations are now bound to the first physical processor to prevent the OS scheduler from skewing the results
  4. Added a 10 second settling period between runs

Known Issues

  1. The OS scheduler does not always honor the processor affinity request and might not bind the process to all physical processors during the non-HT computation. Re-running the program (sometimes more than one re-run is necessary) fixes this issue.

Sample results below.

[Intel Core i7-950]
CPU ID: Intel64 Family 6 Model 26 Stepping 5, Cores: 4, Logical Processors: 8
Matrix Size: 1536x1536
Processing time for CPU (sequential): 29817 ms
Processing time for CPU (parallel HT on): 7759 ms
Processing time for CPU (parallel HT off): 7877 ms
Parallel speed-up: 3.842892, Efficiency: 96.072303% (HT on)
Parallel speed-up: 3.785324, Efficiency: 94.633109% (HT off)

[Intel Core i7-620M]
CPU ID: Intel64 Family 6 Model 37 Stepping 2, Cores: 2, Logical Processors: 4
Matrix Size: 1536x1536
Processing time for CPU (sequential): 37945 ms
Processing time for CPU (parallel HT on): 19852 ms
Processing time for CPU (parallel HT off): 22506 ms
Parallel speed-up: 1.911394, Efficiency: 95.569716% (HT on)
Parallel speed-up: 1.685995, Efficiency: 84.299742% (HT off)

For those interested, the updated test program can be found here (link expires 12/25).


Xeon X3470 (ES) @ 4Ghz, 4 cores - 8 threads

Processing time for GPU: 32 (ms)

Processing time for CPU 1 thread: 21906 (ms)

Processing time for CPU 8 threads: 5453 (ms)

CPU multithread speedup: 4.017238, efficiency: 50.215477

CPU to GPU time ratio (CUDA Speedup): 170.406250

X3470 @ 4Ghz 4 cores - 4 threads

Processing time for GPU: 47 (ms)

Processing time for CPU 1 thread: 22109 (ms)

Processing time for CPU 4 threads: 5625 (ms)

CPU multithread speedup: 3.930489, efficiency: 98.262222

CPU to GPU time ratio (CUDA Speedup): 119.680855


Xeon X3470 (ES) @ 4Ghz

Matrix Size: 1536x1536, Cores: 4, Logical Processors: 8

Processing time for CPU (sequential): 29556 ms

Processing time for CPU (parallel): 8236 ms

Parallel speed-up: 3.588635, Efficiency: 96.179005%

My X3470 is a Lynnfield CPU, but it does not get detected as having the HT feature.


BTW, I did some performance monitoring when it ran on my Core i7-950... IXTU couldn't keep up and showed the wrong clock frequency when all logical cores were utilized LOL.

matrixmultint_i7-950_1.png

matrixmultint_i7-950_2.png

matrixmultint_i7-950_3.png


The output looks like it came from an early version of matrixmult... try http://fileshar.es/Ed6q4YP

Still not seeing it as a Nehalem.

X3470 @ 3650Mhz with Turbo enabled (4.3Ghz)

CPU ID: Intel64 Family 6 Model 30 Stepping 5, Cores: 4, Logical Processors: 8

Matrix Size: 1536x1536

Processing time for CPU (sequential): 27330 ms

Processing time for CPU (parallel): 7811 ms

Parallel speed-up: 3.498912, Efficiency: 87.472795%


The X3470 uses the same CPU ID string as the Lynnfield Core i5, which does not support HT. I'll fix the HT detection logic when I get back from work. Look for an update this evening.


