LaptopVideo2Go Forums

CUDA vs Quad Core Performance Test


Bill

So I've been working on my CUDA project for class.

I have this matrix multiplication program that multiplies two 1,536x1,536 matrices three ways: on a single Nvidia GPU, on a single CPU thread, and on 4 CPU threads using OpenMP.
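For readers who have not used OpenMP, here is a minimal sketch of what the multi-threaded CPU side of such a test could look like. It is not the code from the attached program: the function name `matmul_omp` and the row-major flat-array layout are assumptions made for illustration, and the loop is the plain textbook triple loop with no SSE or cache tuning.

```c
#include <assert.h>
#include <stddef.h>

/* C = A * B for square n x n matrices stored row-major in flat float arrays.
   Hypothetical helper for illustration; not the code from the attached program. */
void matmul_omp(const float *a, const float *b, float *c, int n, int num_threads)
{
    int i;

    assert(n > 0 && num_threads > 0);

    /* Each thread gets a share of the rows of C. Rows are independent, so no
       locking is needed. If OpenMP is not enabled at compile time, the pragma
       is skipped and the loop simply runs on one thread. */
#ifdef _OPENMP
    #pragma omp parallel for num_threads(num_threads)
#endif
    for (i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += a[(size_t)i * n + k] * b[(size_t)k * n + j];
            c[(size_t)i * n + j] = sum;
        }
    }
}
```

Calling it with `num_threads` set to 1 and then to 4 on the same inputs gives the two CPU timings to compare. With GCC the OpenMP path needs `-fopenmp`; with Visual Studio it needs `/openmp`.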

This was compiled with Visual Studio 2010 Professional (using the v90 compilers with Nvidia Parallel Nsight 1.5).

Now it is probably possible to use the Intel Compiler with SSE4 to speed up the CPU implementation somewhat, but I'll mess with that later when I have some more time.

I have attached 32-bit and 64-bit Windows builds of the program so you guys can test it out. Right now the interface is very basic. It is hard coded to use 4 threads for the CPU, so if you have a dual core the efficiency won't be what it should, but you will still see what the CUDA speedup is vs your CPU.

The program was targeted at my 2 development machines that both have Intel quad cores and 2 nvidia GPUs. You have to select which GPU to use. (just select one if you only have one)

First up we have an integrated GPU, the 9300. Even with only 16 CUDA cores it manages to beat an overclocked Q9650 running 4 threads by a factor of more than 4. (The final time ratio at the bottom shows the 4-thread CPU time vs the GPU time.)

CUDA9300.png

Next we have my Clevo D901C SLI laptop running a stock speed Q6600. It is only running 1 GPU here, so if I changed the code to run on both GPUs, the GPU time would be cut roughly in half.

With both GPUs, the laptop's SLI system has more than 30 times the floating point power of the Q6600.

CUDAtimeD901C1.png

Next we have my EVGA FTW edition GTX 275 with a speedup of over 90 vs the Q9650 running 4 threads.

CUDAGTX275.png

CUDAGTX275_2.png

I don't know why it was sometimes a lot faster, probably something with the cache.

There are probably some communication times and cache effects affecting these numbers, and the hard coded matrix size might be a little low to see full GPU stress, although it is high enough to see good efficiency on an Intel quad.

Anyway this is just something neat to experiment with. You guys should be able to run the exes no problem hopefully. I will test this on my XP64 machine that doesn't have any dev tools installed.

I might upgrade the UI a bit and upload a new version later.

Edit: As Pieter found out the hard way, you need a GeForce 8 series card or higher or you will get a funny error. (Older cards don't support proper CUDA.)

Also I just tested this on a dual core machine and the performance was much worse than it should be.

I'll need to add that thread option to the command argument.

Edit2: I just added a new version of the program. You can now change the number of threads and the block size. If you choose a large block size it might take a while to run, especially if you hit a memory bottleneck somewhere. I think my Q9650 runs into a bottleneck right around 2048x2048, where the performance and multi-threading efficiency go down dramatically. I have other programs running idle in the background though.

It would be interesting to see where the memory bottleneck is on Core 2 vs Core i7 Machines.

CudaPerformance3.7z

Blacky

Nvidia 280M vs. QX9300 @ 2.93 GHz.

niceop.jpg

mobilenvidia

Now I'm looking forward to CPU arriving even more.

The X3470 with a GTX470 should do rather well with this.

You may also need to adjust the CPU threads up or down.

Nehalem CPUs with HT can go as high as 16 threads.

Bill

I just updated the program.

Keep in mind this is not the most efficient way to run matrix multiplication on an Intel quad core; SSE4.x instructions might improve this somewhat. If I ever get ICC to run correctly it might increase the CPU speed by up to 50%.

The CUDA part was written in a very optimized manner by Nvidia.

Even if the CPU program was optimized, my Q9650 would still get its butt kicked by a 9300 though.

Guest Sensei

This software looks interesting :)

x64 version (x32 seems to crash on my W7 btw)

Select which GPU to run the test on. Enter 1 for the first GPU, etc.

1

Select the number of threads for the CPU test.

2

Select the block multiple for the matrix size. (version one is 96)

(64 for 1024x1024, 96 for 1536x1536, 128 for 2048x2048, etc.)

96

device name: GeForce GTS 250 <----- creating CUDA context on this device

device sharedMemPerBlock: 16384

device totalGlobalMem: 511246336

device regsPerBlock: 8192

device warpSize: 32

device memPitch: 2147483647

device maxThreadsPerBlock: 512

device maxThreadsDim[0]: 512

device maxThreadsDim[1]: 512

device maxThreadsDim[2]: 64

device maxGridSize[0]: 65535

device maxGridSize[1]: 65535

device maxGridSize[2]: 1

device totalConstMem: 65536

device major: 1

device minor: 1

device clockRate: 1836000

device textureAlignment: 256

device deviceOverlap: 1

device multiProcessorCount: 16

Total CUDA cores: 128

Processing time for GPU: 78 (ms)

Processing time for CPU 1 thread: 20748 (ms)

Processing time for CPU 2 threads: 11793 (ms)

CPU multithread speedup: 1.759349, efficiency: 87.967438

CPU to GPU time ratio (CUDA Speedup): 151.192307
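The three summary lines at the end of the output appear to follow from the timings by simple division. The sketch below reproduces that arithmetic as I read it from the numbers posted in this thread; the function names are made up for illustration and are not taken from the program.

```c
#include <assert.h>

/* How many times faster the N-thread CPU run is than the 1-thread run. */
double multithread_speedup(double t1_ms, double tn_ms)
{
    assert(tn_ms > 0.0);
    return t1_ms / tn_ms;
}

/* Parallel efficiency in percent: 100 would mean perfect scaling. */
double multithread_efficiency(double t1_ms, double tn_ms, int threads)
{
    assert(threads > 0);
    return 100.0 * multithread_speedup(t1_ms, tn_ms) / threads;
}

/* "CUDA Speedup": the multi-threaded CPU time divided by the GPU time. */
double cuda_speedup(double cpu_ms, double gpu_ms)
{
    assert(gpu_ms > 0.0);
    return cpu_ms / gpu_ms;
}
```

Plugging in the run above (20748 ms, 11793 ms, 2 threads, 78 ms on the GPU) gives about 1.759, 87.97 and 151.19, which matches the printed values to the precision shown.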

karl_w_w

9400 GT vs Athlon II X2 240 @ 3.45 ghz

Processing time for GPU: 1029 (ms)

Processing time for CPU 1 thread: 64678 (ms)

Processing time for CPU 2 threads: 45802 (ms)

CPU multithread speedup: 1.412122, efficiency: 70.606087

CPU to GPU time ratio (CUDA Speedup): 44.511177

What's with the horrible multithread speedup? Either cos it's an AMD chip, or cos it's 2 cores not 4, or cos of my big OC (0.65 ghz)?

Bill

It could be the combination of the inferior architecture on your chip (compared to a good Core 2 with more cache) and maybe it's throttling with the OC if it's too hot.

You might have a memory bottleneck somewhere (cache or RAM or something) or maybe other things running on your system.

You might be running really large matrix dim to get times that long. (You didn't say which you tested)

Try with 96 block multiple.

Also I think VS has an older version of OpenMP built in (maybe) that could cause a decrease.

I am not putting intel specific code in this program though, so that is not the problem.

mobilenvidia

I've just installed my GTX470 in my son's P4 system.

Installed 263.06 and Cuda Toolkit

I still get the side by side error that I also get with my Laptop.

Something else is needed to run this ?

Would be quite a ratio, P4 vs GTX470, if I can ever get it to run

Bill

I think you need Visual Studio 2008 SP1 redistributable.

Most users here probably have one or more of them from games or other programs they have.

http://www.microsoft.com/downloads/en/details.aspx?familyid=A5C84275-3B97-4AB7-A40D-3802B2AF5FC2&displaylang=en x86

http://www.microsoft.com/downloads/en/details.aspx?familyid=bd2a6171-e2d6-4230-b809-9a8d7548c1b6&displaylang=en x86_x64

Edit: I repackaged the program with more DLLs. I think you were missing the OpenMP ones that are used for the multi-threading.

mobilenvidia

That's better.

Installed VS2k8 on the P4 system and it's working on the CPU score now (taking ages)

Didn't install but ran app with extra DLL's on Laptop and it now runs and exits with no in/output (no error)

I take it it checks for a CUDA GPU before running ?

mobilenvidia

Ran this on:

Pentium 4 @ 3.0Ghz with HT

1GB RAM (dual channel)

GTX470

******Matrix Multiplication Performance Analysis CUDA program*******

based on Nvidia reference program with OpenMP for CPU multithreading

Select which GPU to run the test on. Enter 1 for the first GPU, etc.

1

Select the number of threads for the CPU test.

1

Select the block multiple for the matrix size. (version one is 96)

(64 for 1024x1024, 96 for 1536x1536, 128 for 2048x2048, etc.)

96

device name: GeForce GTX 470 <----- creating CUDA context on this device

device sharedMemPerBlock: 49152

device totalGlobalMem: 1341718528

device regsPerBlock: 32768

device warpSize: 32

device memPitch: 2147483647

device maxThreadsPerBlock: 1024

device maxThreadsDim[0]: 1024

device maxThreadsDim[1]: 1024

device maxThreadsDim[2]: 64

device maxGridSize[0]: 65535

device maxGridSize[1]: 65535

device maxGridSize[2]: 1

device totalConstMem: 65536

device major: 2

device minor: 0

device clockRate: 1215000

device textureAlignment: 512

device deviceOverlap: 1

device multiProcessorCount: 14

Total CUDA cores: 112 should be 448

Processing time for GPU: 47 (ms)

Processing time for CPU 1 thread: 164531 (ms)

Processing time for CPU 1 threads: 164344 (ms)

CPU multithread speedup: 1.001138, efficiency: 100.113785

CPU to GPU time ratio (CUDA Speedup): 3496.680908

Can anybody beat that ratio ? :)

Clocks for GTX470 were default 607/1214/1674Mhz (Core/Shader/Memory)

47ms seems a little slow for GTX470, wonder what will happen when I get a Quad Core 2.93Ghz CPU running this

mobilenvidia
device clockRate: 1700000

device textureAlignment: 512

device deviceOverlap: 1

device multiProcessorCount: 14

Total CUDA cores: 112 should be 448

Processing time for GPU: 15 (ms)

Processing time for CPU 1 thread: 50578 (ms)

Processing time for CPU 1 threads: 51219 (ms)

CPU multithread speedup: 0.987485, efficiency: 98.748512

CPU to GPU time ratio (CUDA Speedup): 3414.600098

This is with the GTX running at 850/1700/2002 MHz (Core/Shader/Mem).

Bill, I noticed that the CUDA core count detected is slightly wrong (i.e. 112 detected * 4 = 448 actual)

Huge jump in speed with this overclock.

The GPU time dropped to nearly 1/3, but so did the CPU time, which I did not overclock, so the ratio stayed about the same when it should have been rather huge.

Bill

The Total Cuda core reporting part was just this code:

totalCudaCores = (deviceProperty.multiProcessorCount*8);
printf("Total CUDA cores: %d \n", totalCudaCores);

So I just hard coded a value of 8.

I found out that device major and device minor are the CUDA compute version.

Currently there are 1.0 (first 8 series), 1.1 (most 8 and 9 series), 1.2 (slower GT200 series), 1.3 (GTX 260-295), 2.0 (first gen Fermi) and 2.1 (newer Fermi cards).

Here is the fixed code.

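/* Cores per multiprocessor depend on the compute version (major.minor). */
if (deviceProperty.major == 1)
{
    /* Compute 1.x: 8 cores per multiprocessor */
    totalCudaCores = (deviceProperty.multiProcessorCount*8);
    printf("Total CUDA cores: %d \n", totalCudaCores);
}
else if (deviceProperty.major == 2)
{
    if (deviceProperty.minor == 0)
    {
        /* Compute 2.0 (first gen Fermi): 32 cores per multiprocessor */
        totalCudaCores = (deviceProperty.multiProcessorCount*32);
        printf("Total CUDA cores: %d \n", totalCudaCores);
    }
    else if (deviceProperty.minor == 1)
    {
        /* Compute 2.1 (newer Fermi): 48 cores per multiprocessor */
        totalCudaCores = (deviceProperty.multiProcessorCount*48);
        printf("Total CUDA cores: %d \n", totalCudaCores);
    }
    else
    {
        /* Unknown 2.x minor version: the two printfs join into "version 2.<minor> was released ..." */
        printf("Total CUDA cores unknown, version 2.");
        printf("%d was released after this software was written.\n", deviceProperty.minor);
    }
}
else
{
    /* Unknown major version: report it instead of guessing a core count */
    printf("Total CUDA cores unknown, version %d.", deviceProperty.major);
    printf("%d was released after this software was written.\n", deviceProperty.minor);
}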

This will have no impact on the actual computation code provided by nvidia.

It was written generically and the CUDA compiler will work its magic on the code to make it run decent on Fermis as well as older GPUs. (hopefully, nvidia is still improving this)

In the if statements you could put switches for different abilities. For example, early hardware supports only 24-bit integer multiply, and 1.3+ supports double precision floating point.
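One way to express such switches is a pair of small predicates on the compute version. This is only a sketch: the function names are hypothetical, the double precision rule is the 1.3+ cutoff mentioned above, and the 32-bit multiply rule (native from 2.0 onward, slower on 1.x) is my reading of Nvidia's programming guide rather than something the program checks.

```c
#include <assert.h>

/* Double precision floating point needs compute version 1.3 or higher. */
int supports_double_precision(int major, int minor)
{
    assert(major >= 1 && minor >= 0);
    return major > 1 || (major == 1 && minor >= 3);
}

/* Assumption: full 32-bit integer multiply is native only on 2.x and later;
   on 1.x hardware the fast path is the 24-bit multiply. */
int supports_native_int32_multiply(int major, int minor)
{
    assert(major >= 1 && minor >= 0);
    return major >= 2;
}
```

In the program these would be called with `deviceProperty.major` and `deviceProperty.minor`, for example to pick between a float and a double code path.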

Pieter you should increase my post upload limit, it won't let me upload a file over 1 MB.

Edit: I updated the program again so if you set threads to 0 it won't compute on the CPU. You can also enter the 3 options at the same time.

Here is my GTX running 8192*8192 Matrices for comparison. With these larger numbers we can compare GPUs better.

GTX275MatrixTest.png
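Times from runs with different matrix sizes are hard to compare directly. A common way to normalize them, which the program itself does not print, is to convert a time into GFLOPS using the roughly 2n^3 floating point operations of a straightforward n x n multiply. The helper below is a hypothetical sketch of that conversion.

```c
#include <assert.h>

/* Approximate GFLOPS for an n x n by n x n multiply that took elapsed_ms.
   Counts n^3 multiplies plus n^3 adds. */
double matmul_gflops(double n, double elapsed_ms)
{
    double flops;

    assert(n > 0.0 && elapsed_ms > 0.0);
    flops = 2.0 * n * n * n;
    /* flops / seconds / 1e9, with seconds = elapsed_ms / 1e3 */
    return flops / (elapsed_ms * 1.0e6);
}
```

For example, the GTS 250 result posted earlier (1536x1536 in 78 ms) works out to roughly 93 GFLOPS by this measure.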

Page 37 of this document gives more detail about what is going on on the GPU side, as well as other in depth info.

http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/NVIDIA_CUDA_C_BestPracticesGuide_3.1.pdf

The max block multiple I can run on my GTX 275 is about 532, which gives 532x16 = 8512, so 8512x8512 matrices.

Anything higher and it runs out of memory and crashes. For 532 I get 5600 ms on my GTX.

The amount of memory allocated for the 3 matrices on the GPU is like this.

(532*16)^2 elements x 3 matrices x 4 bytes per float / 1024 / 1024 = 829.171875 MB out of 896 MB total.

So about 67 MB is reserved for other things on my system (like maybe the dual displays I am running and other code variables)
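The same calculation can be written as a small helper, which makes it easy to check whether a given block multiple should fit on a card before running it. The names are hypothetical; the factor of 16 and the three float matrices come from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Matrix dimension for a block multiple: 64 -> 1024, 96 -> 1536, 128 -> 2048. */
size_t matrix_dim(size_t block_multiple)
{
    assert(block_multiple > 0);
    return block_multiple * 16;
}

/* Bytes needed on the GPU for the three square float matrices A, B and C. */
unsigned long long gpu_matrix_bytes(size_t block_multiple)
{
    unsigned long long dim = matrix_dim(block_multiple);
    return dim * dim * 3ULL * sizeof(float);
}

/* Convert bytes to MB in the 1024*1024 sense used above. */
double bytes_to_mb(unsigned long long bytes)
{
    return (double)bytes / (1024.0 * 1024.0);
}
```

Comparing `gpu_matrix_bytes(multiple)` against the `totalGlobalMem` value the program prints gives a rough upper bound; as noted above, some memory is already in use by the display and driver, so the real limit is a bit lower.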

mobilenvidia
******Matrix Multiplication Performance Analysis CUDA program*******

based on Nvidia reference program with OpenMP for CPU multithreading

Select which GPU to run the test on. Enter 1 for the first GPU, etc.

1

Select the number of threads for the CPU test.

0

Select the block multiple for the matrix size. (version one is 96)

(64 for 1024x1024, 96 for 1536x1536, 128 for 2048x2048, etc.)

512

device name: GeForce GTX 470 <----- creating CUDA context on this device

device sharedMemPerBlock: 49152

device totalGlobalMem: 1341718528

device regsPerBlock: 32768

device warpSize: 32

device memPitch: 2147483647

device maxThreadsPerBlock: 1024

device maxThreadsDim[0]: 1024

device maxThreadsDim[1]: 1024

device maxThreadsDim[2]: 64

device maxGridSize[0]: 65535

device maxGridSize[1]: 65535

device maxGridSize[2]: 1

device totalConstMem: 65536

device major: 2

device minor: 0

device clockRate: 1215000

device textureAlignment: 512

device deviceOverlap: 1

device multiProcessorCount: 14

Total CUDA cores: 448

Processing time for GPU: 12672 (ms)

Cores all fixed, well done Bill

I feel a new Benchmark standard coming on :)

mobilenvidia
******Matrix Multiplication Performance Analysis CUDA program*******

based on Nvidia reference program with OpenMP for CPU multithreading

Select which GPU to run the test on. Enter 1 for the first GPU, etc.

1

Select the number of threads for the CPU test.

0

Select the block multiple for the matrix size. (version one is 96)

(64 for 1024x1024, 96 for 1536x1536, 128 for 2048x2048, etc.)

512

Processing time for GPU: 6328 (ms)

Running it again 1/2 the time taken, caching at play here ?

512+ not working

Max Overclock = 8406ms can't work this out, even after a few goes

Bill

Maybe it went faster the second time because your PC already had main memory reserved or something.

Nvidia's code for timing the GPU included the time to transfer the result back to the PC. (but not the time to send the data to the GPU)

It's probably important to include that in the overall bench time to identify PCI-E and RAM bottlenecks.

A researcher running math or something on the GPU will need to pull the data back into system ram to store on a hard drive or something.

No matter how fast your GPU is, if you have to wait longer to transfer results from it than it took to calculate them, you have a definite bottleneck.
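A rough way to reason about this is to estimate the copy time from the result size and a sustained transfer rate, then compare it with the compute time. The sketch below does only that arithmetic. The function names are hypothetical and the bandwidth is something you would have to measure on your own system; it is an input here, not a known value.

```c
#include <assert.h>

/* Estimated time in ms to copy `bytes` at a sustained rate of `gb_per_s`
   (1 GB = 1e9 bytes here). The rate is an assumption supplied by the caller. */
double transfer_ms(double bytes, double gb_per_s)
{
    assert(bytes >= 0.0 && gb_per_s > 0.0);
    return bytes / (gb_per_s * 1.0e9) * 1.0e3;
}

/* Non-zero when copying the result back takes longer than computing it. */
int is_transfer_bound(double compute_ms, double result_bytes, double gb_per_s)
{
    assert(compute_ms >= 0.0);
    return transfer_ms(result_bytes, gb_per_s) > compute_ms;
}
```

For matrix multiplication the compute work grows with n^3 while the result only grows with n^2, so larger matrices should be less likely to end up transfer bound than small ones.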

Guest ptrein

Core i7 950 + GTX 460 SLI

******Matrix Multiplication Performance Analysis CUDA program*******
based on Nvidia reference program with OpenMP for CPU multithreading

Select which GPU to run the test on. Enter 1 for the first GPU, etc.
2
Select the number of threads for the CPU test.
8
Select the block multiple for the matrix size. (version one is 96)
(64 for 1024x1024, 96 for 1536x1536, 128 for 2048x2048, etc.)
96

device name: GeForce GTX 460
device sharedMemPerBlock: 49152
device totalGlobalMem: 1041694720
device regsPerBlock: 32768
device warpSize: 32
device memPitch: 2147483647
device maxThreadsPerBlock: 1024
device maxThreadsDim[0]: 1024
device maxThreadsDim[1]: 1024
device maxThreadsDim[2]: 64
device maxGridSize[0]: 65535
device maxGridSize[1]: 65535
device maxGridSize[2]: 1
device totalConstMem: 65536
device major: 2
device minor: 1
device clockRate: 810000
device textureAlignment: 512
device deviceOverlap: 1
device multiProcessorCount: 7
Total CUDA cores: 336


device name: GeForce GTX 460    <----- creating CUDA context on this device
device sharedMemPerBlock: 49152
device totalGlobalMem: 1041694720
device regsPerBlock: 32768
device warpSize: 32
device memPitch: 2147483647
device maxThreadsPerBlock: 1024
device maxThreadsDim[0]: 1024
device maxThreadsDim[1]: 1024
device maxThreadsDim[2]: 64
device maxGridSize[0]: 65535
device maxGridSize[1]: 65535
device maxGridSize[2]: 1
device totalConstMem: 65536
device major: 2
device minor: 1
device clockRate: 810000
device textureAlignment: 512
device deviceOverlap: 1
device multiProcessorCount: 7
Total CUDA cores: 336

Processing time for GPU: 62 (ms)
Processing time for CPU 1 thread: 24383 (ms)
Processing time for CPU 8 threads: 6334 (ms)
CPU multithread speedup: 3.849542, efficiency: 48.119278
CPU to GPU time ratio (CUDA Speedup): 102.161293

Hit any key to terminate

Guest ptrein

Core i7 620M + NVS 3100M

******Matrix Multiplication Performance Analysis CUDA program*******
based on Nvidia reference program with OpenMP for CPU multithreading

Select which GPU to run the test on. Enter 1 for the first GPU, etc.
1
Select the number of threads for the CPU test.
4
Select the block multiple for the matrix size. (version one is 96)
(64 for 1024x1024, 96 for 1536x1536, 128 for 2048x2048, etc.)
96

device name: NVS 3100M    <----- creating CUDA context on this device
device sharedMemPerBlock: 16384
device totalGlobalMem: 497549312
device regsPerBlock: 16384
device warpSize: 32
device memPitch: 2147483647
device maxThreadsPerBlock: 512
device maxThreadsDim[0]: 512
device maxThreadsDim[1]: 512
device maxThreadsDim[2]: 64
device maxGridSize[0]: 65535
device maxGridSize[1]: 65535
device maxGridSize[2]: 1
device totalConstMem: 65536
device major: 1
device minor: 2
device clockRate: 1468000
device textureAlignment: 256
device deviceOverlap: 1
device multiProcessorCount: 2
Total CUDA cores: 16

Processing time for GPU: 562 (ms)
Processing time for CPU 1 thread: 41917 (ms)
Processing time for CPU 4 threads: 18673 (ms)
CPU multithread speedup: 2.244792, efficiency: 56.119801
CPU to GPU time ratio (CUDA Speedup): 33.225979

Hit any key to terminate

Guest ptrein

Core 2 Duo E8500 + GTS 450

******Matrix Multiplication Performance Analysis CUDA program*******
based on Nvidia reference program with OpenMP for CPU multithreading

Select which GPU to run the test on. Enter 1 for the first GPU, etc.
1
Select the number of threads for the CPU test.
2
Select the block multiple for the matrix size. (version one is 96)
(64 for 1024x1024, 96 for 1536x1536, 128 for 2048x2048, etc.)
96

device name: GeForce GTS 450    <----- creating CUDA context on this device
device sharedMemPerBlock: 49152
device totalGlobalMem: 1041694720
device regsPerBlock: 32768
device warpSize: 32
device memPitch: 2147483647
device maxThreadsPerBlock: 1024
device maxThreadsDim[0]: 1024
device maxThreadsDim[1]: 1024
device maxThreadsDim[2]: 64
device maxGridSize[0]: 65535
device maxGridSize[1]: 65535
device maxGridSize[2]: 1
device totalConstMem: 65536
device major: 2
device minor: 1
device clockRate: 1850000
device textureAlignment: 512
device deviceOverlap: 1
device multiProcessorCount: 4
Total CUDA cores: 192

Processing time for GPU: 94 (ms)
Processing time for CPU 1 thread: 14804 (ms)
Processing time for CPU 2 threads: 7629 (ms)
CPU multithread speedup: 1.940490, efficiency: 97.024513
CPU to GPU time ratio (CUDA Speedup): 81.159576

Hit any key to terminate

Guest ptrein

karl_w_w said:

    9400 GT vs Athlon II X2 240 @ 3.45 ghz

    Processing time for GPU: 1029 (ms)
    Processing time for CPU 1 thread: 64678 (ms)
    Processing time for CPU 2 threads: 45802 (ms)
    CPU multithread speedup: 1.412122, efficiency: 70.606087
    CPU to GPU time ratio (CUDA Speedup): 44.511177

    What's with the horrible multithread speedup? Either cos it's an AMD chip, or cos it's 2 cores not 4, or cos of my big OC (0.65 ghz)?

I too noticed that CPU performance tends to favor the Intel Core micro-architecture (which might indicate that the code/library was optimized for it). For example, here are the single-thread CPU results from three of my systems:

[Core i7 950 / Nehalem / Bloomfield]  Processing time for CPU 1 thread: 24383 (ms)
[Core i7 620M / Nehalem / Arrandale]  Processing time for CPU 1 thread: 41917 (ms)
[Core 2 Duo E8500 / Core / Wolfdale]  Processing time for CPU 1 thread: 14804 (ms)

Guest ptrein
device name: GeForce GTX 460 <----- creating CUDA context on this device

...

device clockRate: 810000 (should be 1430000)

...

Noticed the shader clock rate for the GTX 460 is off, but correct for the GTS 450 (another CUDA compute capability version 2.1 device).

Bill

I just tried changing the clock rate and the program reports the same number as before.

It reads the capabilities from the GPU firmware, and there is a different function to determine the actual clock rate.

(cudaGetDeviceProperties() vs cuDeviceGetAttribute()) The second should return the actual clock rate.

I am going to try the new function now.

I'm still telling you guys that I'm not trying to optimize this for Core 2's. I would have to go change something (complicated) in visual studio and use the intel compiler, which I don't have on my Clevo laptop.

While the laptop has a Q6600, this is standard visual studio compiling, probably with older SSE sets.

I will try adding more SSE to this though. Maybe I can put multiple code paths in so if you have SSE4 you can use it, that might speed it up on newer CPUs.

If I was putting loads of optimizations in this thing then SSE4 CPUs would be the fastest, as they have floating point multiply and accumulate type instructions. (maybe Core i7 only)

These would do ops in one instruction instead of multiple. I don't know how many cycles each type of CPU would need for these though.

Now my Q9650 might be real fast cause the 12MB cache might be the best for this type of application, or at least the way the code runs on it.

All the AMD CPUs I have seen benched in this thread only have 2 MB cache per 2 cores I think, whereas my Core 2 has 6 MB.

Also M$ was lazy and didn't put OpenMP 3.0 support in visual studio yet. (2.0 is OLD)

I can probably try compiling the CPU part of the program with a GCC or intel version with OpenMP 3.

Maybe the old version of OpenMP has some performance issues. The Itanium super computer that is practically down the street from my apartment even has newer OpenMP.

http://www.asc.edu/html/altix.shtml (we get to use this for class, until Friday anyway)
