Bill Posted November 28, 2010

So I've been working on my CUDA project for class. It's a matrix multiplication program that multiplies two 1,536x1,536 matrices three ways: on a single Nvidia GPU, on a single thread of your primary CPU, and on 4 threads using OpenMP. It was compiled with Visual Studio 2010 Professional (using the V90 compilers with Nvidia Parallel Nsight 1.5). It is probably possible to speed up the CPU implementation somewhat with the Intel compiler and SSE4, but I'll mess with that later when I have some more time.

I have attached 32- and 64-bit Windows builds of the program so you guys can test it out. Right now the interface is very basic: it is hard-coded to run 4 threads on the CPU, so if you have a dual core the efficiency won't be what it should be, but you will still see what the CUDA speedup is vs. your CPU. The program was targeted at my 2 development machines, which both have Intel quad cores and 2 Nvidia GPUs, so you have to select which GPU to use (just select the first one if you only have one).

First up we have an integrated GPU, the 9300. Even with only 16 CUDA cores it manages to beat an overclocked Q9650 running 4 threads by a factor of more than 4 (the final time ratio at the bottom shows the 4-thread CPU time vs. the GPU time).

Next we have my Clevo D901C SLI laptop running a stock-speed Q6600. It is only running 1 GPU here; if I changed the code to run on both GPUs, the GPU time would be roughly cut in half. With both GPUs, the laptop's SLI system has more than 30 times the floating-point power of the Q6600.

Next we have my EVGA FTW edition GTX 275, with a speedup of over 90 vs. the Q9650 running 4 threads. I don't know why it was sometimes a lot faster; probably something to do with the cache.
There are probably some communication times and cache effects affecting these numbers, and the hard-coded matrix size might be a little low to fully stress the GPU, although it is high enough to show good efficiency on an Intel quad. Anyway, this is just something neat to experiment with. You guys should hopefully be able to run the exes with no problem; I will test this on my XP64 machine that doesn't have any dev tools installed. I might upgrade the UI a bit and upload a new version later.

Edit: As Pieter found out the hard way, you need a GeForce 8 series or higher or you will get a funny error (older cards don't support proper CUDA). Also, I just tested this on a dual-core machine and the performance was much worse than it should be. I'll need to add a thread-count option to the command arguments.

Edit 2: I just added a new version of the program; you can now change the number of threads and the block size. If you choose a large block size it might take a while to run, especially if you hit a memory bottleneck somewhere. I think my Q9650 runs into a bottleneck right around 2048x2048, where performance and multi-threading efficiency drop dramatically. I do have other programs running idle in the background, though. It would be interesting to see where the memory bottleneck is on Core 2 vs. Core i7 machines.

CudaPerformance3.7z
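For anyone curious what the CPU side of a benchmark like this looks like, here is a minimal sketch of a naive OpenMP matrix multiply. This is illustrative only, not the code from the attached build; the function and variable names are made up.

```c
/* Naive single-precision matrix multiply, C = A * B, all N x N,
   row-major. The same loop nest serves the 1-thread and multi-thread
   runs; OpenMP splits the outer loop across threads when the compiler
   is invoked with OpenMP enabled (/openmp or -fopenmp), and the pragma
   is simply ignored otherwise. Illustrative sketch only. */
void matmul_cpu(const float *A, const float *B, float *C, int N)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++)
                sum += A[i * N + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
}
```

With this structure, the "CPU 1 thread" and "CPU n threads" timings differ only in how many threads OpenMP assigns to the outer loop.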
Blacky Posted November 28, 2010

Nvidia 280M vs. QX9300 @ 2.93 GHz.
mobilenvidia Posted November 28, 2010

Now I'm looking forward to the CPU arriving even more. The X3470 with a GTX 470 should do rather well with this. You may also need to adjust the CPU threads up or down; Nehalem CPUs with HT can go as high as 16 threads.
Guest domenico Posted November 28, 2010

I've done an 8600M GT (32 CUDA cores) vs. a T7500 (2 cores). Here is the result: http://yfrog.com/f3capturezgj

I think optimizing the exe for dual cores too would have given a better result.
Bill Posted November 28, 2010

I just updated the program. Keep in mind this is not the most efficient way to run matrix multiplication on an Intel quad core; SSE4.x instructions might improve it somewhat, and if I ever get ICC to run correctly it might increase the CPU speed by up to 50%. The CUDA part was written in a very optimized manner by Nvidia. Even if the CPU program were optimized, my Q9650 would still get its butt kicked by a 9300, though.
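Since cache behavior keeps coming up in these results: one compiler-agnostic way to speed up the CPU side, short of ICC or SSE4 intrinsics, is loop tiling, which keeps a small working set of the matrices hot in cache. A rough sketch follows; the tile size BS is an assumed tunable, and none of this is the program's actual code.

```c
#define BS 64  /* tile edge; would be tuned per cache size (assumption) */

/* Cache-blocked (tiled) C = A * B for N x N row-major floats.
   The i/k/j inner ordering streams B and C rows sequentially, and
   tiling bounds each tile's working set so it fits in cache. */
void matmul_tiled(const float *A, const float *B, float *C, int N)
{
    for (int i = 0; i < N * N; i++)
        C[i] = 0.0f;
    for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int jj = 0; jj < N; jj += BS)
                for (int i = ii; i < ii + BS && i < N; i++)
                    for (int k = kk; k < kk + BS && k < N; k++) {
                        float a = A[i * N + k];
                        for (int j = jj; j < jj + BS && j < N; j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```

A tiled loop like this tends to soften the cliff seen when the matrices outgrow the last-level cache, which is relevant to the 2048x2048 bottleneck mentioned earlier in the thread.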
Guest Mirko Posted November 28, 2010

GTX 285 vs. Phenom II 965 BE @ default clocks
Guest Sensei Posted November 29, 2010

This software looks interesting :) x64 version (the x32 one seems to crash on my W7, btw)

Select which GPU to run the test on. Enter 1 for the first GPU, etc.
1
Select the number of threads for the CPU test.
2
Select the block multiple for the matrix size. (version one is 96)
(64 for 1024x1024, 96 for 1536x1536, 128 for 2048x2048, etc.)
96
device name: GeForce GTS 250 <----- creating CUDA context on this device
device sharedMemPerBlock: 16384
device totalGlobalMem: 511246336
device regsPerBlock: 8192
device warpSize: 32
device memPitch: 2147483647
device maxThreadsPerBlock: 512
device maxThreadsDim[0]: 512
device maxThreadsDim[1]: 512
device maxThreadsDim[2]: 64
device maxGridSize[0]: 65535
device maxGridSize[1]: 65535
device maxGridSize[2]: 1
device totalConstMem: 65536
device major: 1
device minor: 1
device clockRate: 1836000
device textureAlignment: 256
device deviceOverlap: 1
device multiProcessorCount: 16
Total CUDA cores: 128
Processing time for GPU: 78 (ms)
Processing time for CPU 1 thread: 20748 (ms)
Processing time for CPU 2 threads: 11793 (ms)
CPU multithread speedup: 1.759349, efficiency: 87.967438
CPU to GPU time ratio (CUDA Speedup): 151.192307
karl_w_w Posted November 29, 2010

9400 GT vs. Athlon II X2 240 @ 3.45 GHz

Processing time for GPU: 1029 (ms)
Processing time for CPU 1 thread: 64678 (ms)
Processing time for CPU 2 threads: 45802 (ms)
CPU multithread speedup: 1.412122, efficiency: 70.606087
CPU to GPU time ratio (CUDA Speedup): 44.511177

What's with the horrible multithread speedup? Is it because it's an AMD chip, because it's 2 cores instead of 4, or because of my big OC (0.65 GHz)?
Bill Posted November 29, 2010

It could be a combination of the weaker architecture of your chip (compared to a good Core 2 with more cache) and maybe throttling from the OC if it's running too hot. You might have a memory bottleneck somewhere (cache or RAM), or other things running on your system. You might also be running a really large matrix dimension to get times that long (you didn't say which size you tested); try it with the 96 block multiple. Also, I think VS has an older version of OpenMP built in (maybe) that could cause a decrease. I am not putting Intel-specific code in this program, though, so that is not the problem.
Guest Blubbi Posted November 29, 2010

Core i3, GeForce GT 330M
mobilenvidia Posted November 29, 2010

I've just installed my GTX 470 in my son's P4 system, along with driver 263.06 and the CUDA toolkit. I still get the side-by-side error that I also get with my laptop; is something else needed to run this? It would be quite a ratio, P4 vs. GTX 470, if I can ever get it to run.
Bill Posted November 29, 2010

I think you need the Visual Studio 2008 SP1 redistributable. Most users here probably have one or more of them from games or other programs.

http://www.microsoft.com/downloads/en/details.aspx?familyid=A5C84275-3B97-4AB7-A40D-3802B2AF5FC2&displaylang=en (x86)
http://www.microsoft.com/downloads/en/details.aspx?familyid=bd2a6171-e2d6-4230-b809-9a8d7548c1b6&displaylang=en (x86_x64)

Edit: I repackaged the program with more DLLs. I think you were missing the OpenMP ones that are used for the multi-threading.
mobilenvidia Posted November 30, 2010

That's better. I installed VS2k8 on the P4 system and it's working on the CPU score now (taking ages). On the laptop I didn't install it, but ran the app with the extra DLLs; it now runs and exits with no input/output and no error. I take it it checks for a CUDA GPU before running?
mobilenvidia Posted November 30, 2010

Ran this on: Pentium 4 @ 3.0 GHz with HT, 1 GB RAM (dual channel), GTX 470

******Matrix Multiplication Performance Analysis CUDA program*******
based on Nvidia reference program with OpenMP for CPU multithreading
Select which GPU to run the test on. Enter 1 for the first GPU, etc.
1
Select the number of threads for the CPU test.
1
Select the block multiple for the matrix size. (version one is 96)
(64 for 1024x1024, 96 for 1536x1536, 128 for 2048x2048, etc.)
96
device name: GeForce GTX 470 <----- creating CUDA context on this device
device sharedMemPerBlock: 49152
device totalGlobalMem: 1341718528
device regsPerBlock: 32768
device warpSize: 32
device memPitch: 2147483647
device maxThreadsPerBlock: 1024
device maxThreadsDim[0]: 1024
device maxThreadsDim[1]: 1024
device maxThreadsDim[2]: 64
device maxGridSize[0]: 65535
device maxGridSize[1]: 65535
device maxGridSize[2]: 1
device totalConstMem: 65536
device major: 2
device minor: 0
device clockRate: 1215000
device textureAlignment: 512
device deviceOverlap: 1
device multiProcessorCount: 14
Total CUDA cores: 112 (should be 448)
Processing time for GPU: 47 (ms)
Processing time for CPU 1 thread: 164531 (ms)
Processing time for CPU 1 threads: 164344 (ms)
CPU multithread speedup: 1.001138, efficiency: 100.113785
CPU to GPU time ratio (CUDA Speedup): 3496.680908

Can anybody beat that ratio? :)

Clocks for the GTX 470 were the default 607/1214/1674 MHz (core/shader/memory). 47 ms seems a little slow for a GTX 470; I wonder what will happen when I get a quad-core 2.93 GHz CPU running this.
mobilenvidia Posted November 30, 2010

device clockRate: 1700000
device textureAlignment: 512
device deviceOverlap: 1
device multiProcessorCount: 14
Total CUDA cores: 112 (should be 448)
Processing time for GPU: 15 (ms)
Processing time for CPU 1 thread: 50578 (ms)
Processing time for CPU 1 threads: 51219 (ms)
CPU multithread speedup: 0.987485, efficiency: 98.748512
CPU to GPU time ratio (CUDA Speedup): 3414.600098

This is with the GTX running at 850/1700/2002 MHz (core/shader/memory). Bill, I noticed that the CUDA core count detected is slightly wrong (i.e., 112 detected x 4 = 448 actual). Huge jump in speed with this overclock: the GPU time dropped to nearly 1/3, but so did the CPU time, which I did not overclock, so the ratio should have been rather huge.
Bill Posted November 30, 2010

The total CUDA core reporting was just this code:

totalCudaCores = (deviceProperty.multiProcessorCount*8);
printf("Total CUDA cores: %d \n", totalCudaCores);

So I had simply hard-coded a value of 8 cores per multiprocessor. I found out that device major and device minor are the CUDA compute version. Currently there are 1.0 (first 8 series), 1.1 (most 8/9 series), 1.2 (slower GT200 series), 1.3 (GTX 260-295), 2.0 (first-gen Fermi), and 2.1 (the newer Fermi cards). Here is the fixed code:

if (deviceProperty.major == 1)
{
    totalCudaCores = (deviceProperty.multiProcessorCount*8);
    printf("Total CUDA cores: %d \n", totalCudaCores);
}
else if (deviceProperty.major == 2)
{
    if (deviceProperty.minor == 0)
    {
        totalCudaCores = (deviceProperty.multiProcessorCount*32);
        printf("Total CUDA cores: %d \n", totalCudaCores);
    }
    else if (deviceProperty.minor == 1)
    {
        totalCudaCores = (deviceProperty.multiProcessorCount*48);
        printf("Total CUDA cores: %d \n", totalCudaCores);
    }
    else
    {
        printf("Total CUDA cores unknown, version 2.");
        printf("%d was released after this software was written.\n", deviceProperty.minor);
    }
}
else
{
    printf("Total CUDA cores unknown, version %d.", deviceProperty.major);
    printf("%d was released after this software was written.\n", deviceProperty.minor);
}

This has no impact on the actual computation code provided by Nvidia. That was written generically, and the CUDA compiler works its magic so the code runs decently on Fermi cards as well as older GPUs (hopefully; Nvidia is still improving this). In the if statements you could put switches for different abilities; for example, early hardware supports only 24-bit integer multiply, and 1.3+ supports double-precision floating point.

Pieter, you should increase my post upload limit; it won't let me upload a file over 1 MB.

Edit: I updated the program again; if you set threads to 0 it won't compute on the CPU. You can also enter the 3 options at the same time.
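The major/minor mapping above can also be factored into a small helper so that a future compute version only needs one new line. This is just a sketch of an alternative structure, not code from the program; the cores-per-SM figures are the ones known at the time (1.x = 8, 2.0 = 32, 2.1 = 48).

```c
/* Cores per multiprocessor by CUDA compute capability (major, minor).
   Returns -1 for versions released after this table was written,
   so the caller can print the same "unknown version" warning. */
int cores_per_sm(int major, int minor)
{
    if (major == 1)
        return 8;                 /* all 1.x devices */
    if (major == 2 && minor == 0)
        return 32;                /* first-gen Fermi */
    if (major == 2 && minor == 1)
        return 48;                /* newer Fermi cards */
    return -1;                    /* unknown, newer than this table */
}
```

A caller would then compute totalCudaCores as multiProcessorCount times this value, checking for -1 first.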
Here is my GTX running 8192x8192 matrices for comparison; with these larger sizes we can compare GPUs better. Page 37 of this document gives more detail about what is going on on the GPU side, as well as other in-depth info: http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/NVIDIA_CUDA_C_BestPracticesGuide_3.1.pdf

The max block multiple I can run on my GTX 275 is about 532, i.e. 532x16 = 8512x8512 matrices. Anything higher runs out of memory and crashes. For 532 I get 5600 ms on my GTX. The amount of memory allocated for the 3 matrices on the GPU works out to:

((532*16)^2) * (3 matrices) * (4 bytes per float) / 1024 / 1024 = 829.171875 MB, out of 896 MB total.

So about 66 MB is reserved for other things on my system (like maybe the dual displays I am running, and other code variables).
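The memory arithmetic above generalizes to any block multiple. A small helper version of the same calculation (illustrative names, not from the program):

```c
/* MB needed on the GPU for the three N x N float matrices,
   where N = block_multiple * 16, matching the program's size scheme. */
double matrices_mb(int block_multiple)
{
    double n = (double)block_multiple * 16.0;
    return n * n * 3.0 * sizeof(float) / 1024.0 / 1024.0;
}
```

For block multiple 532 this gives the 829.171875 MB figure above; for the default 96 (1536x1536) it is only 27 MB, which is why even small GPUs in this thread can run the default size.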
mobilenvidia Posted November 30, 2010

******Matrix Multiplication Performance Analysis CUDA program*******
based on Nvidia reference program with OpenMP for CPU multithreading
Select which GPU to run the test on. Enter 1 for the first GPU, etc.
1
Select the number of threads for the CPU test.
0
Select the block multiple for the matrix size. (version one is 96)
(64 for 1024x1024, 96 for 1536x1536, 128 for 2048x2048, etc.)
512
device name: GeForce GTX 470 <----- creating CUDA context on this device
device sharedMemPerBlock: 49152
device totalGlobalMem: 1341718528
device regsPerBlock: 32768
device warpSize: 32
device memPitch: 2147483647
device maxThreadsPerBlock: 1024
device maxThreadsDim[0]: 1024
device maxThreadsDim[1]: 1024
device maxThreadsDim[2]: 64
device maxGridSize[0]: 65535
device maxGridSize[1]: 65535
device maxGridSize[2]: 1
device totalConstMem: 65536
device major: 2
device minor: 0
device clockRate: 1215000
device textureAlignment: 512
device deviceOverlap: 1
device multiProcessorCount: 14
Total CUDA cores: 448
Processing time for GPU: 12672 (ms)

Cores all fixed, well done Bill. I feel a new benchmark standard coming on :)
mobilenvidia Posted November 30, 2010

******Matrix Multiplication Performance Analysis CUDA program*******
based on Nvidia reference program with OpenMP for CPU multithreading
Select which GPU to run the test on. Enter 1 for the first GPU, etc.
1
Select the number of threads for the CPU test.
0
Select the block multiple for the matrix size. (version one is 96)
(64 for 1024x1024, 96 for 1536x1536, 128 for 2048x2048, etc.)
512
Processing time for GPU: 6328 (ms)

Running it again took half the time; caching at play here? Block multiples above 512 are not working. With my max overclock I get 8406 ms; I can't work that out, even after a few goes.
Bill Posted November 30, 2010

Maybe it went faster the second time because your PC already had main memory reserved or something. Nvidia's code for timing the GPU includes the time to transfer the result back to the PC (but not the time to send the data to the GPU). It's probably important to include that in the overall bench time to identify PCI-E and RAM bottlenecks; a researcher running math on the GPU will need to pull the data back into system RAM to store it on a hard drive or something. No matter how fast your GPU is, if you have to wait longer to transfer the results from it than it took to calculate them, you have a definite bottleneck.
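To get a feel for when the copy-back starts to matter, one can estimate the result-transfer time from the matrix size and an assumed PCI-E bandwidth and compare it with the measured kernel time. The bandwidth figure is a round-number assumption here, not a measurement:

```c
/* Estimated milliseconds to copy an N x N float result back over
   PCI-E at a given effective bandwidth in GB/s (assumed figure;
   real effective bandwidth depends on chipset and transfer size). */
double copyback_ms(int n, double gb_per_s)
{
    double bytes = (double)n * (double)n * sizeof(float);
    return bytes / (gb_per_s * 1e9) * 1000.0;
}
```

At an assumed 5 GB/s, an 8512x8512 float result is roughly 58 ms to copy back, which is on the same order as some of the GPU compute times in this thread, so including the transfer in the benchmark does change the picture.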
Guest ptrein Posted December 1, 2010

Core i7 950 + GTX 460 SLI

******Matrix Multiplication Performance Analysis CUDA program*******
based on Nvidia reference program with OpenMP for CPU multithreading
Select which GPU to run the test on. Enter 1 for the first GPU, etc.
2
Select the number of threads for the CPU test.
8
Select the block multiple for the matrix size. (version one is 96)
(64 for 1024x1024, 96 for 1536x1536, 128 for 2048x2048, etc.)
96
device name: GeForce GTX 460
device sharedMemPerBlock: 49152
device totalGlobalMem: 1041694720
device regsPerBlock: 32768
device warpSize: 32
device memPitch: 2147483647
device maxThreadsPerBlock: 1024
device maxThreadsDim[0]: 1024
device maxThreadsDim[1]: 1024
device maxThreadsDim[2]: 64
device maxGridSize[0]: 65535
device maxGridSize[1]: 65535
device maxGridSize[2]: 1
device totalConstMem: 65536
device major: 2
device minor: 1
device clockRate: 810000
device textureAlignment: 512
device deviceOverlap: 1
device multiProcessorCount: 7
Total CUDA cores: 336
device name: GeForce GTX 460 <----- creating CUDA context on this device
device sharedMemPerBlock: 49152
device totalGlobalMem: 1041694720
device regsPerBlock: 32768
device warpSize: 32
device memPitch: 2147483647
device maxThreadsPerBlock: 1024
device maxThreadsDim[0]: 1024
device maxThreadsDim[1]: 1024
device maxThreadsDim[2]: 64
device maxGridSize[0]: 65535
device maxGridSize[1]: 65535
device maxGridSize[2]: 1
device totalConstMem: 65536
device major: 2
device minor: 1
device clockRate: 810000
device textureAlignment: 512
device deviceOverlap: 1
device multiProcessorCount: 7
Total CUDA cores: 336
Processing time for GPU: 62 (ms)
Processing time for CPU 1 thread: 24383 (ms)
Processing time for CPU 8 threads: 6334 (ms)
CPU multithread speedup: 3.849542, efficiency: 48.119278
CPU to GPU time ratio (CUDA Speedup): 102.161293
Hit any key to terminate
Guest ptrein Posted December 1, 2010

Core i7 620M + NVS 3100M

******Matrix Multiplication Performance Analysis CUDA program*******
based on Nvidia reference program with OpenMP for CPU multithreading
Select which GPU to run the test on. Enter 1 for the first GPU, etc.
1
Select the number of threads for the CPU test.
4
Select the block multiple for the matrix size. (version one is 96)
(64 for 1024x1024, 96 for 1536x1536, 128 for 2048x2048, etc.)
96
device name: NVS 3100M <----- creating CUDA context on this device
device sharedMemPerBlock: 16384
device totalGlobalMem: 497549312
device regsPerBlock: 16384
device warpSize: 32
device memPitch: 2147483647
device maxThreadsPerBlock: 512
device maxThreadsDim[0]: 512
device maxThreadsDim[1]: 512
device maxThreadsDim[2]: 64
device maxGridSize[0]: 65535
device maxGridSize[1]: 65535
device maxGridSize[2]: 1
device totalConstMem: 65536
device major: 1
device minor: 2
device clockRate: 1468000
device textureAlignment: 256
device deviceOverlap: 1
device multiProcessorCount: 2
Total CUDA cores: 16
Processing time for GPU: 562 (ms)
Processing time for CPU 1 thread: 41917 (ms)
Processing time for CPU 4 threads: 18673 (ms)
CPU multithread speedup: 2.244792, efficiency: 56.119801
CPU to GPU time ratio (CUDA Speedup): 33.225979
Hit any key to terminate
Guest ptrein Posted December 1, 2010

Core 2 Duo E8500 + GTS 450

******Matrix Multiplication Performance Analysis CUDA program*******
based on Nvidia reference program with OpenMP for CPU multithreading
Select which GPU to run the test on. Enter 1 for the first GPU, etc.
1
Select the number of threads for the CPU test.
2
Select the block multiple for the matrix size. (version one is 96)
(64 for 1024x1024, 96 for 1536x1536, 128 for 2048x2048, etc.)
96
device name: GeForce GTS 450 <----- creating CUDA context on this device
device sharedMemPerBlock: 49152
device totalGlobalMem: 1041694720
device regsPerBlock: 32768
device warpSize: 32
device memPitch: 2147483647
device maxThreadsPerBlock: 1024
device maxThreadsDim[0]: 1024
device maxThreadsDim[1]: 1024
device maxThreadsDim[2]: 64
device maxGridSize[0]: 65535
device maxGridSize[1]: 65535
device maxGridSize[2]: 1
device totalConstMem: 65536
device major: 2
device minor: 1
device clockRate: 1850000
device textureAlignment: 512
device deviceOverlap: 1
device multiProcessorCount: 4
Total CUDA cores: 192
Processing time for GPU: 94 (ms)
Processing time for CPU 1 thread: 14804 (ms)
Processing time for CPU 2 threads: 7629 (ms)
CPU multithread speedup: 1.940490, efficiency: 97.024513
CPU to GPU time ratio (CUDA Speedup): 81.159576
Hit any key to terminate
Guest ptrein Posted December 1, 2010

karl_w_w said:
9400 GT vs. Athlon II X2 240 @ 3.45 GHz
Processing time for GPU: 1029 (ms)
Processing time for CPU 1 thread: 64678 (ms)
Processing time for CPU 2 threads: 45802 (ms)
CPU multithread speedup: 1.412122, efficiency: 70.606087
CPU to GPU time ratio (CUDA Speedup): 44.511177
What's with the horrible multithread speedup? Either cos it's an AMD chip, or cos it's 2 cores not 4, or cos of my big OC (0.65 ghz)?

I too noticed that CPU performance tends to favor the Intel Core microarchitecture (which might indicate that the code/library was optimized for it). For example, here are the single-thread CPU results from three of my systems:

[Core i7 950 / Nehalem / Bloomfield] Processing time for CPU 1 thread: 24383 (ms)
[Core i7 620M / Nehalem / Arrandale] Processing time for CPU 1 thread: 41917 (ms)
[Core 2 Duo E8500 / Core / Wolfdale] Processing time for CPU 1 thread: 14804 (ms)
Guest ptrein Posted December 1, 2010

device name: GeForce GTX 460 <----- creating CUDA context on this device
...
device clockRate: 810000 (should be 1430000)
...

I noticed the shader clock rate for the GTX 460 is off, but correct for the GTS 450 (another CUDA compute capability 2.1 device).
Bill Posted December 1, 2010

I just tried changing the clock rate, and the program reports the same number as before. It reads the capabilities from the GPU firmware, and there is a different function to determine the actual clock rate (cudaGetDeviceProperties() vs. cuDeviceGetAttribute()); the second should return the actual clock rate. I am going to try the new function now.

I'm still telling you guys that I'm not trying to optimize this for Core 2s. I would have to go change something (complicated) in Visual Studio and use the Intel compiler, which I don't have on my Clevo laptop. While the laptop has a Q6600, this is standard Visual Studio compiling, probably with older SSE sets. I will try adding more SSE to this, though; maybe I can put in multiple code paths, so if you have SSE4 it gets used, which might speed things up on newer CPUs. If I were putting loads of optimizations in this thing, then SSE4 CPUs would be the fastest, as they have floating-point multiply-and-accumulate type instructions (maybe Core i7 only). These would do the ops in one instruction instead of several; I don't know how many cycles each type of CPU would need for them, though.

My Q9650 might be really fast because its 12 MB cache might be the best for this type of application, or at least for the way the code runs on it. All the AMD CPUs I have seen benched in this thread have only 2 MB of cache per 2 cores, I think, whereas my Core 2 has 6 MB. Also, M$ was lazy and didn't put OpenMP 3.0 support in Visual Studio yet (2.0 is OLD). I can probably try compiling the CPU part of the program with a GCC or Intel compiler with OpenMP 3; maybe the old version of OpenMP has some performance issues. The Itanium supercomputer that is practically down the street from my apartment even has newer OpenMP: http://www.asc.edu/html/altix.shtml (we get to use this for class, until Friday anyway).
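The multiple-code-path idea above hinges on detecting SSE4 at runtime. Here is a minimal sketch of that check using CPUID; this uses the GCC/Clang <cpuid.h> interface (MSVC would use __cpuid from <intrin.h> instead), and it is an illustration of the technique, not code from the program.

```c
#include <cpuid.h>

/* Runtime check for SSE4.1 support via CPUID leaf 1.
   Returns 1 if supported, 0 otherwise. A build with multiple code
   paths would branch on this to pick the SSE4 kernel. */
int has_sse41(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ecx & (1u << 19)) != 0;  /* ECX bit 19 = SSE4.1 */
}
```

The dispatch itself is then a plain if/else at startup, selecting between a baseline kernel and the SSE4 one, so a single exe runs correctly on both old and new CPUs.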