Guest ptrein, posted December 21, 2010

Fixed. I had some spare cycles during lunch break. This update supersedes all previous versions, which have been removed (link expires 12/8).

Changes
- Updated HT detection code to eliminate false positives on Vista and newer operating systems, and to reduce false positives on pre-Vista operating systems
- Reduced the settling period between runs to eight seconds and added a two-second spool-up period for the parallel runs

mobilenvidia, posted December 22, 2010

The X3470 (ES) that I have is identical to the i7 875K but with ECC and VT-d plus some other stuff. I had hoped that the unlocked version would really be unlocked, but it isn't; I still only get a 9-22x multiplier. It easily does 4 GHz, though, or 3.8 GHz (4.4 GHz) with Turbo enabled. It makes the desktop very snappy, even compared to the i3 @ 4.4 GHz.

mobilenvidia, posted December 22, 2010

X3470 (ES) @ default 2.93 GHz

CPU ID: Intel64 Family 6 Model 30 Stepping 5, Cores: 4, Logical Processors: 8
Matrix Size: 1536x1536
Processing time for CPU (sequential): 33578 ms
Processing time for CPU (parallel HT on): 9511 ms
Processing time for CPU (parallel HT off): 14210 ms
Parallel speed-up: 3.530438, Efficiency: 88.260961% (HT on)
Parallel speed-up: 2.362984, Efficiency: 59.074595% (HT off)

That's better :) Hmmm, turning HT off makes quite a difference for me. Looks like HT can do some actual work.

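The speed-up and efficiency lines in that output are consistent with the usual definitions: speed-up is the sequential time divided by the parallel time, and efficiency is the speed-up divided by the number of physical cores. A minimal Python sketch, assuming those definitions (the function names are illustrative and not taken from ptrein's C# program):

```python
def speed_up(sequential_ms, parallel_ms):
    """How many times faster the parallel run is than the sequential run."""
    return sequential_ms / parallel_ms


def efficiency_percent(sequential_ms, parallel_ms, physical_cores):
    """Speed-up per physical core, as a percentage (assumed definition)."""
    return 100.0 * speed_up(sequential_ms, parallel_ms) / physical_cores


# Timings copied from the X3470 run above (4 physical cores).
sequential_ms = 33578
physical_cores = 4
for label, parallel_ms in (("HT on", 9511), ("HT off", 14210)):
    s = speed_up(sequential_ms, parallel_ms)
    e = efficiency_percent(sequential_ms, parallel_ms, physical_cores)
    print(f"Parallel speed-up: {s:.6f}, Efficiency: {e:.6f}% ({label})")
```

Because efficiency is normalised by physical cores rather than logical processors, a result above 100% is possible when Hyper-Threading contributes extra throughput.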
Guest ptrein, posted December 23, 2010

I did a major refactoring of the code to see how far I could push the 'envelope' in a managed language like C#; the results are not disappointing. :)

Changes
- Now accepts command line options for multiplication algorithm and matrix size.
- Totally refactored matrix multiplication code. The program now implements two algorithms: a naive version using a straightforward i,j,k loop, and Strassen's algorithm.
- HT off simulation should work more consistently now.

For those interested, the latest version of the program can be found here (link expires 12/30). Links to old versions have been removed.

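For readers unfamiliar with the two algorithms named in that changelog, here is a rough Python sketch of both. It is not ptrein's C# code; the helper names, the list-of-lists representation, and the `cutoff` parameter are assumptions made for illustration. The Strassen version falls back to the naive loop whenever the size is odd or at most `cutoff`, which lets it handle sizes such as 1536 that are not powers of two.

```python
def naive_multiply(a, b):
    """Textbook i,j,k triple loop for square or rectangular matrices."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            total = 0
            for k in range(m):
                total += a[i][k] * b[k][j]
            c[i][j] = total
    return c


def _add(x, y):
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def _sub(x, y):
    return [[p - q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def _split(m):
    """Split an even-sized square matrix into four quadrants."""
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])


def strassen_multiply(a, b, cutoff=64):
    """Strassen's algorithm for square matrices: 7 recursive products, not 8."""
    n = len(a)
    if n <= cutoff or n % 2:
        return naive_multiply(a, b)
    a11, a12, a21, a22 = _split(a)
    b11, b12, b21, b22 = _split(b)
    m1 = strassen_multiply(_add(a11, a22), _add(b11, b22), cutoff)
    m2 = strassen_multiply(_add(a21, a22), b11, cutoff)
    m3 = strassen_multiply(a11, _sub(b12, b22), cutoff)
    m4 = strassen_multiply(a22, _sub(b21, b11), cutoff)
    m5 = strassen_multiply(_add(a11, a12), b22, cutoff)
    m6 = strassen_multiply(_sub(a21, a11), _add(b11, b12), cutoff)
    m7 = strassen_multiply(_sub(a12, a22), _add(b21, b22), cutoff)
    c11 = _add(_sub(_add(m1, m4), m5), m7)
    c12 = _add(m3, m5)
    c21 = _add(m2, m4)
    c22 = _add(_add(_sub(m1, m2), m3), m6)
    # Stitch the quadrants back together row by row.
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom
```

The cutoff matters in practice: below some size the bookkeeping of the extra additions costs more than the multiplication it saves, so real implementations usually switch to the plain loop for small blocks.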
Guest ptrein, posted December 24, 2010

All cores are not created equal. On my i7-950 the second core (CPU 1) is fastest, whereas on both my Core i7-620M and Core 2 Duo E8500 the first core (CPU 0) is fastest.

Changes
- Sequential computation is now performed on each available core. The best runtime is used for the speed-up and efficiency calculations.
- Added a dry run prior to the timed runs to negate any ramp-up associated with dynamic overclocking.

For those interested, the latest version of the program can be found here (link expires 12/31). Links to old versions have been removed.

Merry Christmas and enjoy! :)

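The measurement scheme described in that changelog (one untimed dry run, then one timed sequential run per core, keeping the best) can be sketched roughly as follows. This is a Python illustration, not the program's actual code: `sequential_times`, `fastest_core`, and the optional `pin_to_core` callback are hypothetical names, and how a thread is really pinned to a core depends on the platform.

```python
import time


def time_ms(workload):
    """Wall-clock time of one call to workload(), in milliseconds."""
    start = time.perf_counter()
    workload()
    return (time.perf_counter() - start) * 1000.0


def sequential_times(workload, cores, pin_to_core=None):
    """Run workload once per core and return {core: elapsed_ms}.

    pin_to_core is an optional, caller-supplied function that binds the
    current thread to the given core; without it the runs are not pinned.
    """
    workload()  # dry run: lets clocks ramp up, result discarded
    times = {}
    for core in cores:
        if pin_to_core is not None:
            pin_to_core(core)
        times[core] = time_ms(workload)
    return times


def fastest_core(times):
    """Core with the lowest time; its time is the speed-up baseline."""
    return min(times, key=times.get)


if __name__ == "__main__":
    work = lambda: sum(i * i for i in range(200_000))
    times = sequential_times(work, cores=[0, 1])
    best = fastest_core(times)
    print(f"fastest: CPU {best} at {times[best]:.4f} ms")
```

On Linux, one possible `pin_to_core` is `lambda core: os.sched_setaffinity(0, {core})`; the original program runs on Windows, where a different mechanism would be needed.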
mobilenvidia, posted December 25, 2010

Core i5 460m

CPU ID: Intel64 Family 6 Model 37 Stepping 5, Cores: 2, Logical Processors: 4
Algorithm: Strassen's, Matrix Size: 1536x1536
Processing time for CPU 0 (sequential): 33141.6495 ms
Processing time for CPU 1 (sequential): 33882.3437 ms
Processing time for CPU * (parallel HT on ): 15413.6534 ms
Processing time for CPU * (parallel HT off): 18159.8282 ms
Parallel speed-up: 2.150149, Efficiency: 107.507444% (HT on)
Parallel speed-up: 1.824998, Efficiency: 91.249898% (HT off)

Guest ptrein, posted December 28, 2010

Changes
- Added CUDA Toolkit 3.2 integration (finally)! CUDA matrix multiplications are performed on each available GPU using the optimized CUBLAS library.

For those interested, the latest version of the program can be found here (link expires 1/4). Links to old versions have been removed.

mobilenvidia, posted December 29, 2010

Very nice app now :)

Xeon X3470 @ 2.93 GHz

CPU ID: Intel64 Family 6 Model 30 Stepping 5, Cores: 4, Logical Processors: 8
CUDA GPU: GeForce GTX 470, Cores: 448
Algorithm: Strassen's, Matrix Size: 1536x1536
Processing time for CPU 0 (sequential): 8445.7032 ms
Processing time for CPU 1 (sequential): 8177.6437 ms
Processing time for CPU 2 (sequential): 8141.7464 ms
Processing time for CPU 3 (sequential): 8193.7191 ms
Processing time for CPU * (parallel HT): 2202.4177 ms
Processing time for CPU * (parallel !HT): 2304.8124 ms
Processing time for GPU 0 (CUBLAS): 25.3861 ms
Parallel speed-up: 3.696731, Efficiency: 92.418282% ( HT)
Parallel speed-up: 3.532499, Efficiency: 88.312463% (!HT)
CUDA GPU speed-up: 320.716707

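The numbers in that run suggest that both the parallel and the GPU speed-up are measured against the fastest single-core time (CPU 2 here), not against CPU 0 or an average. A short Python check, assuming that baseline choice (the variable names are made up for the example):

```python
# Timings copied from the Xeon X3470 / GTX 470 run above, in milliseconds.
sequential_ms = {0: 8445.7032, 1: 8177.6437, 2: 8141.7464, 3: 8193.7191}
parallel_ht_ms = 2202.4177
gpu_ms = 25.3861

# Assumed baseline: the best (lowest) sequential time across cores.
baseline_core = min(sequential_ms, key=sequential_ms.get)
baseline_ms = sequential_ms[baseline_core]

parallel_speed_up = baseline_ms / parallel_ht_ms
gpu_speed_up = baseline_ms / gpu_ms

print(f"baseline: CPU {baseline_core} ({baseline_ms} ms)")
print(f"Parallel speed-up: {parallel_speed_up:.6f} (HT)")
print(f"CUDA GPU speed-up: {gpu_speed_up:.6f}")
```

Using the best core as the baseline gives the most conservative speed-up figures, since any slower core would inflate the ratio.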
Guest ptrein, posted December 30, 2010

Thanks. All I did was take what Bill started and pursue it down the path of whatever tickled my fancy. Hats off to Bill for starting this thread. :)

Time permitting, I'm looking at SSE SIMD instructions and OpenCL for upcoming versions. This should get us really close to the limits of CPU performance and cater to the AMD Radeon owners out there.

mobilenvidia, posted December 30, 2010

Yes, Bill, the creator of all this, does need kudos too :) He's too busy playing with his Panda box; we are neglected children.

Bill (author), posted December 30, 2010

I haven't spent too much time with it, but I have built it into a case of sorts, so it actually resembles a tablet now. (Maybe I'll have pictures tomorrow.) However, between driver issues (it's new, so drivers are lacking) and overheating, it needs some work before I can really make use of it. (It needs a fan or heat sink in the enclosed area.) I also need to find a good touch screen to attach to it, and I will need to figure out how to set up a good always-on on-screen keyboard.

I did get OpenMP running on it though. I will post some numbers later when I get a chance to run more tests. It's one of the things that runs like it should. (Wi-Fi and Bluetooth work as well.) I will need to compare it to a Core 2 machine running at 1 GHz.

Between Christmas, birthdays, snow (the world is ending with the snow we are having here), getting my car fixed multiple times, and driving/sliding all across town in snow/ice every day, I've been busier over the break than before it started. We brought this new sport to Alabama called Yukon Ice Skating.

Guest ptrein, posted January 4, 2011

Changes
- Totally rewritten GPGPU module that uses OpenCL [*] instead of CUDA
- Multi-GPU parallel compute on systems with more than one GPU

[*] I have very limited access to Radeon cards, so it might not work 100% on that architecture. For those brave souls willing to give it a shot, the latest AMD Catalyst Accelerated Parallel Processing (APP) Technology Edition driver is required.

For those interested, the latest version of the program can be found here (link expires 1/10). Links to old versions have been removed.

mobilenvidia, posted January 4, 2011

CPU ID: Intel64 Family 6 Model 30 Stepping 5, Cores: 4, Logical Processors: 8
GPGPU: GeForce GTX 470, Compute Units: 14
Algorithm: Strassen's, Matrix Size: 1536x1536
Processing time for CPU 0 (sequential): 7028.4161 ms
Processing time for CPU 1 (sequential): 6803.7261 ms
Processing time for CPU 2 (sequential): 6738.3781 ms
Processing time for CPU 3 (sequential): 6796.9911 ms
Processing time for CPU * (parallel HT): 1837.8873 ms
Processing time for CPU * (parallel !HT): 1917.5827 ms
Processing time for GPU 0 (OpenCL): 51.9128 ms
Parallel speed-up: 3.666372, Efficiency: 91.659294% ( HT)
Parallel speed-up: 3.513996, Efficiency: 87.849902% (!HT)
GPGPU speed-up: 129.801862

I didn't get any of the GPU OpenCL tests, although GPU-Z says OpenCL is enabled. Then again, it says PhysX isn't enabled, but it is. I see the GTX 460 has 7+7 compute units and the GTX 470 has 14.

Guest ptrein, posted January 4, 2011

The GTX 460 has 7 compute units. Mine's an SLI system, thus 7 + 7. The OpenCL results are there: fourth line from the bottom. Wow, the 470 is 32% faster than the 460 in the OpenCL test. That's impressive. :)

mobilenvidia, posted January 4, 2011

Of course it is, I forgot your SLI setup :doh:

I noticed that too. I'm running the CPU at 3.84 GHz (4.3 GHz Turbo), which might boost the performance. The GTX 470 is running at stock clocks. I may run some more scenarios, for something to do.

mobilenvidia, posted January 4, 2011

CPU default 2.93 GHz (133x22, Turbo on)
GPU default 608/837/1215 MHz

CPU ID: Intel64 Family 6 Model 30 Stepping 5, Cores: 4, Logical Processors: 8
GPGPU: GeForce GTX 470, Compute Units: 14
Algorithm: Strassen's, Matrix Size: 1536x1536
Processing time for CPU 0 (sequential): 8171.4389 ms
Processing time for CPU 1 (sequential): 8156.402 ms
Processing time for CPU 2 (sequential): 8294.0286 ms
Processing time for CPU 3 (sequential): 8581.3664 ms
Processing time for CPU * (parallel HT): 2269.4997 ms
Processing time for CPU * (parallel !HT): 2317.3234 ms
Processing time for GPU 0 (OpenCL): 55.0925 ms
Parallel speed-up: 3.593921, Efficiency: 89.848018% ( HT)
Parallel speed-up: 3.519751, Efficiency: 87.993782% (!HT)
GPGPU speed-up: 148.049226

CPU overclocked to 3520 MHz (160x22, Turbo on)
GPU default 608/837/1215 MHz

CPU ID: Intel64 Family 6 Model 30 Stepping 5, Cores: 4, Logical Processors: 8
GPGPU: GeForce GTX 470, Compute Units: 14
Algorithm: Strassen's, Matrix Size: 1536x1536
Processing time for CPU 0 (sequential): 7028.4161 ms
Processing time for CPU 1 (sequential): 6803.7261 ms
Processing time for CPU 2 (sequential): 6738.3781 ms
Processing time for CPU 3 (sequential): 6796.9911 ms
Processing time for CPU * (parallel HT): 1837.8873 ms
Processing time for CPU * (parallel !HT): 1917.5827 ms
Processing time for GPU 0 (OpenCL): 51.9128 ms
Parallel speed-up: 3.666372, Efficiency: 91.659294% ( HT)
Parallel speed-up: 3.513996, Efficiency: 87.849902% (!HT)
GPGPU speed-up: 129.801862

CPU default 2.93 GHz
GPU overclocked to 850/975/1700 MHz

CPU ID: Intel64 Family 6 Model 30 Stepping 5, Cores: 4, Logical Processors: 8
GPGPU: GeForce GTX 470, Compute Units: 14
Algorithm: Strassen's, Matrix Size: 1536x1536
Processing time for CPU 0 (sequential): 8242.8043 ms
Processing time for CPU 1 (sequential): 8042.2601 ms
Processing time for CPU 2 (sequential): 8056.3077 ms
Processing time for CPU 3 (sequential): 8095.631 ms
Processing time for CPU * (parallel HT): 2187.2142 ms
Processing time for CPU * (parallel !HT): 3161.9652 ms
Processing time for GPU 0 (OpenCL): 42.6647 ms
Parallel speed-up: 3.676942, Efficiency: 91.923554% ( HT)
Parallel speed-up: 2.543437, Efficiency: 63.585931% (!HT)
GPGPU speed-up: 188.499168

CPU overclocked to 3520 MHz (160x22, Turbo on)
GPU overclocked to 850/975/1700 MHz

CPU ID: Intel64 Family 6 Model 30 Stepping 5, Cores: 4, Logical Processors: 8
GPGPU: GeForce GTX 470, Compute Units: 14
Algorithm: Strassen's, Matrix Size: 1536x1536
Processing time for CPU 0 (sequential): 6649.5392 ms
Processing time for CPU 1 (sequential): 6602.9418 ms
Processing time for CPU 2 (sequential): 6584.4823 ms
Processing time for CPU 3 (sequential): 6614.7873 ms
Processing time for CPU * (parallel HT): 1881.7401 ms
Processing time for CPU * (parallel !HT): 1908.7813 ms
Processing time for GPU 0 (OpenCL): 40.8434 ms
Parallel speed-up: 3.499145, Efficiency: 87.478636% ( HT)
Parallel speed-up: 3.449574, Efficiency: 86.239349% (!HT)
GPGPU speed-up: 161.212884

I can conclude that CPU speed makes only a small difference in OpenCL.

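To put a number on that conclusion, the OpenCL times from the four runs can be compared directly. The Python sketch below only rearranges figures already posted; the "stock"/"overclocked" labels and the `percent_faster` helper are illustrative names.

```python
# (CPU setting, GPU setting) -> OpenCL time in ms, copied from the four runs.
opencl_ms = {
    ("stock", "stock"): 55.0925,
    ("overclocked", "stock"): 51.9128,
    ("stock", "overclocked"): 42.6647,
    ("overclocked", "overclocked"): 40.8434,
}


def percent_faster(before_ms, after_ms):
    """Reduction in run time as a percentage of the 'before' time."""
    return 100.0 * (before_ms - after_ms) / before_ms


base = opencl_ms[("stock", "stock")]
cpu_gain = percent_faster(base, opencl_ms[("overclocked", "stock")])
gpu_gain = percent_faster(base, opencl_ms[("stock", "overclocked")])

print(f"CPU overclock alone: {cpu_gain:.1f}% less OpenCL time")
print(f"GPU overclock alone: {gpu_gain:.1f}% less OpenCL time")
```

With these figures the CPU overclock alone trims the OpenCL time by roughly 6%, while the GPU overclock alone trims it by roughly 23%. Note also that the reported GPGPU speed-up drops when the CPU is overclocked, because the sequential CPU baseline it is divided into gets faster too.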
Guest ptrein, posted January 5, 2011

Changes
- Vectorized the OpenCL kernel code, which provides a ~35% improvement in GPU performance
- Added GPU capability checks to optimize the kernel build
- Added GPU constraint checks to detect and prevent resource overrun issues

For those interested, the latest version of the program can be found here (link expires 1/12). Links to old versions have been removed.

mobilenvidia, posted January 5, 2011

C:\downloads\mm>matrixmultint
Unhandled Exception: System.IO.FileNotFoundException: Could not load file or assembly 'Cloo, Version=0.8.1.0, Culture=neutral, PublicKeyToken=null' or one of its dependencies. The system cannot find the file specified.
   at matrixmultint.Program.Main(String[] args)

I get a crash when I run it.

Guest ptrein, posted January 6, 2011

Sorry, my bad. It's been fixed. New version to be posted very shortly.

Guest ptrein, posted January 6, 2011

Changes
- OpenCL functionality now enumerates and runs on all supported platforms and devices (including CPUs) [1]
- Since the OpenCL kernel code is vectorized, the corresponding CPU code is SSE optimized and will offer more than twice the performance on the CPU [2]

[1] To enable OpenCL for CPUs, ATI Stream SDK 2.3 needs to be installed. Install at your own risk, since this will override the NVIDIA OpenCL Installable Client Driver (ICD). While I have not run into any issues, I strongly recommend backing up C:/Windows/System32/OpenCL.dll and C:/Windows/SysWOW64/OpenCL.dll prior to installing the Stream SDK.

[2] From the screenshot below, Platform 1 Device 0 is the CPU running vectorized (SSE) code on all eight logical processors. We're looking at a speed-up of 9.128770 over the serial and 2.355655 over the fully parallelized Strassen's implementation!

For those interested, the latest version of the program can be found here (link expires 1/13). Links to old versions have been removed.

mobilenvidia, posted January 6, 2011

Not using ATI Stream SDK 2.3

Algorithm: Strassen's, Matrix Size: 1536x1536
OpenCL Platforms: 1, OpenCL Devices: 1
Platform 0 Device 0: GeForce GTX 470, Compute Units: 14
CPU ID: Intel64 Family 6 Model 30 Stepping 5, Cores: 4, Logical Processors: 8
Processing time for CPU 0 (sequential): 6595.9145 ms
Processing time for CPU 1 (sequential): 6569.3299 ms
Processing time for CPU 2 (sequential): 6566.2564 ms
Processing time for CPU 3 (sequential): 6572.6406 ms
Processing time for CPU * (parallel HT): 1829.472 ms
Processing time for CPU * (parallel !HT): 1908.4017 ms
Processing time for Platform 0 Device 0 (OpenCL): 35.427 ms
Parallel speed-up: 3.589154, Efficiency: 89.728845% ( HT)
Parallel speed-up: 3.440710, Efficiency: 86.017745% (!HT)
OpenCL speed-up: 185.346103

Using ATI Stream SDK 2.3

Algorithm: Strassen's, Matrix Size: 1536x1536
OpenCL Platforms: 2, OpenCL Devices: 2
Platform 0 Device 0: GeForce GTX 470, Compute Units: 14
Platform 1 Device 0: Intel® Xeon® CPU X3470 @ 2.93GHz, Compute Units: 8
CPU ID: Intel64 Family 6 Model 30 Stepping 5, Cores: 4, Logical Processors: 8
Processing time for CPU 0 (sequential): 6778.4938 ms
Processing time for CPU 1 (sequential): 6586.0664 ms
Processing time for CPU 2 (sequential): 6570.9909 ms
Processing time for CPU 3 (sequential): 6575.0096 ms
Processing time for CPU * (parallel HT): 1857.3979 ms
Processing time for CPU * (parallel !HT): 1907.4667 ms
Processing time for Platform 0 Device 0 (OpenCL): 34.9568 ms
Processing time for Platform 1 Device 0 (OpenCL): 762.894 ms
Parallel speed-up: 3.537740, Efficiency: 88.443501% ( HT)
Parallel speed-up: 3.444878, Efficiency: 86.121961% (!HT)
OpenCL speed-up: 187.974612

One request, to save copying/pasting and Command Prompting: any chance of a "press any key to exit"? Then we can run it with a double click :)

Other than that, great work.

Guest ptrein, posted January 6, 2011

Implemented as requested. :) Same link for the download.

mobilenvidia, posted January 6, 2011

Quoting ptrein: "Implemented as requested. :) Same link for the download."

Well done. This is a nice little app to quickly bench CPU and GPU without grabbing huge files. I like how it stresses all 8 cores.

mobilenvidia, posted January 7, 2011

Thought I'd shift this over here.

Guest ptrein, posted January 8, 2011

Good call. With Sandy Bridge around the corner, it'll be very interesting to see how well it performs on this test...