stephenbrooks.org : Benchmarking with Mpts thread

Stephen Brooks
2010-01-11 15:08:22

[TA]Assimilator1 thanks for the list, the graph needed clearing out anyway because the bars were getting too small to read. You may remember I did this once before because the original bar chart I used in 2005-06 had a 333MHz P-II as its bottom entry. The second chart we're on now had a 3.06GHz P4 as the bottom entry and a minimum version of 4.43d.

--[Is that possible for you to offer for download a 'typical' simulation for benchmarking purposes?]--

In principle, yes, and you're not the first to ask for such a thing (K`tetch asked repeatedly). Unfortunately, because there is variation between simulations, such a fixed lattice would not really be representative of the current "average" over many designs that muon1bench gives.

Stephen Brooks
2010-01-11 15:09:11

As for me "editing" my posts, I got around to adding a delete post function, which I can use and then retype the latest post! (also remove double posts)

Stephen Brooks
2010-01-11 19:54:04

This afternoon I also fixed the muon1bench.exe available from the main Muon1 page, because it was producing a mess of ugly text output on one user's computer before. Anyway, here is the new graph.

[TA]Assimilator1 2010-01-12 23:05:40	Hi Stephen, thanks for the new graph. I did notice that some scores had been removed from previous graphs; I was just thinking that v4.43d scores shouldn't have been mixed with v4.44d scores previously. Got ya, re 'editing' posts. Btw would it be a big job to put in a real edit function for us all? >>>>>--[Is that possible for you to offer for download a 'typical' simulation for benchmarking purposes?]-- In principle, yes, and you're not the first to ask for such a thing (K`tetch asked repeatedly). Unfortunately because there is variation between simulations, such a fixed lattice would not really be representative of the current "average" over many designs that muon1bench gives.<<<<< I take it you meant muon1? How about a simulation which is 'averaged' over all (or many) simulations? Btw it wouldn't matter too much if it wasn't truly representative of current simulations, you would still have a repeatably accurate benchmark to be able to compare different h/w or s/w configs. With the current benchmarking system how much variation in score is there (on the same rig) between different simulations (using the same client)? Btw can BOINC users use muon1bench?
tomaz 2010-01-12 23:36:44	What's the point in benchmarking N-processor machines with N instances? A benchmark with 1 instance on N processors is what can be compared to other CPUs, nothing else. The situation is similar, although not the same, with multicore CPUs. I think all comparisons should be made with a 1-instance run. For example, what is the difference between running 8 Muons on an 8x Opteron and 8 Muons on 8 i7 machines? OK, the difference is in speed, but nobody claimed a bench of 8x i7 because it is pointless.
Stephen Brooks 2010-01-13 15:28:56	--[Btw it wouldn't matter too much if it wasn't truly representative of current simulations, you would still have a repeatably accurate benchmark to be able to compare different h/w or s/w configs.]-- If you want that, just use SysMark. Muon1bench.exe measures the efficiency of your computer at doing real Muon1 workunits. As such, there is some variability from simulation to simulation (and lattice to lattice). Although Mpts tries to be a good way of measuring the calculation done, the overheads vary between different simulations with many/few particles in them (and this dependency might itself change depending on things like L2 cache). --[What's the point in benchmarking N-processor machines with N instances?]-- If you find a different arrangement of clients works better, you can use that. E.g. any of 1x8 threads, 2x4, 4x2 or 8x1 are valid for an 8-core machine. I'm not going to restrict people to having to use only a single instance if that is not the ideal configuration for their machine. In principle, the multithreader in Muon1 tries to make this unnecessary, but there are overheads when the number of particles becomes small. In these cases running multiple instances and relying on the OS thread scheduler can be more efficient.
Stephen Brooks 2010-01-13 15:33:22	--[Btw can BOINC users use muon1bench?]-- I think yoyo's BOINC adaptor produces a folder for Muon1 to work inside for each result completed (muon1bench.exe would have to be put in there). It may delete it afterwards. Probably for those people it's best to just download the standalone Muon1, run that for a while and put muon1bench in it. It's a bit of a meaningless question, and how is it supposed to work anyway if the BOINC client is mixing Muon1 with work from other projects?
[TA]Assimilator1 2010-01-15 18:30:29	>>>If you want that, just use SysMark.<<< Well that would be totally pointless seeing as we're trying to benchmark DPAD accurately! And I ask again, how much variation is there in benchmarking different simulations? I thought DPAD didn't really care about L2 cache size? Re BOINC, it isn't a meaningless question, because many people run DPAD in BOINC! But yes they would have to set the CPU to 100% DPAD to make it meaningful. I ask because people prefer to use installs they already have rather than adding another; I was trying to see if there was a way to get more people benchmarking DPAD & posting scores here. tomaz: it used to be the case with previous clients (pre v4.44d) that running 1 instance/core was much faster. Of course people should state/show this is the case when submitting benchmarks.
Stephen Brooks 2010-01-15 18:52:31	--[And I ask again, how much variation is there in benchmarking different simulations?]-- If you are comparing individual simulations with each other, big, probably at least 10-20%, depending on the parameters of each one. That's why Muon1Bench calculates an average over many simulations. That is how to benchmark Muon1 accurately. If you just picked one simulation it wouldn't be representative so it would give a very precisely determined, but fairly meaningless artificial number. Benchmarking of BOINC setups sounds like it would be tricky, though I don't know in detail how Yoyo runs Muon1 in his BOINC project.
Stephen Brooks 2010-01-16 03:12:42	--[Got ya, re 'editing' posts. Btw would it be a big job to put in a real edit function for us all?]-- Took me about 90 minutes. Kind of complicated because the user must first ask to edit, send a signal to the site to display the edit box, then that edit box must send a second signal to the site to actually do the job, all the while checking permissions so users can't edit each others' posts, etc. It's probably got bugs, but try it out (editing only allowed for 15 minutes after the post).
[TA]Assimilator1 2010-01-16 14:19:35	Awesome! I'll give the edit function a shot in a minute. [edit] I meant to ask, why the 15min limit on editing? [edit2] Yes the editing works! Re simulations, sorry my bad, getting terminology mixed up, I meant optimisations not simulations! Is it possible to have a 'middle of the road' optimisation to use for benchmarking, hence making benchmarks repeatable & accurate? I assume different optimisations give different scores? [Edited by [TA]Assimilator1 at 2010-01-16 14:20:20] [Edited by [TA]Assimilator1 at 2010-01-16 14:20:59]
Stephen Brooks 2010-01-17 00:00:22	--[Is it possible to have a 'middle of the road' optimisation to use for benchmarking]-- Maybe, I'm not sure. I'm certainly not going to waste time trying to construct one. The current average-of-results benchmark certainly looks good enough to distinguish between the different generations of processors in the above graph. If you want to do "differential benchmarking" - i.e. look for very small differences between subtly different setups, then I'd suggest using a fixed result in queue.txt and timing it. So the same simulation runs each time. It wouldn't be worth making a graph of it or using it as our overall benchmark though.
[TA]Assimilator1 2010-01-17 13:08:27	Fair enough, you're a busy man (how's the thesis going btw?), is there any optimisation already done that could be considered 'middle of the road'? What sort of variation in scores could different optimisations give btw? Re differential benchmarking, yea a few others suggested that too I think, I'll give that a shot at some point, thanks. Although you said different simulations would have different dependencies, so I guess it should be limited to testing the same rig with different BIOS settings. Oh, forgot to say, thanks for adding the edit function, so far it's worked fine for me. Why the 15min limit though?
Stephen Brooks 2010-03-14 22:57:18	I just realised I hadn't actually muon1benched my home (Athlon X2) machine since I upgraded from a single core, so running that now. I put a time limit on the edit to stop silliness I've seen like people editing their posts to say something totally different months after posting. And when I say "people" I mean myself getting told off by a moderator for doing this on the United Devices message board (remember that?) nearly 10 years ago.
[TA]JonB 2010-03-17 04:05:55	Finally got around to benching my new (replacement) processor. 1696.4 kpt/sec for a Phenom X4 955, overclocked to 3.7GHz with the stock AMD heatsink. 922.5MHz with 4.5 multiplier. Bus speed of 205MHz. PC2-6400 memory, 5-5-5-15 2T Mushkin. Perhaps time to change heatsinks and go for 4.0GHz. [Edited by [TA]JonB at 2010-03-17 04:09:38]
Stephen Brooks 2010-03-17 12:51:05	That's the highest scoring and clocked Phenom on the board, though the two on there so far don't seem to make sense ([TA]Assimilator's has a higher clock speed but is slower?)
[TA]Assimilator1 2010-03-18 07:18:31	That's because that rig was running 64bit XP. Any developments on improving 64bit speed? Oh & I've never been to the UD forum, but every other forum I've been to has no time limit on editing. I see what you're getting at, but I think 15mins is rather low. [Edited by [TA]Assimilator1 at 2010-03-18 07:28:09]
[TA]JonB 2010-03-19 16:25:16	and the PC I benched is running 32bit XP. It seems to be consistently giving me a value of 1669 kpt/sec now (after another 24 hours of leaving it alone).
Maniacken [US-Distributed] 2010-05-27 23:51:10	Curious how one would go about running N instances? How would i set up running 1 2 4 6 8 12 muon instances and setting processor affinity? This is my first time having a multiple core computer.
Universal Creations 2010-06-03 01:01:37	Dell Studio 17 with i7-720QM at stock speed, 1 muon client 8 threads: about 2900 kpts/sec. Tomorrow I'll try to maximize with 2 or 4 clients and maybe some extra clockspeed (if possible).
Stephen Brooks 2010-06-03 17:35:21	2900 seems a little high compared to the overclocked Core i7s already there - don't you mean 1900?
Maniacken [US-Distributed] 2010-06-05 01:50:14	Just got done running the new rig Core i7 980 6 core hyperthreading 12 gig memory Stock speed 3.33Ghz windows 7 64 bit ultimate 1 client running 93429,1744994.3,3312.72 96431,1754478.1,3307.71 96731,1755916.1,3312.55 97031,1756908.2,3312.53 So 3310 kpts/sec Might try overclocking in the future and go for 4Ghz [Edited by Maniacken [US-Distributed] at 2010-06-05 01:51:28]
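The comma-separated triples posted in this thread can be averaged with a few lines of Python. This is only a sketch: it assumes each triple is one muon1bench log sample and that the third field is the kpts/sec figure (the posters' own summaries suggest this, but the thread never documents the format). The names parse_samples and summarise are hypothetical helpers, not part of Muon1.

```python
from statistics import mean

def parse_samples(text):
    """Return the third field of every 'a,b,c' triple found in text."""
    rates = []
    for token in text.split():
        fields = token.split(",")
        if len(fields) == 3:
            try:
                rates.append(float(fields[2]))
            except ValueError:
                continue  # third field is not numeric, skip this token
    return rates

def summarise(rates):
    """Mean rate plus the min-to-max spread as a fraction of the mean."""
    avg = mean(rates)
    return avg, (max(rates) - min(rates)) / avg

# The four samples from the i7 980 post above.
posted = ("93429,1744994.3,3312.72 96431,1754478.1,3307.71 "
          "96731,1755916.1,3312.55 97031,1756908.2,3312.53")
avg, spread = summarise(parse_samples(posted))
print(f"{avg:.1f} kpts/sec, spread {spread:.2%}")
```

For these four samples the mean comes out a little over 3311 kpts/sec, in line with the "So 3310" in the post. The spread figure is a rough guide to how settled the running average is.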

[DPC] Mr. Aldi
2010-10-26 23:40:12

i7 920 d0 @ 3,99GHz: multiplier 21x (20x) @ 1,306 V (1,263 V) BCLK 190 MHz (133 MHz) QPI/VTT @ 1,125 V (1,175 V)
RAM: 1520 MHz (1066 MHz) @ 1,400 V (1,500 V), timing 10-10-10-24 (auto-defined)

Default values between (). Everything else just default.

First I thought I had to raise RAM voltage. But even stock @ 1,5 V gave memtest86+ errors. So lowered to 1.4 and could raise SPD multiplier w/o problems. Fun to see that I almost did not have to raise CPU & QPI voltage. I might be able to lower voltages even more (didn't do any testing with that yet), or raise BCLK even more

2888 kpts/sec (just calculated from running client on normal priority)

Absurd that Intel charged me just 200 euros for what is actually a 965

Wauw I love that Mr. Green

We should have a graph of kpts/s/W

[Edited by [DPC] Mr. Aldi at 2010-10-26 23:48:41]

Stephen Brooks 2010-10-27 00:27:38	My home Athlon X2 4400+ (2.2GHz) got 368 kpts/s. That's the benchmark from March I forgot to report on... I now have one for my workstation too. Mr. Aldi's score is very consistent with Haiya-Dragon's i7 at 4GHz too.
[DPC] Mr. Aldi 2010-10-27 10:57:01	I will try to get past the 3168 kpts/s barrier later this week with a higher overclock and using muon1bench. Maniacken should use his hexacore to get to 4500 kpts/s Now I understand how he could be doing the whole high-res scan all by himself

Stephen Brooks
2010-10-27 16:30:59

I haven't done a graph in a while so have added the results from this thread (apart from Universal Creations who never replied with a more accurate number).

Just remembered that the i7 980 is 200x faster than my P-II-400 (which scored 16.75!) That difference is accounted for by 6x cores, 8.3x clock speed (=50x in all) and leaving 4x for architecture improvements including hyperthreading.
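As a quick check of that breakdown, here is the same arithmetic in Python. The 3310 kpts/sec and 3.33GHz figures are Maniacken's i7 980 numbers from earlier in the thread, and 16.75 is the P-II-400 score. Taking the P-II clock as exactly 400MHz is an assumption, and speedup_breakdown is a hypothetical helper name.

```python
def speedup_breakdown(new_kpts, old_kpts, new_ghz, old_ghz, new_cores, old_cores=1):
    """Split a total speedup into core-count, clock and 'everything else' factors."""
    total = new_kpts / old_kpts
    cores = new_cores / old_cores
    clock = new_ghz / old_ghz
    residual = total / (cores * clock)  # architecture, hyperthreading, memory, ...
    return total, cores, clock, residual

# i7 980 (6 cores, 3.33GHz, 3310 kpts/sec) against P-II-400 (16.75 kpts/sec).
total, cores, clock, residual = speedup_breakdown(3310, 16.75, 3.33, 0.4, 6)
print(f"total {total:.0f}x = {cores:.0f} cores x {clock:.1f} clock x {residual:.1f} other")
```

The residual factor is whatever is left once cores and clock are divided out, so it lumps together architecture, hyperthreading and memory effects rather than isolating any one of them.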

[DPC] Mr. Aldi
2010-10-27 23:45:44

I thought that processing power depends on amount of transistors times clock speed (or something like that):

980: 1.170.000.000
P-II: 7.500.000
-> 156:1

When clock taken into account, the 980 should be 1300 times faster

Clearly this doesn't work this way. The i7 transistors are quite lazy as we can see
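Spelled out in Python, the naive estimate looks like this. The transistor counts are the ones quoted above. The clock ratio assumes 3.33GHz for the 980 against 400MHz for the P-II, and the measured ratio uses the 3310 and 16.75 kpts/sec scores reported elsewhere in this thread.

```python
# Naive model: throughput scales with transistor count times clock speed.
transistors_980 = 1_170_000_000
transistors_p2 = 7_500_000
clock_ratio = 3.33 / 0.4            # i7 980 vs. P-II (P-II clock assumed to be 400MHz)

transistor_ratio = transistors_980 / transistors_p2
predicted = transistor_ratio * clock_ratio
measured = 3310 / 16.75             # kpts/sec scores posted in this thread

print(f"transistors {transistor_ratio:.0f}:1, predicted {predicted:.0f}x, measured {measured:.0f}x")
print(f"naive model overshoots by about {predicted / measured:.1f}x")
```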

Hyperthreading doesn't improve. Client only has to spawn more threads. Turning HT off would yield same kpts/s.

---

Something else. I spawned 8 instances with:

C:\>C:\windows\system32\cmd.exe /C start /affinity 1 C:\muon\1\muon1_cq.bat
C:\>C:\windows\system32\cmd.exe /C start /affinity 2 C:\muon\2\muon1_cq.bat
C:\>C:\windows\system32\cmd.exe /C start /affinity 4 C:\muon\3\muon1_cq.bat
C:\>C:\windows\system32\cmd.exe /C start /affinity 8 C:\muon\4\muon1_cq.bat
C:\>C:\windows\system32\cmd.exe /C start /affinity 10 C:\muon\5\muon1_cq.bat
C:\>C:\windows\system32\cmd.exe /C start /affinity 20 C:\muon\6\muon1_cq.bat
C:\>C:\windows\system32\cmd.exe /C start /affinity 40 C:\muon\7\muon1_cq.bat
C:\>C:\windows\system32\cmd.exe /C start /affinity 80 C:\muon\8\muon1_cq.bat
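
The /affinity argument to start is a hexadecimal bitmask with one bit per logical CPU, which is why the values above run 1, 2, 4, 8, 10, 20, 40, 80. A small Python sketch that generates such lines for N instances follows. The C:\muon\<n>\muon1_cq.bat layout is copied from the commands above and affinity_commands is a hypothetical helper, so adjust the paths to your own install.

```python
def affinity_commands(instances, root=r"C:\muon", script="muon1_cq.bat"):
    """Build one 'start /affinity' line per instance, pinning instance n to logical CPU n-1."""
    lines = []
    for n in range(1, instances + 1):
        mask = 1 << (n - 1)              # a single bit set for this instance's CPU
        lines.append(f"start /affinity {mask:X} {root}\\{n}\\{script}")
    return lines

for line in affinity_commands(8):
    print(line)
```

Each instance needs its own folder, and for this arrangement each client would be set to a single thread in its config.txt.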

v4.44e client (configured to use 1 thread in config.txt). Gave me 1240 kpts/s. CPU load of 99-100%. Every client contributed for 1/8th +/- 2% to this total. So we can compare when running 1 simulation:

Then just 1 client 8 threads. CPU load of 90-99%, avg 96% (guess, at least little less), but got 1350 kpts/s. A significant improvement, where I expected the same or smaller number.

Funny

Xanathorn 2010-10-27 23:59:32	I got an amd hexacore (1090T black edition) running at stockspeed at the moment. I'll see coming weekend what I can press out of the cores.
Stephen Brooks 2010-10-28 02:12:27	--[I thought that processing power depends on amount of transistors times clock speed]-- Only if you know how to parallelise all x86 code by an arbitrary amount, which is impossible. It does work that way when you create entirely new cores, provided the applications still scale. Also, if all the i7 980X transistors were active at once, the CPU would probably dissipate about 800W and melt. --[Hyperthreading doesn't improve. Client only has to spawn more threads. Turning HT off would yield same kpts/s.]-- Turning HT off would make the client spawn half the number of threads because it would see half the number of logical CPUs. I tested it and it does make a big difference, at least on the 2004ish Xeon I tried. I need to ask Maniacken: how many threads does Muon1 spawn automatically on the i7 980? 6 or 12?
Maniacken [US-Distributed] 2010-10-28 06:35:25	Muon is spawning 12 threads, but it does go between 2, 5, 11, and 12 threads. I am also wondering why on the task manager my cpu usage only goes up to 93-95% use? I remember on other computer i have used in the past cpu usage staying at 100% Is there a way to force muon to not hyperthread and only use 6 threads? As in using only cores 2,4,6,8,10, and 12? Like setting a processor affinity?
tomaz 2010-10-28 07:26:25	HT does improve on i7, at least on 4 core. Take a look few pages back (date 2008-12-19). I got 1430 kpts (127.6 pts/Mclock/core) with HT off and 1960 kpts (175 pts/Mclock/core) with HT on. CPU was i7 920, factory settings. OCed at 3.55 GHz it went to 2552 kpts (180 pts/Mclock/core). I think pts/Mclock/core numbers really tell you something about architecture of CPU. i7s are fascinating. It takes 8 Opteron cores for 4 i7 cores, as we can see from graph above.
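tomaz's pts/Mclock/core figure is just the score divided by clock speed and core count. In the sketch below the 2800MHz stock clock is an assumption back-computed from his numbers (the thread does not state which clock he divided by). The 3550MHz overclock and the three scores are his, and pts_per_mclock_core is a hypothetical helper name.

```python
def pts_per_mclock_core(kpts_per_sec, clock_mhz, cores):
    """Score in pts per million clock cycles per physical core."""
    pts_per_sec = kpts_per_sec * 1000
    mclocks_per_sec = clock_mhz          # 1 MHz is one million clocks per second
    return pts_per_sec / mclocks_per_sec / cores

runs = [("HT off", 1430, 2800), ("HT on", 1960, 2800), ("HT on, OC", 2552, 3550)]
for label, kpts, mhz in runs:
    print(f"{label}: {pts_per_mclock_core(kpts, mhz, 4):.1f} pts/Mclock/core")
```

With those inputs the results land within a fraction of a point of the 127.6, 175 and 180 quoted in the post.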
[DPC] Mr. Aldi 2010-10-28 13:45:56	@ Maniacken. You could turn off HT, but see Tomaz's comment. You would expect that 6 threads could then load the CPU more efficiently (less overhead). If, with HT on, you set 6 threads in config and assign affinity to six cores, then as far as I have tested you only get it loaded to 50%. Now sometimes 13% in task manager actually means 25% (using one core), but for muon1 this doesn't apply. You will get half the kpts/s. I tested 8 clients with one thread. Load of 99 to 100%, but less efficient than 1 client with 8 threads (with lower CPU usage according to taskmgr). @Tomaz i7's are fascinating indeed! Also if we look at overclock potential w/o raising voltage. Intel made processors that also compete on price with AMD now. AMD always used to be a good bang for the buck
Stephen Brooks 2010-10-28 14:47:40	I'm not sure I totally trust the way that task manager reports CPU usage for the "extra" logical cores that HT produces. 93% may be as high as it can go, since the "two" HT cores are actually only sharing the resources of one.
[DPC] Mr. Aldi 2010-10-28 23:38:05	Task manager cannot be completely trusted. Tests here show that the best thing is to just leave the client alone There are benchmarking programs that will stop your system to respond (slow mouse etc). Muon1 is not able to do that (only a lill' bit if set to realtime, which does not help kpts/sec though), so there is clearly some processing power unused. But at least my system responds as I need
[TN]marvik 2010-11-16 19:23:26	If you still would like some benchmarks, here's an Intel Xeon X5650 @ 2.67GHz: 2958856,525118.4,4410.41 2959156,526123.6,4401.99 2959456,528578.9,4431.77 2959756,528693.4,4400.13 2960056,531049.7,4426.90 2960656,533398.9,4419.09 2961256,535638.2,4408.76 2961556,537203.5,4414.79 2961856,538416.7,4412.04 2962456,540770.6,4404.90 2962756,543116.2,4429.64
runesk 2010-11-16 20:18:25	marvik, that's an DUAL X5650
[TN]marvik 2010-11-16 20:59:17	Yes, that's right. Good you noticed
Stephen Brooks 2010-11-17 17:02:11	--[If you still would like some benchmarks]-- I always want benchmarks, despite users' habit of making me rescale the Mpts axis every few months
HaloJones 2010-12-18 10:42:36	--[I'm not sure I totally trust the way that task manager reports CPU usage for the "extra" logical cores that HT produces. 93% may be as high as it can go, since the "two" HT cores are actually only sharing the resources of one.]-- Operating systems are extremely bad at reporting accurate cpu utilisation with virtual thread processors. The second thread is generally only available when the first one has stalled. If the first one never stalls, the second thread will never be used and the OS will think the cpu is at 50% load (assuming two threads per core). In reality, the core is as busy as it can be. I work with Sun's 8-threaded servers and it is a nightmare trying to determine whether the server has loads of headroom or is absolutely maxed out. The OS is useless and without low-level diagnostic tools (thank God for dtrace) we'd have no idea what was actually going on.
[TA]Assimilator1 2011-01-18 15:14:59	[DPC] Mr. Aldi You shouldn't be using v4.44e, it's slower than 4.44d & was pulled shortly after its release many months ago. Your benchmark will be down because of it
Mezocop 2011-02-16 11:44:59	Phenom II X6 1055T @ 3511 MHz 1880 kpts/s.
K`Tetch 2011-02-25 05:00:04	Ok, a 'slow' one by the standards of the graph, but for completeness laptop with i3-380M (stock, 2.53ghz) on win7-64 So far, an initial run (new install of the client) 688 kpts/sec That's on short sims though, due to the new install.
K`Tetch 2011-04-18 22:33:00	Ok it was selling things short on long runs it's averaging 851.1kpts/sec thats fully stock and with ddr3-1066 ram on win7-64 using v4.45
Stephen Brooks 2011-04-19 01:40:33	Think I might need to start a new graph for v4.45 just to be safe.
K`Tetch 2011-04-21 02:08:01	well, having a fixed design that could be tested, and timed would be better, kinda like the high res scan The muon1bench already uses the system clock, so timing wouldn't be an issue. and having a single fixed design makes it easier to bench over several versions. Right now, because of the 300s intervals, you need at least 24hours to benchmark, and that's a long time to try and keep a system otherwise idle, for a true representation. A 300mpts sim would work, be long enough for the fast PC's (90 sec for the current graph-topper) and not SO long that it'd kill something like your atom (maybe 50mins going by the graph above)
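K`Tetch's timings follow directly from dividing the simulation size by the benchmark score. A sketch, taking 3310 kpts/sec (the i7 980 score posted earlier in this thread, which matches his 90-second figure) as the fast case. The 100 kpts/sec slow case is merely what his "maybe 50mins" estimate implies, not a posted Atom score, and run_seconds is a hypothetical helper name.

```python
def run_seconds(sim_mpts, kpts_per_sec):
    """Wall-clock seconds to finish a fixed simulation of sim_mpts at a given rate."""
    return sim_mpts * 1000 / kpts_per_sec   # Mpts -> kpts, then divide by kpts/sec

print(f"fast: {run_seconds(300, 3310):.0f} s")
print(f"slow: {run_seconds(300, 100) / 60:.0f} min")
```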
iNSaNiTi 2011-05-18 17:02:42	Nice results with ur i7 920 [DPC] Mr. Aldi! I have a i5 2500k but is not so fast @ stock speed i think a well clocked phenom x4 could be fast as i5 2500k. I bought the wrong cpu phenom x4 or i7 = better !!! What do u guys think ?
iNSaNiTi 2011-05-18 17:03:34	ps : still i got the fastest single core
Maniacken [US-Distributed] 2011-05-19 15:35:00	Stephen did you want to start a 4.45 graph? I can rerun the i7 and we can get accurate speed comparisons.
iNSaNiTi 2011-05-22 13:29:03	My SB rig : i2500K @ 4Ghz 1,17V (undervolted)/p67-GD65/4Gb 1600mhz Benched @ : 66231,226871.7,2309.59 67131,228334.1,2052.80 72831,241272.5,2205.47 73431,242377.4,2180.36 74031,244296.8,2246.06 75231,246679.2,2216.25 76131,248441.3,2195.85 76731,250246.6,2236.49 77031,250466.4,2199.81 77931,252848.6,2230.28 78832,254455.5,2201.88 79432,255842.7,2206.37 80332,257933.5,2213.09 80632,258350.6,2197.56 81232,260270.4,2233.99 Cya
Stephen Brooks 2012-03-29 16:43:47	Hmm, on the graphs like this one where I calculate the number of "GHz" on the project from the Mpts/sec rate, I use a fixed conversion factor that is meant to represent the average CPU currently in use. Right now, that conversion factor is set at 0.095 Mpts/s for each GHz (multiplied by the number of cores). That is, 95 kpts/s/GHz shown by the hollow grey bars on my benchmark graphs. This corresponds roughly to the Core 2/Phenom generation of processors, but the Core i7 chips on our graph average 175-180 on this efficiency measure. The best other chip is a Phenom II around 120. Maybe I should start a v4.45 graph but I'm not totally sure which of the older results were from that version. v4.45 came out on 2011-Mar-25, which should narrow it down a bit.
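The conversion described here can be sketched as follows. The 0.095 Mpts/s per GHz per core factor is the one from the post. The function names are hypothetical, and the example plugs in the i7 980 score (3310 kpts/sec at 3.33GHz, 6 cores) posted earlier in the thread purely as an illustration.

```python
MPTS_PER_GHZ = 0.095   # fixed factor from the post: Mpts/s per GHz of "average" CPU core

def equivalent_ghz(mpts_per_sec, factor=MPTS_PER_GHZ):
    """Convert an Mpts/s rate into 'GHz' of average CPU."""
    return mpts_per_sec / factor

def efficiency(kpts_per_sec, clock_ghz, cores):
    """kpts/s per GHz per core, the measure quoted as 95 for the average CPU."""
    return kpts_per_sec / (clock_ghz * cores)

print(f"{equivalent_ghz(3.31):.1f} GHz-equivalent")
print(f"{efficiency(3310, 3.33, 6):.0f} kpts/s/GHz")
```

A machine whose efficiency is well above 95 therefore counts as more "GHz" on the project graphs than its real clock speed times core count.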