stephenbrooks.org Forum > Muon1 > General > Recalculation of high scores shows lower yields
John Kitchen
2002-11-11 18:39:57
Recalculation of high scores shows lower yields but we must do it

I am interested in the amount of variance the simulation shows from one run to the next, especially for some high scores.  So I have been devoting most of my recent CPU time to verification of previous work units. 

When a work unit with a high score is recalculated, one typically sees lower yields as a result.  This is quite normal.  Why?  Because in order to get a high yield, you have to be doubly lucky.  You need a lucky high-yield design, and, on top of that, you need luck in the way the particles flow in this simulation run.  (You might call this "stochastic luck").  When you get both of these, you get high yield.  smile

Here is a way to picture it.  Imagine a plot of many simulations of the one design, a frequency chart showing the frequency of simulations hitting bands of yield values.  It looks like a bell-curve (let’s say approximately a “normal” distribution).

If we deliberately choose the highest-yield simulation, its yield sits to the right of the average, in fact maybe (I am guessing) up to 3 or even more standard deviations to the right of the mean.  So it is quite normal that re-simulating that (very) high yield design will generally result in lower yields than the original yield!  This is EXACTLY what I am seeing.

Recalculating a particular work unit 36 times (mad) showed a standard deviation in the yield of about 0.02, and the range from minimum to maximum was less than 2.8%.  That is maybe what one might expect as a result of stochastic effects.

But there is no similarity between these recalculations and the original yield.  frown The original yield was 3.555411. Recalculation showed nothing over 3.021, with an average of about 2.978. The original is 28 standard deviations from the mean of the recalcs.  Statisticians will tell you that is just too far for credibility.  Take a look at this CHART.
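
For anyone who wants to check the arithmetic, here is a minimal sketch (Python) of the back-of-envelope z-score, using only the figures quoted in this post:

code:
orig_yield  = 3.555411   # yield originally reported for the work unit
recalc_mean = 2.978      # mean of the recalculated yields
recalc_sd   = 0.02       # standard deviation of the recalculated yields

z = (orig_yield - recalc_mean) / recalc_sd
print(round(z, 1))       # ~28.9, i.e. roughly 28 standard deviations out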

You will find this work unit currently at position #3 on the rankings!  I have recalculated a couple of other very high yield work units and seen the same thing.  The yield drops to about 3. Give or take a bit.

So here are my conclusions and recommendations

1. The very high scores are probably wrong (for whatever reason) and should not be published.
2. The record-breaking yields that are being seen are an overstatement (higher than what one would expect on average for that design) due to stochastic error.
3. All yields should be recalculated prior to publication (to eliminate the above two effects).
4. We need a project mechanism for redistributing the promising designs for recalculation and giving people credit for this reprocessing effort.

Combine these changes with obstacles for the cheats and we will have a GREAT DC project.  I hope this analysis helps us to improve.

John
BrettJB
2002-11-11 21:32:49
I'm probably just going to put on a spectacular display of my own ignorance here, but it seems to me as though there are too many variables.  Wouldn't it make more sense to keep the particle "load" consistent, if what we're trying to do is optimize the physical design? 

I understand that the particle flow will vary in real life, but since we are simulating here, shouldn't we hold something constant, while varying something else?  Work with, say, an "average" particle flow and optimize to that... or am I just not understanding how the simulation works?

As I said, I'm certainly no expert on simulations, so just tell me if I've got my head up my @$$ (or if I'm merely simulating that!) wink

--Brett
Midon
2002-11-11 22:55:42
The point is that in reality you will also have these variations.  The creation of the particles in the target will have statistical variations.  The decay of the pions is also more complicated, because the pion decays into a muon and a neutrino or anti-neutrino.  This means that not only the time when the decay happens is statistical; the direction and the energy (!) of the muon also vary.  You try to simulate the physical laws.  One important difference is that you cannot simulate the real number of particles, because it is much too high.  This will give you a higher statistical variation than in reality.
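
A toy illustration of that randomness (Python; this is not the Muon1 physics code, just a sketch using the well-known ~26 ns charged-pion lifetime and an isotropic decay direction in the pion rest frame):

code:
import math, random

PION_LIFETIME_NS = 26.0   # charged-pion lifetime in its rest frame, ~26 ns

def sample_pion_decay():
    # the decay time is exponentially distributed; the direction is isotropic
    t = random.expovariate(1.0 / PION_LIFETIME_NS)
    cos_theta = random.uniform(-1.0, 1.0)
    phi = random.uniform(0.0, 2.0 * math.pi)
    return t, cos_theta, phi

# Two runs of the same design draw different decay times and directions,
# so the muons end up with different lab-frame energies and the final
# yield fluctuates from run to run.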
Stephen Brooks
2002-11-11 23:46:01
The highest scores are wrong due to a (known) intermittent bug in the program.  It's possible that I've already solved it along with the other bugs I've fixed while working towards v4.22; I will have to watch the results that appear after that new version carefully.  For now, the best250 already has these rogue results removed from it, or at least the grouping of anomalously high ones observed on a chart like that.


"As every 11-year-old kid knows, if you concentrate enough Van-der-Graff generators and expensive special effects in one place, you create a spiral space-time whirly thing, AND an interesting plotline"
DukeBox
2002-11-12 01:49:21
quote:
3.555411
Hey, that's mine.. so I'm very lucky?
John Kitchen
2002-11-12 03:46:37
quote:
Originally posted by Stephen Brooks:
The highest scores are wrong due to a (known) intermittent bug in the program.  It's possible that I've already solved it along with the other bugs I've fixed while working towards v4.22; I will have to watch the results that appear after that new version carefully.  For now, the best250 already has these rogue results removed from it, or at least the grouping of anomalously high ones observed on a chart like that.


"As every 11-year-old kid knows, if you concentrate enough Van-der-Graff generators and expensive special effects in one place, you create a _spiral space-time whirly thing_, AND an interesting plotline"


Sure, some known anomalies have been removed from the best250, but they have not been removed from rawstats nor from the official stats on http://www.stephenbrooks.org/muon1/. mad

What I am saying is that we need a mechanism to eliminate high scores which are a consequence of the known bug in the client, or strange numbers induced by overclocking or by overzealous editing.  wink

As every 11-year-old kid knows, when 4.22 arrives, we still have a very large database of results, some of which are plain wrong.  frown

Participants in this DC project typically have one or two of the following goals.  First, to clock up lots of particles and be recognized as one of the largest contributors.  Second, to make the highest contribution to the "science" of the project and come up with the highest yield as a measure of the best design.  In either case, they want to see their name up in lights, and true credit given.  big grin In BOTH cases, these results are, today, compromised.  mad mad

Of these two, at least it is possible to fix the second by a simple recalculation of the result prior to publication of high yields.  And that ENSURES that weird results don't make it into best250, which further helps the science objectives.  Right now, we have NO idea of the quality of the data that is seeding simulations in best250.

I, for one, am perfectly willing to provide computing resources to recalculate high scores prior to their publication and inclusion in the best250. I care more about the science goals of this project than the BS statistics of the "Total particles simulated". Having looked at the stuff that is passing through the FTP site, this statistic is clearly somewhat meaningless.

The issue is the need for recalculation, Stephen!

John
John Kitchen
2002-11-12 03:49:27
quote:
Originally posted by DukeBox:
quote:
3.555411
Hey thats's mine.. so i'm very lucky ? 


No DukeBox, you are very UNLucky.  3.555411 is BS.  It is not achievable with that design.  3.555411 is the result of something BAD happening.
BrettJB
2002-11-12 04:34:14
My 2/100 of $1:

1) Even if there are some suspect data points, wouldn't they stick out when graphed (isn't that how Stephen decided which points to toss on the last "edit" of the top 250)?  We'd have a large cluster of good results (looks like those are currently in the 3.02-3.04 range) which should be at or near the optimal configuration, and we'd have some questionable results, such as my top score of 3.09xxxxx (15th!!!  woo-hoo!).  Knowing that there's a bug in the software, one could either toss out the questionable results or re-crunch them as John suggested.  I hope that there's no need for re-crunching the entire database of results.  Or is the bug more pervasive than that?  My feeling is that we've got a large database of perfectly good results, with just a few bits of "noise" scattered throughout.  But that's the opinion of someone who doesn't know the exact nature of the bug that Stephen mentioned...

2) John mentioned overclocking as a potential cause, and we've also got a known intermittent bug in the software.  Certainly, overclocking a processor beyond its rated speed does offer the potential for Bad Things™ to happen, but is this software more susceptible to that than others?  Two of my rigs are mildly overclocked-- T-bird 1.2 -> 1.4 and a K6-2 400 -> 450 (woo hoo!).  Both run as stable as my non-overclocked machines (i.e. the only reboots are for software updates that require reboots).  But if overclocking is the cause (or one of the causes) I'd like to know, so I can either dedicate those machines to projects that aren't as sensitive to overclocking, or back them down to rated speed and continue with DPAD.

Higher scores are nice to have for the recognition factor (have I mentioned yet that, as of the last stats update, I've got the 15th highest yield?  wink ) but not at the cost of the science.  Who wants a high score on a project where the results themselves become suspect?  *cough* SETI *cough* wink

--Brett

[This message was edited by BrettJB on 2002-Nov-12 at 11:47.]
DukeBox
2002-11-12 04:59:55
quote:
Originally posted by John Kitchen:
No DukeBox, you are very UNLucky.  3.555411 is BS.  It is not achievable with that design.  3.555411 is the result of something BAD happening.

-> <-
DukeBox
2002-11-12 05:01:51
quote:
Originally posted by BrettJB:
My 2/100 of $1:
2) John mentioned overclocking as a potential cause,


My 'bad' score was done on a 200 MHz P1 CPU.  No overclocking (it's a laptop).  It was a clean install with the best 1000 results file.  The funny thing is that it was the first point it made.
Herb[Romulus2]
2002-11-12 06:28:19
quote:
It was a clean install with the best 1000 results file.  The funny thing is that it was the first point it made.

Very odd, my long time leading best result 3.029xx was done the very same way confused

-------------------------------
Y0k compliant, counting upwards wink
astra412
2002-11-12 07:24:07
My top score, the 1st place 3.586%, was also created directly after uploading the best 1k. ????  Going back to statistics and what can be inferred from them it would seem there is a common thread among some of the oddly high top muon retentions.

Astra412
BrettJB
2002-11-12 08:21:22
So since my top score was generated using only the top 250, and since it wasn't the first point I generated, does that mean mine is "good"?  wink

Nah, probably still suspect, as I believe it was created when we had the top250 with the "bad" results still in there. 

Stephen, can you share with us what you think the nature of the intermittent bug is?  It does seem as though there's a common thread of a brand new install, top 1000 file and first point generated...
John Kitchen
2002-11-12 09:45:04
quote:
Originally posted by astra412:
My top score, the 1st place 3.586%, was also created directly after uploading the best 1k. ????  Going back to statistics and what can be _inferred_ from them it would seem there is a common thread among some of the oddly high top muon retentions.

Astra412


IMHO, if the particle count is uncharacteristically high (e.g. 70K), the result is suspect.  frown

Astra412, if you want your work unit recalculated, send it to me and I'll crunch it, say, 10 times and post and/or email the result.  Your choice.  Click on my name for the email address, and make sure you fix it by removing the anti-spam part wink

Just to be sure, it would be good to have MULTIPLE independent recalculations on different recalculation clients, so anyone else with a modified client (modified to recalculate old work units) is encouraged to enter this discussion big grin
astra412
2002-11-12 10:12:34
Hi,

I have two I would be interested in having recalced if you're willing.  I can send you the top one if you think that it is worth it, but it was clearly created by the program glitch.  A more interesting result my computer has generated is an 80k particle result with only a 3.25*% retention, which was also created soon after uploading (and modifying) the new b250. Given the way the #particles/%retention ratio goes this result seems more realistic, though I have no idea what the theoretical limit is.  From just paging through my results there seems to be a <~60k/3.0**% limit on the program.  What info do you need me to send?  I'm assuming all I need to do is copy and paste the line from the results.dat file?

Thanks,

Astra412
DukeBox
2002-11-12 10:19:09
Is it easy to re-crunch a point?  Otherwise I can offer a server to re-crunch all the points for validation.
John Kitchen
2002-11-12 15:51:57
quote:
Originally posted by astra412:
Hi,

I have two I would be interested in having recalced if you're willing.  I can send you the top one if you think that it is worth it, but it was clearly created by the program glitch.  A more interesting result my computer has generated is an 80k particle result with only a 3.25*% retention, which was also created soon after uploading (and modifying) the new b250. Given the way the #particles/%retention ratio goes this result seems more realistic, though I have no idea what the theoretical limit is.  From just paging through my results there seems to be a <~60k/3.0**% limit on the program.  What info do you need me to send?  I'm assuming all I need to do is copy and paste the line from the results.dat file?

Thanks,

Astra412


Astra412
Sure.  All you need to do is email me the lines from the results.dat file.  To put it in perspective, it takes my machine about 78 minutes to process a high-yield work unit, and I prefer to do the same one a number of times so we see the spread.  BOTH of the work units you mention seem interesting, and I'd be happy to run each of the two say 7 times at least to start.  That would be about 18-19 hours of CPU time.

quote:
Originally posted by DukeBox:
Is it easy to re-crunch a point?  Otherwise I can offer a server to re-crunch all the points for validation.


DukeBox
You just need a modified client that, instead of doing its random things at the start, reads in a file containing a "result" and simulates that configuration, creating a new "results" output.  Several people have these; the one I use is not mine to give away.
__________________________________________________

Finally...
I probably should not have mentioned "overclocking". Seti@home suffered somewhat from this, but THIS project has much different exposures.  The data passing through the FTP site has some VERY strange stuff.  For example, a file went through recently with THOUSANDS of workunits, and the MINIMUM yield was 2.9500 eek eek eek

The standard client chooses quite random designs 25% of the time, and this is why you will see at least a quarter of your results with very low yields.  (Below 0.6 typically.) In order to get a minimum yield as high as 2.9500 (a strangely round number for starters mad), the file must have been tampered with either by hand, or by non-standard software.  I am not saying "cheat", but the opportunity is there.  Someone could have manually removed all the low yield work units from their results.txt file prior to submission and forgone all the credits they would get for processing them.  Yeah, RIGHT!  big grin
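
A crude server-side check along those lines could be run on each incoming file.  A minimal sketch (Python), where the 0.6 and 25% figures come from this post and the parsing of the results file into a list of yields is only assumed, since the exact format isn't given here:

code:
def looks_filtered(yields, low_cut=0.6, expected_low_frac=0.25):
    """yields: list of floats parsed (somehow) from one submitted results file."""
    if not yields:
        return False
    low_frac = sum(1 for y in yields if y < low_cut) / len(yields)
    # a big file with almost no low-yield results, or an implausibly high
    # minimum, suggests the low designs were stripped out before sending
    return (len(yields) > 100 and low_frac < expected_low_frac / 5) or min(yields) > 2.0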

What I am getting at is that people ARE playing with the results.txt files, and I have a suspicion that some of this "playing" is quite sophisticated.  wink

THAT is why I think recalculation is important.

John
prokaryote
2002-11-12 17:03:00
I agree that statistical verification of high results is mandatory.  Since the object of the project is to produce an efficient design, it is only these results that matter (in the end).  However, from a project stats aspect, which is the main motivation for DCers, there should probably be some sort of encrypted checksum to prevent artificial particle yields from inflating scores.
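
One way such a checksum could work, sketched in Python with the standard hmac module.  This is only an illustration of the general idea, not how Muon1 actually does (or will do) it, and any key shipped inside the client can in principle be dug out by a determined user:

code:
import hmac, hashlib

CLIENT_KEY = b"shared-secret-baked-into-the-client"   # illustrative only

def sign_result(line: str) -> str:
    tag = hmac.new(CLIENT_KEY, line.encode(), hashlib.sha256).hexdigest()
    return f"{line} #{tag}"

def verify_result(signed: str) -> bool:
    line, _, tag = signed.rpartition(" #")
    good = hmac.new(CLIENT_KEY, line.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(tag, good)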


www.ninjamicros.com mathematical projects
pvs
2002-11-12 18:19:32
quote:
Originally posted by John Kitchen:
The data passing through the FTP site has some VERY strange stuff.  For example, a file went through recently with THOUSANDS of workunits, and the MINIMUM yield was 2.9500

The standard client chooses quite random designs 25% of the time, and this is why you will see at least a quarter of your results with very low yields.  (Below 0.6 typically.) In order to get a minimum yield as high as 2.9500 (a strangely round number for starters mad), the file must have been tampered with either by hand, or by non-standard software.  I am not _saying_ "cheat", but the opportunity is there.  Someone _could_ have manually removed all the low yield work units from their results.txt file, prior to submission and forgone all the credits they would get for processing them.  Yeah, RIGHT!  big grin

What I am getting at is that people ARE playing with the results.txt files, and I have a suspicion that some of this "playing" is quite sophisticated.  wink

John


Just an example of how you get result files containing only 2.9xxx and higher.
This is what I do with my result files:
I open results.txt, take all results <2.9 and put them in a results_to_send.txt file, then put all results >=2.90000 in a file called bunker.txt.
I store results_to_send.txt in a folder "sent" and then store it as results.txt in a folder called "copy" where I have a copy of all the Muon files, and send it with manualsend.
Just for my personal enjoyment I try to stay at, let's say, position 75 of the stats; if someone passes me, I use the results in bunker.txt if the lower results are not enough.
But I do not have thousands of these results.  (Someone with a lot of boxes could have.)

Peter
Matai
2002-11-13 01:04:25
Hi,

quote:
--------------------------------------------------------------------------------
Originally posted by John Kitchen:

The standard client chooses quite random designs 25% of the time, and this is why you will see at least a quarter of your results with very low yields.  (Below 0.6 typically.) In order to get a minimum yield as high as 2.9500 (a strangely round number for starters ), the file must have been tampered with either by hand, or by non-standard software.  I am not _saying_ "cheat", but the opportunity is there.  Someone _could_ have manually removed all the low yield work units from their results.txt file, prior to submission and forgone all the credits they would get for processing them.  Yeah, RIGHT! 

What I am getting at is that people ARE playing with the results.txt files, and I have a suspicion that some of this "playing" is quite sophisticated. 

John
--------------------------------------------------------------------------------

If I didn't get it wrong, the main goal is to reach a maximum result.  I throw away all results lower than 1 and forget about the credits.  Yeah, RIGHT!

I even stop processing when I see that a top result will never be reached (let's say less than 10,000 at 20.00 ns, mostly randoms).  Here I'm losing credits again.
But that's the reason I got a high credit even with a low number of results.

Sending only non zero results is probably the reason that I didn't get any duplicates until now (?)

Is that cheating?

Bernd
Stephen Brooks
2002-11-13 01:54:28
I want to find a way to do this checking without doubling _all_ of everyone's work.  How about this: the program saves a result in limbo if it happens to be the highest found so far on that machine (or maybe one of the highest three, or ceil(N^0.4) where N is the number of results in results.dat), recalculates it once (or twice), and then only logs the _lowest_ of these 2 or 3 scores to results.txt and .dat when finished?
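
A minimal sketch of that logic (Python; the simulate() call, the design object and the list of existing yields are placeholders, and the top-ceil(N^0.4) rule is the one suggested above):

code:
import math

def log_score(design, first_yield, existing_yields, simulate, rechecks=2):
    """Return the yield to write to results.txt/.dat for this design."""
    n = len(existing_yields)
    k = max(1, math.ceil(n ** 0.4))          # how many "top" results to be suspicious of
    top_k = sorted(existing_yields, reverse=True)[:k]
    if n >= k and first_yield <= min(top_k):
        return first_yield                    # ordinary result: log it straight away
    # suspect result: hold it in limbo, re-run it, and log only the lowest score
    scores = [first_yield] + [simulate(design) for _ in range(rechecks)]
    return min(scores)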

I'd prefer to track down this bug (and there's another bug in 4.22 right now that's related and I'd like to kill off), but I suppose a re-checking system for the higher results would be a good way of sending the probability of a wrong result from 1 in 10000 to 1 in 10000^2 (needing 2 wrong ones in a row to leave an error).

I might do a big graph of particles vs. score and eliminate the rogues so far entirely from the database.
Shall I add this re-checking feature to v4.22?  It shouldn't be too hard to do, and would mean you'd sometimes get a new file in your dir where the muon program is storing a "suspect" result.  I'll also put some simple scrambling on this file so people can't rename it to results.txt and send it.

As for the person sending only results over 2.9500%: you've seen the number of "Muon proxy" programs about currently... Many of these have results-sorting and filtering features.


"As every 11-year-old kid knows, if you concentrate enough Van-der-Graff generators and expensive special effects in one place, you create a spiral space-time whirly thing, AND an interesting plotline"
David
2002-11-13 03:52:37
The problem with picking the lowest is the same as the problem of picking the highest - it may be on the other tail of the distribution.

On the other hand, it should be clear that, in the absence of the bug manifesting itself, the width of the distribution is quite narrow, suggesting that your simulation parameters are appropriately chosen.

I suggest that checking of very high values as described at the start of this thread should be sufficient.

How about the following:

Step 1 - As currently, a candidate design is generated randomly and the results returned.

Step 2 - If a new highest value is returned, cross-check it using a "trusted" simulator.  This may involve multiple runs, so perhaps an alternative network should be used.  Credit may be given for this processing if it is distributed to external users.

Step 3 - If the highest value is verified (using suitably vague criteria), then add it to the top list.  Clearly, there may be a delay whilst the verification processing is conducted.

Step 4 - If not verified, contact the supplier and enquire as to how the value was generated.  Educate or sanction as necessary.
Stephen Brooks
2002-11-13 05:18:35
No that sucks.  smile I'm not setting up an entirely different DC project just to do the checking, it's far easier (from my end) to have it built into the client program.  I'm about to do a big graph of the global particles vs. percentage distribution so I can see clearly what region should be removed.

Your comment about "taking the lowest" is valid - it would mean we could occasionally have a high result "buried" in a dat file because one run gave it a -0.1 score fluctuation.  Roughly speaking, taking the min of 3 normal distributions triples the P in the lower tail (so 1 in 1000 becomes 1 in 333) but CUBES the P in the upper tail (so the old 1 in 1000 point becomes 1 in 1 billion!).

Anyway, I thought about it, and I think that if I take a result, check it 2 extra times (so 3 values) and then use the MIDDLE value, I'll get very shortened tails at both ends (in my example, both 1 in 1000 points would be replaced by something of the order of 1 in 1 million).  Out of 4 repeats, I'd pick the second-lowest, as the upwards randomness is more harmful than the downwards.
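
The tail arithmetic behind that, for a point whose single-run exceedance probability is 1 in 1000, assuming independent and identically distributed runs (a back-of-envelope sketch in Python):

code:
p = 1/1000                          # one-sided tail probability for a single run

min3_upper = p**3                   # min of 3 above the point: ~1e-9 (the cube)
min3_lower = 1 - (1 - p)**3         # min of 3 below the point: ~1/333 (the tripling)
med3_tail  = 3*p**2*(1 - p) + p**3  # median of 3 beyond the point: ~3e-6, either side

print(min3_upper, min3_lower, med3_tail)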


"As every 11-year-old kid knows, if you concentrate enough Van-der-Graff generators and expensive special effects in one place, you create a spiral space-time whirly thing, AND an interesting plotline"
astra412
2002-11-13 06:43:05
Hi,

I like this idea a lot.  I'd rather know the results my computer is producing are valid than get flashing lights next to my name.  However, this doesn't preclude people from cheating unless you have some way to know that a result has been recalced.  Additionally, people may simply delete these high results in the interest of getting those 3 extra runs.

Just a thought;

Astra412
Stephen Brooks
2002-11-13 07:11:13
quote:
Originally posted by astra412:
However, this doesn't preclude people from cheating unless you have some way to know that a result has been recalced.


Well I said in my original post that I'd soft-encrypt the "pending verification" file, but actually all I need do is omit the checksums from that file and only have Muon calculate them when they have been fully tested.  smile

quote:
Additionally, people may simply delete these high results in the interest of getting those 3 extra runs.


That is true, though for now I could just triple the particle-count for verified results (cludge).  In v5 I planned to do the credits by particle-timestep, so essentially a function of the FLOP count.


"As every 11-year-old kid knows, if you concentrate enough Van-der-Graff generators and expensive special effects in one place, you create a spiral space-time whirly thing, AND an interesting plotline"
Stephen Brooks
2002-11-13 12:39:45
Here's a chart with a bit more data on it.



Those blue points out of the top and a bit to the right are the rogues.  As you can see, they're quite rare compared to the run-of-the-mill results (the density colours run from red for 1000 in one square to white for >10000), and I should be able to get rid of them by applying a cutoff at about >65000 particles.  But I'll only do this once v4.22 is out with checksumming and re-checking of its own results, so no more rogues will appear after I've done it.


"As every 11-year-old kid knows, if you concentrate enough Van-der-Graff generators and expensive special effects in one place, you create a spiral space-time whirly thing, AND an interesting plotline"
astra412
2002-11-13 13:19:31
Cool!

Did you consider plotting a best-fit line?  It looks like you could fit a regression with an asymptote between 65k and 68k very nicely into the scatter plot.

Astra412
John Kitchen
2002-11-13 15:10:25
quote:
Originally posted by Stephen Brooks:
Here's a chart with a bit more data on it.
*snip*
Those blue points out of the top and a bit to the right are the rogues.  As you see they're quite rare compared to the run-of-the-mill results (the density colours are red for 1000 in one square going to white for >10000) and I should be able to get rid of them by applying a cutoff at about >65000 particles.  But I'll only do this once v4.22 is out with checksumming and re-checking of its own results, so no more rogues will appear after I've done it.



Cool graph!  cool Does it seem like we are converging towards one design or a number of them?

Shortly I should have about 50 recalcs done of a known good design, and I plan to post a frequency chart showing the distribution of the yields.  The interim results look REALLY interesting!  big grin
Stephen Brooks
2002-11-13 15:36:00
I think we essentially have already converged, to that patch of designs that looks like the head of the "comet" pointing rightwards.  The elliptical shape of the comet-head is what I'd expect from a bivariate normal distribution, so I think the true current best design must be somewhere near the centre of that.


"As every 11-year-old kid knows, if you concentrate enough Van-der-Graff generators and expensive special effects in one place, you create a spiral space-time whirly thing, AND an interesting plotline"
John Kitchen
2002-11-13 16:43:52
OK, here is the frequency distribution of 42 recalculations of the highest score out of about 2,200 results.  The high score IS, as I guessed, about 3 standard deviations from the mean.  The reason the high score is so far off is that it is the highest score out of such a large sample.  The yield is in part high due to random effects, not just a "good" design.  I know that the original yield has not been tampered with.

One interim conclusion I would come to is that we might need to run more particles through the simulation to reduce the variance.
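
To see why the best of roughly 2,200 runs should sit a few standard deviations above the mean even when nothing is wrong, here is a quick Monte Carlo sketch (Python; purely illustrative, assuming roughly normal run-to-run noise):

code:
import random, statistics

N, TRIALS = 2200, 500
maxima = [max(random.gauss(0.0, 1.0) for _ in range(N)) for _ in range(TRIALS)]
print(statistics.mean(maxima))   # ~3.3-3.4: the top of 2,200 normal draws is
                                 # expected to sit over 3 standard deviations up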

KParks
2002-11-14 00:08:45
John,

I agree with your analysis.  Not sure what you are doing to recalc but if you are interested in having someone else to recalc more data for you I'd be happy to help.  I find the distributions you are producing very interesting.
David
2002-11-14 06:49:32
quote:
Originally posted by Stephen Brooks:
No that sucks.  smile I'm not setting up an entirely different DC project just to do the checking, it's far easier (from my end) to have it built into the client program.  I'm about to do a big graph of the global particles vs. percentage distribution so I can see clearly what region should be removed.


Fair enough - it's your project, and implementing a double-checking system as I described would be more onerous than the alternative you suggest.

GIMPS attempts to double check all results - I believe they try to use different h/w and/or s/w platforms where possible as well.  This is certainly the case where a possible "winner" has been found.

EVOCHESS checked good candidate players to more accurately evaluate their fitness by running them against different opponents in a trusted environment (i.e. back at base).

Note that GIMPS has enough data to claim that 1.8% of all returned results have been subject to arithmetic errors - whether caused by s/w or h/w glitches or cosmic rays is undetermined.  Their s/w thrashes the FPU on a processor very hard - does Muon1?  (I'm not a source code junkie - I don't do C/C++, but it seems likely.) It may be that a similar effect has occurred here.
John Kitchen
2002-11-14 10:28:53
quote:
Originally posted by KParks:
John,

I agree with your analysis.  Not sure what you are doing to recalc but if you are interested in having someone else to recalc more data for you I'd be happy to help.  I find the distributions you are producing very interesting.


KParks

Right now I am recalculating a couple of designs that have huge yields to see if they still hold up.  (So far, they don't, which is no great surprise.)

I am also re-doing one of the top designs from the best250, but this time, one of the more normal looking ones with about 59,000 particles.  This is just for personal interest.

Here is the IMPORTANT bit.  cool

I think it is worthwhile recomputing the best250 to refine the yields and make sure that the "top" ones are at the top because they are really good designs, not just because the single run that computed them was really "lucky".


Such a "best250" file could REALLY refine the mutations.

But this is a massive task even if each one were only re-computed 3 times.  And it requires co-ordination.

For this we need

1. An OFFICIAL "Recomputing client". Hello, Stephen wink How about it?  big grin
2. An agreed plan for recomputation (how many times to recompute, what we do with the results, who calculates which out of best250, and how the results get collated into a new "best of" file and so on)
3. Quite a few volunteers like yourself smile to do the crunching.

I have a draft plan in my head, and if there is demand, I would be prepared to document it, but we need Stephen's agreement to make it worth while, and to ensure that we retain the integrity of the science objectives.

As an indication of interest, how about someone starts a new thread called "Volunteers" or something, and records their commitment of recomputing capacity, measured in estimated work units per day.  If no-one starts it, I'll go back to sleep.  If they do, I will add my offer of at least 18 WUs per day on my MP2100. razz

John
ZeRo_DiViDe[DPC]
2002-11-14 12:22:32
I have a 3.09 score pending on one of my computers ... does that mean I have the highest LEGAL (probably) score?


MWah haha big grin

cool
Chuck Coleman
2002-11-15 05:20:29
This is where we need some statisticians.  The best results are really extreme values.  We need a statistician to work out the math to determine which values should be further explored.

This is really Phase I of the project: identifying candidate "optimal" parameters.  Phase II consists of replicating the simulations with the selected "optimal" parameters to get a better idea of how they perform.  Again, statistical advice will be needed.  Phase III would then apply GA to the results of Phase II to see if even better designs exist.  Phase IV would replicate Phase II for the results of Phase III.

Phases II-IV would actually require less computational work than Phase I, but a lot more conceptual work.

cool

"Sorry, no concluding witticism"
John Kitchen
2002-11-15 15:25:24
It's DEFINITELY a duck, not a goose...
But is it a mallard?



[This message was edited by John Kitchen on 2002-Nov-15 at 23:50.]
[DPC]Stephan202
2002-11-15 15:30:49
LOL @ John Kitchen big grin

---
Dutch Power Cow.
MOOH!
Chuck Coleman
2002-11-21 17:12:35
As I said earlier, this project needs a statistician.  So, here I am!  smile

DPAD is an example of global optimization in a very high dimensional space (>100 dimensions).  Because of the curse of dimensionality it is impossible to conduct an exhaustive search, unless, of course, you're willing to wait several lifetimes of the universe.  wink Thus, we have to resort to an intelligent way of finding the global maximum, while bypassing local maxima on the way.  Genetic algorithms (GA) are an example of a method to handle this type of problem.  Normally, the objective function is known, but not amenable to deterministic optimization methods.  GA, if done properly, will most of the time converge to the global maximum.  In this project, the objective function is unknown, so it is estimated by simulation.  As has previously been pointed out in this thread, the values reported in the 'best250' file are extremes and not good estimates of the efficiency of a design.  However, this does not mean that this exercise is a waste of time.  In fact, there may be enough data to call off this phase of the project.  eek

Assuming that either the data (i.e., simulation results) are homoscedastic (equal in variance) or that variance increases with efficiency, the 'best250' file is an excellent guide to finding an optimal (not necessarily the optimal) design.  The trick is to use all of the simulations that have been conducted in a neighborhood of the 'best250'. A neighborhood is created simply by, for each dimension, placing bounds above and below the minimum and maximum of the 'best250' for that dimension.  (Of course, the minimum or maximum may coincide with a bound, in which case that value is used.) Then the space is partitioned into hypercubes, so that each cube has at least 30 data points.  Where the range is small, that dimension is collapsed: all of the values in the neighborhood for that dimension are used.  The partition itself is generated by dividing each dimension's (expanded) range into smaller pieces of the form [ low, high ], where low and high are integers as used by the project.  The distance between low and high can be something like 10. Then, for each hypercube, the mean is computed.  The means are then sorted, and the highest is t-tested against successively lower means until some subset is obtained that is statistically indistinguishable from the best hypercube.  If the data are plentiful enough, these hypercubes can be subdivided and the analysis repeated.  Once this is done, simulations of equal size (say, 100,000 particles) are done throughout the hypercubes and the partitioning and comparing process is repeated until an optimal design is obtained.  If there are already enough data, this last step will be unnecessary and the project can be halted.
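
A rough sketch of the partition-and-compare step (Python with NumPy/SciPy).  The arrays X (one row of design parameters per result) and y (the yields) are assumed to have been loaded from the project data somehow, and Welch's t-test is used rather than the pooled-variance test, to be safe under unequal variances:

code:
import numpy as np
from scipy import stats

def best_hypercubes(X, y, best250_X, width=10, min_points=30, alpha=0.05):
    lo, hi = best250_X.min(axis=0), best250_X.max(axis=0)
    inside = np.all((X >= lo) & (X <= hi), axis=1)       # neighbourhood of the best250
    Xn, yn = X[inside], y[inside]
    cells = np.floor((Xn - lo) / width).astype(int)      # hypercube index per result
    groups = {}
    for key, val in zip(map(tuple, cells), yn):
        groups.setdefault(key, []).append(val)
    cubes = {k: np.asarray(v) for k, v in groups.items() if len(v) >= min_points}
    ranked = sorted(cubes.items(), key=lambda kv: kv[1].mean(), reverse=True)
    best_key, best_vals = ranked[0]
    keep = [best_key]
    for key, vals in ranked[1:]:
        t, pval = stats.ttest_ind(best_vals, vals, equal_var=False)
        if pval < alpha:          # statistically worse than the best cube: stop
            break
        keep.append(key)
    return keep                    # cubes indistinguishable from the best one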

The analysis assumes that the data are homoscedastic (same variance) within hypercubes, and that there are no duplications and cheats.  Furthermore, bias can be expected if generators of low efficiency results withhold their findings.

For many dimensions, it appears clear that the optimal neighborhood has been found.  Other dimensions have wide ranges, indicating that they have yet to be pinned down.  Still others exhibit evidence of potential outliers (a.k.a. "rogues"): the minimum or maximum is far from the mode (and mean and median).  These are either true outliers, evidence that the value of a particular dimension is not important, or evidence of a better solution.  Only proper statistical testing can tell.  These last two cases are really the interesting problems.  Once all is said and done, only a close-to-optimal design can be obtained.  Whether it is good enough is up to the Neutrino Factory scientists to tell.

The statistical analysis, including means, medians, modes and ranges for each dimension, can be found at http://geocities.yahoo.com/chuckcoleman/analysis.xls. The Gini coefficient would be a useful indicator of potential outliers, but I can't compute it in Excel.

A spreadsheet version of the 'best250' can be found at http://geocities.yahoo.com/chuckcoleman/results.xls.

Stephen, I'm available for doing statistical analysis.  Do you have SAS?

:cool

"Sorry, no concluding witticism"
Stephen Brooks
2002-11-21 18:23:52
There is a formula for doing smooth interpolation of noisy data like this, but it requires a sort-of "blurring radius" that depends on how dense the results are and how smooth you want it.  I'll probably do the analysis when the whole 4.x series finishes... I would e-mail you the database but it's rather large (800M).

If I understand what you're talking about correctly, you're saying restrict attention initially to results that have all their parameter values between the highest and lowest value of the particular parameter present in the best250. Then do some averaging over nearby results and search around for the smoothed maximum.

I don't have statistical analysis software, but I could write something to filter out all the results that are in the hypercuboid-bounding-the-best250 region you talk about.  Maybe they'd be few enough to send to you somehow.


"As every 11-year-old kid knows, if you concentrate enough Van-der-Graff generators and expensive special effects in one place, you create a spiral space-time whirly thing, AND an interesting plotline"
John Kitchen
2002-11-22 04:08:29
quote:
Originally posted by Chuck Coleman:
The statistical analysis, including means, medians, modes and ranges for each dimension, can be found at http://geocities.yahoo.com/chuckcoleman/analysis.xls. The Gini coefficient would be a useful indicator of potential outliers, but I can't compute it in Excel.

A spreadsheet version of the 'best250' can be found at http://geocities.yahoo.com/chuckcoleman/results.xls.



Chuck, you have piqued my interest, but the links don't seem to be working?  frown John
Chuck Coleman
2002-11-22 10:47:33
Stephen:

I think you understand me...I'm really proposing a kind of smoothing on the assumption that the global maximum is in the neighborhood of the best250.

I don't know what smoother you're talking about.  I'm familiar with nonparametric density estimation, which can be feasibly done in a maximum of about two dimensions.  Again, it's several multiples of the lifetime of the universe.

John:

Sorry about the bad links.  The correct links are Results Spreadsheet and Analysis Spreadsheet.

"Sorry, no concluding witticism"

"Sorry, no concluding witticism"
Chuck Coleman
2002-11-22 10:48:44
That is, nonparametric density estimation in >100 dimensions would take several multiples of the lifetime of the universe.

"Sorry, no concluding witticism"
[DPC]Stephan202
2002-11-22 13:25:24
LOL.  Geocities keeps saying the files don't exist.  Probably because they are accessed through an external location.

I think this will work (well, it did here):
http://www.geocities.com/chuckcoleman/muon/

---
Dutch Power Cow.
MOOH!
Stephen Brooks
2002-11-23 06:53:52
On any metric space where you have a "function" (perhaps with randomness added) sampled at points f(x_i)=y_i, you can construct a smoothed estimate of f(x) from the weighted average

sum{((d(x,x_i)+r)^-p)y_i}/sum{(d(x,x_i)+r)^-p}

i.e. weighting each value by (d(x,x_i)+r)^-p where d(x,y) is the distance between points in the domain, r is the "smoothing radius" (can be 0 for data that are known to be exact) and p is a power thing, (2 works and is fastest to compute).

This is just straightforward smoothing and won't get rid of outliers like our 'rogue results' or anything like that, but in regions of dense data (like our duck's beak) it will give a good approximation to the true function.
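
In code the smoother is just a weighted average; a minimal NumPy sketch of the formula above (with the caveat that at r=0 a query point that coincides with a sample needs special-casing, since its weight blows up and you get exact interpolation):

code:
import numpy as np

def smoothed_f(x, X, Y, r=1.0, p=2):
    """Estimate f(x) from samples (X[i], Y[i]) by inverse-distance weighting:
    sum{ (d(x,x_i)+r)^-p * y_i } / sum{ (d(x,x_i)+r)^-p }."""
    d = np.linalg.norm(X - x, axis=1)   # d(x, x_i) for every sample point
    w = (d + r) ** (-p)
    return np.sum(w * Y) / np.sum(w)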
David
2002-11-26 04:26:35
Still can't download the spreadsheets - I can see the directory, but IE says that they cannot be downloaded.

Can you check the permissions?
John Kitchen
2002-11-26 07:35:18
This posting will appeal to statisticians, amateur or otherwise.  roll eyes

At the link below, I have posted an Excel spreadsheet containing the results of eight designs re-simulated a total of 208 times.  One design was recalculated 96 times.  About 540 hours of CPU time is represented.  razz

Six of the designs were from the upper end of the best1000 list, the other two were personal bests from two different users.  Therefore, all original yields could be expected to be on the high side for the design.

Click or right click HERE

Have fun, John
Chuck Coleman
2002-11-26 16:01:17
David:

It looks like you have some sort of problem with IE.  I'm afraid I can't help you there.

Stephen:

What are your choices for d and r?  Why did you choose p = 2, besides computational convenience?  (I don't think p = 2 is necessarily bad, but I'd like to see some justification.) And over what neighborhood of x do you estimate E[f(x)]?  My proposal to compute unweighted averages on hypercubes is equivalent to p = 0 and has the advantage of being unaffected by heteroscedasticity (unequal variances).  Given that the simulations have varying numbers of observations, how are you going to account for heteroscedasticity in your weighted averages?  We can assume that each particle is independent and identically distributed (i.i.d.), so that the variance in a small enough neighborhood is inversely proportional to the number of particles.

I've also noticed a possible problem with bias.  The number of particles in a simulation is not constant: what is your stopping rule?  For example, I have a simulation with near-zero efficiency using ~35,000 particles, while all of the best250 have over 58,000 particles, sometimes many more.  It looks like simulations with low efficiencies are cut short, thereby biasing the estimated efficiency downwards.  For example, fix the number of particles and assume that the true efficiency is 3%. Suppose that a simulation is stopped halfway because its realized efficiency is 1%. If it were to continue to the end, the expected efficiency would be (1 + 3)%/2 = 2%. Thus, the stopping rule induces a bias of -1%. That is, using the simulation to estimate efficiency in a statistically sound manner would cause the true efficiency to be underestimated.  Bias runs in the opposite direction if "rogue" highly efficient simulations are discarded.
I would also appreciate a reference on the smoothing method.
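
If the per-run variance really does scale like 1/(number of particles), the natural fix within any one neighborhood is a particle-count-weighted mean.  A two-line sketch (Python; the arrays are hypothetical):

code:
import numpy as np

def weighted_yield(yields, particles):
    # with Var(yield_i) proportional to 1/particles_i, weighting each run by its
    # particle count gives the minimum-variance estimate of the design's true yield
    w = np.asarray(particles, dtype=float)
    return float(np.sum(w * np.asarray(yields, dtype=float)) / np.sum(w))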

John:

Good job on the simulations.  I noticed that the averages were all around 2.95%, so I ran t-tests on all of them.  I constructed t = (2.95 - average) / (standard deviation) and found the largest absolute value of t to be about 0.6.  (Actually, I overstated t, since I used the population standard deviation instead of the sample standard deviation [Excel function STDEVA].) The t-tests indicate that none of the designs has an efficiency statistically different from 2.95%. As I noted above, the unequal numbers of particles generate heteroscedasticity, which would normally invalidate simple t-tests.  However, the range of particle counts is so small compared to the number of particles per simulation that heteroscedasticity here is negligible.

The conclusion I draw from John's work is that there is a large plateau where efficiency is about 2.95%. To proceed correctly, the search has to find any peaks that may lurk on this plateau.  Global optimization generally has a difficult time here.  GA will simply dance around and require a bit of luck to find a peak.  Thus, I recommend either crunching the data to find a peak (using my hypercubes or Stephen's smoother) or the Gauss-Seidel algorithm.  I have written about the latter in the thread "more efficient algorithm possible".

"Sorry, no concluding witticism"
Stephen Brooks
2002-11-26 19:55:52
Your method basically equates to treating the points within any one hypercube as if they all came from the same design, and then producing a more accurate estimate of the transmission for that hypercube's design from the average of these.  (And you can then apply all the usual stuff with confidence intervals if you assume normality.)

The main motivation for me using that smoother is not really statistical, but more that it turns the discrete data into a continuously differentiable function, and I can then use a standard analytical optimiser (e.g. steepest ascent) on that function to find the peak.

Your method also smooths the data, but the smoothing kernel is a hypercube rather than the function k(d)=1/(d^p+r) as with my case.  Probably with p=2 mine is a bit _too_ nonlocal and I could do with using a more compact kernel (it'd be faster to calculate too).

As it happens I get the impression most of the good designs are on this plateau now, and the plateau is the global maximum, so if I took the highest non-rogue design it would be within about 0.05% of the best transmission anyway (my employers are not at the stage of really caring about the last 0.05%).  Currently 4.x is running to test out the new programming-side stuff that's going to be put in and also has successfully demonstrated the multiple servers concept.  I have a new optimisation problem I'd like to set the muon project but I need to do some programming on it first (much of it is done, but more needs doing, and I'm still in university term, so have little time to do it).


"As every 11-year-old kid knows, if you concentrate enough Van-der-Graff generators and expensive special effects in one place, you create a spiral space-time whirly thing, AND an interesting plotline"
John Kitchen
2002-11-27 09:22:27
quote:
Originally posted by Chuck Coleman:
The t-tests indicate that none of the designs has an efficiency statistically different from 2.95%.


When I looked at the results of the recalculation, as a lay statistician at best, I concluded that there was no real difference discernible in the top1000, since the particle count was so low as to result in high variances of run results.

Stephen:

How were you planning on choosing designs to proceed with based on this project?  Also, I see that Chuck has some outstanding questions above and I am hoping they can be looked at.  I suspect he is on a good track here that can materially help this project.

Elsewhere you express concern about having higher particle count workunits, presumably because of the computation time.

I am sure there is room to move upwards in workunit time.  In the early days of seti@home, the average time per workunit was over 24 hours for many users.  So long as credit is given, the users will be happy.  And, as time elapses, the average hardware gets faster.  In the early days of seti@home I had a high-end computer which took about 14 hours per workunit.  Today, my high-end computer crunches a high-yield muon workunit in less than 1.4 hours, so clearly the computation per workunit could rise by an order of magnitude and still fit with past precedent.

The serious contributors to this project leave their machines on 24 hours a day for the project, and personally, so long as my slower computer was making a submission at least every 72 hours, I would be happy.

The side benefit of this is reduced traffic at your end.  It seems that you have no support at this time from the core project people, and reducing the bandwidth needs at your end will allow you to run this as a private project for longer.
John Kitchen
2002-11-27 10:59:03
Stats spreadsheet updated.  Now one design has been redone 114 times.  Click or right click HERE

Sample chart below.  If you look closely, you will see another duck, but it works best if you turn your monitor upside down.  wink
