|I came across this interesting article on PhysOrg:|
The following bit looks like it might have some application to this project:
[i]But new technologies that capture enormous amounts of data – human genome sequencing, Internet transaction tracking, instruments that beam high-resolution images from outer space – have opened opportunities to predict discrete “high dimensional” or “high-D” unknowns. The huge number of combinations of these “high-D” unknowns produces enormous statistical uncertainty. Data has outgrown data analysis.
This discrepancy creates a paradox. Instead of producing more precise predictions about gene activity, shopping habits or the presence of faraway stars, these large data sets are producing more unreliable predictions, given current procedures. That’s because maximum likelihood estimators use data to identify the single most probable solution. But because any one data point swims in an increasingly immense sea, it’s not likely to be representative.[/i]
From what little I understand of the project, it seems each simulation represents a point in n-dimensional space. So a simulation that consists of, say, 48 different parameters, each ranging from 0 to 999, represents a single cell in a 48-dimensional hypercube of 1000^48 = 10^144 individual cells. This is an insanely huge number, and even if we collect a million or even a billion results it's still just a drop in the ocean. All the simulations we run are likely to be surrounded by vast swathes of unexplored space, and are also likely to be atypical in some way, so it is difficult to extrapolate useful data from them.
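To put a number on "drop in the ocean", here is a rough Python sketch of the arithmetic. The 48 parameters and the 0 to 999 range are just the example figures from above, the function name is made up for illustration, and it optimistically assumes every result lands in a different cell:

```python
from fractions import Fraction

def coverage_fraction(n_params, values_per_param, n_results):
    """Return (total cells, best-case fraction of cells visited).

    Assumes every result falls in a distinct cell, so this is an
    upper bound on how much of the grid has been sampled.
    """
    total_cells = values_per_param ** n_params
    return total_cells, Fraction(n_results, total_cells)

# Example figures from the post: 48 parameters, 1000 values each,
# and a (hypothetical) billion collected results.
total_cells, fraction = coverage_fraction(48, 1000, 10**9)

# Number of decimal digits minus one gives the power of ten.
print("cells = 10^%d" % (len(str(total_cells)) - 1))
print("fraction covered =", float(fraction))
```

Even in that best case, a billion results cover about one part in 10^135 of the grid.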
Hopefully the new methods from the guys in this article can be adapted so that we can predict juicy high-yield simulations to run.
|You're right that the space is large, but I'm not sure the analogy from statistics to optimisation problems quite holds in this case.|
In the high-D statistics problem, you have some data, some empirical laws, and a very large space of "possibilities". You can then assign each possibility a likelihood (a kind of "relative" probability of it generating the data if it happened, relative to the other possibilities). So you get a surface of "likelihood" if you plot a graph, and in a way it is a lot like the surface of yield in Muon1: there'll be a very small area where it is high, and possibly local maxima scattered around in a complicated way.
However, in Muon1 we really ARE interested in the maximum! In statistics you just want to pick a point somewhere on the likelihood surface that arises out of your incomplete data, as your "best guess" for the true situation. You can do that by taking the maximum likelihood (which, incidentally, assumes you've already solved the associated high-D optimisation problem). I think these guys pointed out that if there's a maximum likelihood over here and a near-maximum a long way away, just picking one of them is unrepresentative. In reality it's a complex situation, and a single POINT guess at where something is never properly describes our real uncertainties.
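A toy illustration of the "maximum over here, near-maximum a long way away" situation, as a minimal Python sketch. The likelihood function is invented purely for the example (two bumps of nearly equal height at -5 and +5); it is not taken from the article or from Muon1:

```python
import math

def likelihood(theta):
    # Made-up 1-D likelihood: a peak of height 1.0 at theta = -5
    # and a slightly lower peak of height 0.95 at theta = +5.
    return math.exp(-0.5 * (theta + 5) ** 2) + 0.95 * math.exp(-0.5 * (theta - 5) ** 2)

# Evaluate on a grid from -10 to 10 in steps of 0.01.
grid = [i * 0.01 for i in range(-1000, 1001)]
values = [likelihood(t) for t in grid]

# Maximum likelihood estimate: the grid point with the highest value.
theta_hat = grid[values.index(max(values))]

# Share of the total likelihood "mass" lying within 2.5 of the estimate.
total = sum(values)
near = sum(v for t, v in zip(grid, values) if abs(t - theta_hat) < 2.5)
share = near / total

print("MLE:", theta_hat)
print("share of likelihood near the MLE:", round(share, 2))
```

The point estimate sits on the left-hand peak, but only about half of the likelihood is anywhere near it; the other half is ten units away, which is the sense in which a single point is unrepresentative.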
However, maximum likelihood (and other estimators) can be applied to the problem of fitting a "model" through the data that we have collected. The likelihood space is then in the model parameters rather than the actual optimisation parameters. I doubt there's any particular problem with using the standard estimators (MLE) on these model-fitting problems, because last time I looked at one, the likelihood surface was just a paraboloid with a single known maximum. Using this maximum doesn't seem a problem. I think what they are talking about is nonlinear models, and I mean nonlinear in the model parameters, not the optimisation parameters (any model more useful than gradient estimation is nonlinear in the optimisation parameters).
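For what it's worth, here is a small Python sketch of the "paraboloid with a single known maximum" case. It assumes a straight-line model y = a + b*x with Gaussian noise of known width, which is my choice for illustration (not necessarily the model anyone uses for Muon1), and the data points are invented. Under those assumptions the log-likelihood is quadratic in (a, b), so the maximum is just the ordinary least-squares solution:

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = a + b*x.

    With Gaussian noise of fixed width this is also the MLE.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def log_likelihood(a, b, xs, ys, sigma=1.0):
    # Gaussian log-likelihood up to an additive constant:
    # a downward-opening paraboloid in the model parameters (a, b).
    return -sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (2 * sigma ** 2)

# Invented data, roughly following y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

a_hat, b_hat = fit_line(xs, ys)
best = log_likelihood(a_hat, b_hat, xs, ys)

# Stepping away from (a_hat, b_hat) in any direction lowers the likelihood.
for da, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1), (0.1, -0.1)]:
    print(da, db, log_likelihood(a_hat + da, b_hat + db, xs, ys) < best)
```

Because the model is linear in a and b, there is only one peak and no search is needed. A model that is nonlinear in its own parameters loses that guarantee.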