|Since I started running the Linac900Ext1Xc2_nosample lattice, my client hangs at least once a day.|
Specifically, the command-line client hangs, usually when rechecking a result. The cursor stays after the Mpts and the numbers don't move at all. The odd thing is that there's no error message, popup or anything else. It seems my client hung at some point in the last 24 hours without me noticing.
I just started the background client to see if it shows that behavior too.
Platform: Windows 7 Ultimate, 64-bit
Processor: AMD 1090T, not overclocked
|The background client seems to have stopped at 4.30 am this morning, after running for just 2 hours. Seems like it's idling or something, processor use of the client stays at around 17%.|
I'll try a fresh install of the client in another folder and see what happens.
|Just checked and I'm running v4.46 here on that lattice, will make sure I'm running it at work too. At the moment everything looks normal. When it hangs, how many threads is it running and is the RAM usage in Task Manager abnormal (high, or increasing)?|
|It seems it hung after 2 hours again. It's 1 client running on 6 threads, memory usage is at 20232 kB and not increasing (staying steady at this number), CPU usage ~17%.|
There was a power outage last Saturday. Everything else is running fine, but I will check my hard disk for errors just to be sure. I will also try running muon from one of my other 2 hard drives and see what happens.
[Edited by Xanathorn at 2014-01-21 23:36:04]
|What I meant was, when it hangs, how many threads show in task manager? (Not how many did you configure it for). Your 17% number seems to indicate it's one core out of six spinning in an infinite loop. Could be some rare non-terminating condition, possibly to do with threading.|
Oh, you could try running it with 1 thread and see if that hangs? That would test if this is to do with the threading code. I'm afraid I'm snowed in with a rather slow netbook here so can't do a lot quickly.
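The 17% reading can be checked with simple arithmetic: on a six-core CPU, one fully busy core is one sixth of the total. A minimal sketch in C follows; the function name `busy_core_share` is made up for illustration and is not part of the client.

```c
#include <assert.h>

/* Hypothetical helper: percentage of total CPU shown in Task Manager
   when n_busy cores are fully loaded on a machine with n_cores cores. */
double busy_core_share(int n_busy, int n_cores)
{
    assert(n_cores > 0 && n_busy >= 0);
    return 100.0 * (double)n_busy / (double)n_cores;
}
```

For one busy core out of six this gives about 16.7%, which Task Manager would round to the ~17% reported.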
|I will make 6 instances of the muon client, each using one thread, and check which one hangs first. Maybe that will point us in some direction.|
|Looks like this lattice does trigger the hangs again (like the old ones did for me). Maybe a fix would solve the other problem as well...|
[Edited by Zerberus at 2014-01-23 16:11:21]
|Now, after 36 hours, all 6 instances are still running, each having sent about 200 results.|
|I've been running this multithreaded on the Xeon (including the 1Xc2_nosample lattice) and it's still producing results. So not triggered on every machine.|
|After 6 days all 6 single-threaded instances are still running nonstop. I will try later today with 1 client using all 6 threads and see if the problem returns.|
If it does, the problem is with multithreading on this lattice on my machine. I didn't have any of these problems in the past; last week was the first time it occurred.
|Just found some very weird behaviour - malloc(79) is returning NULL. I'm not out of memory, I can't find a double-free (or free of unallocated block). There must be some sort of buffer overrun that is corrupting what malloc is doing. This is on the MinGW compiler not LCC!|
Resorted to trying to find the minimal input lattice that causes the error. Seems to be something to do with the "particles leaving trails" feature being turned on.
[edit2] Hmm. No, that feature is just crashing because it calls malloc a lot. The other bugs were happening before trails were even used. Now inserting a random malloc test in various places to figure out where it started being corrupt.
[edit3] Ha! Inserting the random memory allocation tests prevents the bug from happening where it otherwise would appear.
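A rough sketch of what such an allocation test could look like in C. The names `heap_probe` and `HEAP_PROBE` are hypothetical, not taken from the Muon1 source; the size 79 just mirrors the failing call mentioned above. As edit3 shows, a probe also changes the heap layout, so it can hide the bug it is looking for.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical probe: try a small allocation and report where it failed.
   Returns 1 if malloc succeeded, 0 if it returned NULL. */
int heap_probe(const char *file, int line)
{
    assert(file != NULL);
    void *p = malloc(79);
    if (p == NULL) {
        fprintf(stderr, "heap probe failed at %s:%d\n", file, line);
        return 0;
    }
    free(p);
    return 1;
}

/* Drop HEAP_PROBE() between suspect statements to narrow down where
   the allocator first starts misbehaving. */
#define HEAP_PROBE() heap_probe(__FILE__, __LINE__)
```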
[edit4] Found a place where it would try to allocate zero bytes (call malloc(0)) if there aren't any solenoids, plus fixed similar things elsewhere, but that's not fixed the crash.
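For context, C leaves the result of malloc(0) implementation-defined: it may return NULL or a unique pointer that must not be dereferenced, so a NULL check alone cannot tell "empty" from "out of memory". One possible guard is sketched below; `alloc_array` is a made-up helper, not the fix that went into the client.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate room for count elements of elem_size bytes.
   A zero-sized request is rounded up to 1 byte, so NULL always means failure. */
void *alloc_array(size_t count, size_t elem_size)
{
    if (count == 0 || elem_size == 0) {
        return malloc(1);           /* e.g. a lattice with no solenoids */
    }
    if (count > SIZE_MAX / elem_size) {
        return NULL;                /* count * elem_size would overflow */
    }
    return malloc(count * elem_size);
}
```

The returned pointer can be passed to free() in every case, including the empty one.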
[edit5] Finding a whole load of corner cases. The "minimal" bad input was actually crashing because it had zero beamline elements, just a particle. I fixed the code for that case. The regular input with a beamline appeared to crash right after I'd cast a "const Thing *" to (Thing *). I don't even know why I did that (it may have been left over from some early C++ port), I just removed the "const" from the function input and removed the cast too.
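A small sketch of the const cleanup described in edit5. The struct and function names here are invented stand-ins, since the real `Thing` is not shown. Casting away const is legal by itself, but writing through the result is undefined behavior if the object was originally defined const, so a function that modifies its argument is better off taking a plain pointer. Whether that cast was the actual cause of the crash is not established above.

```c
#include <assert.h>

/* Hypothetical stand-in for the real structure. */
typedef struct Thing { double length; } Thing;

/* Read-only access: const stays in the signature, no cast needed. */
double thing_length(const Thing *t)
{
    assert(t != 0);
    return t->length;
}

/* Mutating access: the parameter is non-const, so the old
   "(Thing *)" cast from "const Thing *" disappears entirely. */
void thing_scale(Thing *t, double factor)
{
    assert(t != 0);
    t->length *= factor;
}
```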
So that took 4 hours to find. I also noticed I'd put the -w switch on to suppress warnings! That's because it warns whenever I convert between int and double types, and I do that intentionally a lot. On the other hand, turning warnings on has turned up a few other interesting things.
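One way to keep warnings switched on while still converting between int and double on purpose is to write the conversions as explicit casts, which GCC-family compilers do not flag with their conversion warnings. A minimal sketch with made-up function names follows; which warnings a given MinGW setup actually emits depends on the flags in use.

```c
#include <assert.h>

/* Intentional double -> int: truncates toward zero, and the cast says so. */
int cell_index(double position, double cell_size)
{
    assert(cell_size > 0.0);
    return (int)(position / cell_size);
}

/* Intentional int -> double: avoids integer division. */
double mean_count(int total, int n)
{
    assert(n > 0);
    return (double)total / (double)n;
}
```

With the intentional conversions marked this way, any warnings that remain are more likely to be the interesting kind.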
|I still have hope that the hangs/crashes with the older lattices actually have the same cause... Your detective work is most appreciated.|
|Thanks indeed. I manually redownloaded the lattice from the website and have been running multithreaded for a week now; the client hasn't hung a single time yet.|
|It would be best if I provided you a more recent development build to test with, since a few things have changed since v4.46 (different compiler, new features etc.). Here is the new EXE that you can test with. Unfortunately after all the debugging I'm still getting some errors in that area of the code!! Next time the crash (during interpreting the lattice) comes up repeatedly I'll have to work out some other way of tracking it down.|
|Just noticed that I have a Linac900Ext5Xc2 quarantined result that is hanging on the 4.46 command-line client. I believe that it has already run twice. Not sure if I should just delete the queuecli.txt to resume processing?|
|That's probably the best way of getting your copy of Muon1 operational again, since it will probably generate a different design next time.|
|Deleting the queuecli.txt didn't help; it always hangs around 44.7ns.|
|I did finally get it working again; I had to do a system restore. It appears something that I installed was causing the command-line client to hang after running for about 44ns.|
|That's very strange - any idea what it is?|
|I believe it may have been Bitdefender.|
|Another one not to use is "Avast", which gives false positives on about half the things I compile these days.|