I roped Rolf into coming over and adding his eyes to the problem. After much discussion we couldn't come up with any reasonable explanation for the problems we've been seeing, other than daqd just needing a lot more resources than it did before. He said he had some old Sun Fire X4600s from which we could pilfer memory. I went over to Downs, ripped all the CPU/memory cards out of one of his machines, and stuffed them into fb:
fb now has 8 CPUs and 16G of RAM
Unfortunately, this is still not enough. Or at least it didn't solve the problem; daqd is showing the same instabilities, falling over a couple of minutes after I turn on trend frame writing. As always, before daqd fails it starts spitting out the following to the logs:
[Thu Jul 9 16:37:09 2015] main profiler warning: 0 empty blocks in the buffer
followed by lines like:
[Thu Jul 9 16:37:27 2015] GPS MISS dcu 44 (ASX); dcu_gps=1120520264 gps=1120519812
right before it dies.
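To put a number on the mismatch in the GPS MISS line, here is a minimal sketch that pulls the two timestamps out of the quoted log line and subtracts them. The regular expression is inferred from this single example, not taken from the daqd source, and the function name `parse_gps_miss` is made up for illustration. What `dcu_gps` and `gps` each refer to inside daqd is not established here; the sketch only computes their difference.

```python
import re

# Log line copied verbatim from the daqd output quoted above.
LINE = ("[Thu Jul 9 16:37:27 2015] GPS MISS dcu 44 (ASX); "
        "dcu_gps=1120520264 gps=1120519812")

# Pattern inferred from this one example line (an assumption, not the daqd format spec).
GPS_MISS = re.compile(r"GPS MISS dcu (\d+) \((\w+)\); dcu_gps=(\d+) gps=(\d+)")


def parse_gps_miss(line):
    """Return the fields of a GPS MISS line, or None if the line does not match."""
    m = GPS_MISS.search(line)
    if m is None:
        return None
    dcu, name, dcu_gps, gps = m.groups()
    return {
        "dcu": int(dcu),
        "name": name,
        "dcu_gps": int(dcu_gps),
        "gps": int(gps),
        # Difference between the two reported GPS seconds.
        "offset": int(dcu_gps) - int(gps),
    }


info = parse_gps_miss(LINE)
print(info["name"], info["offset"])  # ASX 452
```

For the line above the two values differ by 452 seconds, i.e. about seven and a half minutes, which is far more than a single missed cycle.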
I'm no longer convinced that this is a resource issue, though, judging by the resource usage right before the crash:
top - 16:47:32 up 48 min, 5 users, load average: 0.91, 0.62, 0.61
Tasks: 2 total, 0 running, 2 sleeping, 0 stopped, 0 zombie
Cpu(s): 8.9%us, 0.9%sy, 0.0%ni, 89.1%id, 0.9%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 15952104k total, 13063468k used, 2888636k free, 138648k buffers
Swap: 1023996k total, 0k used, 1023996k free, 7672292k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12016 controls 20 0 8098m 4.4g 104m S 106 29.1 6:45.79 daqd
4953 controls 20 0 53580 6092 5096 S 0 0.0 0:00.04 nds
Load average less than 1 per CPU, plenty of free memory (~3G free, 0 swap), no waiting for IO (0.9%wa), etc. daqd is utilizing lots of threads, which should be spread across many CPUs, so even the >100% CPU should be ok. I'm at a loss...
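The headroom argument can be checked with a few lines of arithmetic. This is only a sketch: the numbers are typed in by hand from the `top` snapshot above rather than read from the machine, the CPU count is the 8 quoted earlier, and treating buffers plus page cache as reclaimable is the usual Linux rule of thumb, not something measured on fb.

```python
# Values copied by hand from the top snapshot above (k = KiB).
N_CPU = 8                # CPU count quoted earlier in this entry
LOAD_1MIN = 0.91
MEM_FREE_K = 2888636
BUFFERS_K = 138648
CACHED_K = 7672292
SWAP_USED_K = 0

KIB_PER_GIB = 1024 ** 2

# Load normalized per CPU; values well below 1 mean the run queue is not saturated.
load_per_cpu = LOAD_1MIN / N_CPU

# Strictly free memory, and free memory plus buffers and page cache,
# which the kernel can normally reclaim on demand (rule-of-thumb assumption).
free_gib = MEM_FREE_K / KIB_PER_GIB
available_gib = (MEM_FREE_K + BUFFERS_K + CACHED_K) / KIB_PER_GIB

print(f"load per CPU: {load_per_cpu:.2f}")
print(f"free: {free_gib:.2f} GiB, free+buffers+cached: {available_gib:.2f} GiB")
print(f"swap in use: {SWAP_USED_K} KiB")
```

With these inputs the per-CPU load is about 0.11 and roughly 10 GiB is free or reclaimable, which is consistent with the conclusion that the machine is not obviously starved at the moment of the crash.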