I spent this afternoon trying to debug fb1, with very little to show for it. We're back to running from fb.
The first thing I did was recompile EPICS from source, so that all the libraries needed by daqd were built for the system at hand. I compiled epics-3.14-12-2_long from source and installed it at /opt/rtapps/epics on local disk, not on the /opt/rtapps network mount. I then recompiled daqd against that, and against framecpp, gds, etc. from the LSCSoft packages. So everything has now been compiled for this version of the OS. The compilation went smoothly.
There are two things that I see while running this new daqd on fb1:
The mx stream connection between the front ends and daqd is flaky. Everything will run fine for a while, then spontaneously one or all of the mx_stream processes on the front ends will die. It appears more likely that all of the mx_stream processes die at the same time. It's unclear whether this is some sort of chain reaction, or whether something in daqd or in the network itself is causing them all to die together. The behavior is independent of whether we're using multiple mx "end points" (i.e. a different one for each front end, with separate receiver threads in daqd) or just a single one (all front ends connecting to a single mx receiver thread in daqd).
Frequently daqd will recover from this. The monit processes on the front ends restart the mx_stream processes and everything comes back. However, occasionally, possibly when the mx_streams do not recover fast enough (which seems to be related to how frequently the receiver threads in daqd can clear themselves), daqd will start to choke and will start spitting out the "empty blocks" messages that are the harbinger of doom:
Aborted 2 send requests due to remote peer 00:30:48:be:11:5d (c1iscex:0) disconnected
00:30:48:d6:11:17 (c1iscey:0) disconnected
mx_wait failed in rcvr eid=005, reqn=182; wait did not complete; status code is Remote endpoint is closed
disconnected from the sender on endpoint 005
mx_wait failed in rcvr eid=001, reqn=24; wait did not complete; status code is Remote endpoint is closed
disconnected from the sender on endpoint 001
[Wed Dec 9 18:40:14 2015] main profiler warning: 1 empty blocks in the buffer
[Wed Dec 9 18:40:15 2015] main profiler warning: 0 empty blocks in the buffer
[Wed Dec 9 18:40:16 2015] main profiler warning: 0 empty blocks in the buffer
My suspicion is that this type of failure is tied to the mx stream failures, so we should be looking at the mx connections and the network to solve this problem.
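One rough way to check that suspicion against a captured daqd log is sketched below. It is not part of daqd or any existing tooling: the function name correlate, the window parameter, and the choice to skip warnings that report 0 empty blocks are all my own assumptions, and the regular expressions are matched only against the message formats pasted above. Since the mx messages carry no timestamps, proximity is measured in log lines rather than seconds.

```python
import re

# Patterns taken from the log excerpt above; other daqd versions may differ.
EMPTY_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] main profiler warning: (?P<n>\d+) empty blocks in the buffer"
)
DISCONNECT_RE = re.compile(r"disconnected from the sender on endpoint (?P<eid>\d+)")

SAMPLE_LOG = """\
mx_wait failed in rcvr eid=005, reqn=182; wait did not complete; status code is Remote endpoint is closed
disconnected from the sender on endpoint 005
mx_wait failed in rcvr eid=001, reqn=24; wait did not complete; status code is Remote endpoint is closed
disconnected from the sender on endpoint 001
[Wed Dec 9 18:40:14 2015] main profiler warning: 1 empty blocks in the buffer
[Wed Dec 9 18:40:15 2015] main profiler warning: 0 empty blocks in the buffer
"""


def correlate(lines, window=20):
    """Pair each nonzero "empty blocks" warning with nearby mx disconnects.

    Returns a list of (timestamp, empty_block_count, endpoints), where
    endpoints holds the ids of disconnects seen within `window` lines
    before the warning. `window` is a hypothetical tuning knob.
    """
    disconnects = []  # (line index, endpoint id) for every disconnect seen
    events = []
    for i, line in enumerate(lines):
        m = DISCONNECT_RE.search(line)
        if m:
            disconnects.append((i, m.group("eid")))
            continue
        m = EMPTY_RE.match(line)
        if m and int(m.group("n")) > 0:  # assumption: ignore zero-count warnings
            nearby = [eid for j, eid in disconnects if i - j <= window]
            events.append((m.group("ts"), int(m.group("n")), nearby))
    return events


if __name__ == "__main__":
    for ts, count, endpoints in correlate(SAMPLE_LOG.splitlines()):
        print(ts, count, endpoints)
```

If warnings with an empty endpoint list turn up in a real log, that would argue against the mx connection being the only trigger.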
There's possibly a separate issue associated with writing the second or minute trend files to disk. With fair regularity daqd will die soon after it starts to write out the trend frames, producing similar "empty blocks" messages.