Problem:
Test points were unavailable last night, even after rebooting c1sus and restarting the daqd process on the frame builder.
Cause:
It's unclear at this time. My guess is flaky fb and mx_stream code. At the moment, daqd often requires several restarts because it segfaults within a minute or two of being started.
What we did (aka treating the symptoms):
We rebooted the frame builder machine. I also added the daqd and nds processes to the inittab. Now when these die, they will automatically be restarted.
Steps to add to the inittab on fb
0) If not on fb, ssh -X fb
1) cd /etc/
2) sudo vi inittab or sudo emacs inittab
3) Add a line like: id:runlevels:action:process
The id is a unique identifier of 1-4 characters (letters and numbers) for the process
runlevels lists the Linux run levels in which the process will be started. 345 will cover the normal cases
action is what to do with the process. respawn makes it run at startup and also restarts it every time it dies.
process is the command you want to run
See "man inittab" for more details
In this case we added
daq:345:respawn:/opt/rtcds/caltech/c1/target/fb/daqd -c /opt/rtcds/caltech/c1/target/fb/daqdrc > /opt/rtcds/caltech/c1/target/fb/daqd.log
nds:345:respawn:/opt/rtcds/caltech/c1/target/fb/nds pipe > /opt/rtcds/caltech/c1/target/fb/nds.log
4) Save.
5) Run "sudo /sbin/telinit q". This forces init to re-examine the inittab file.
daqd and nds will now automatically restart when they die.
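
Before running telinit, it can help to sanity-check a new entry against the id:runlevels:action:process format described above. The following is a minimal sketch, not part of what we actually ran: the function name check_inittab_entry is hypothetical, and it only accepts run levels 0-6 and S and the action names listed in "man inittab".

```shell
#!/bin/sh
# Hypothetical helper: sanity-check one inittab entry of the form
#   id:runlevels:action:process
# Prints "ok: ..." and returns 0 if the entry looks well-formed,
# otherwise prints a reason and returns 1.
check_inittab_entry() {
    entry=$1
    case $entry in
        *:*:*:*) ;;
        *) echo "expected id:runlevels:action:process"; return 1 ;;
    esac

    # Split on the first three colons only; the process field may contain more.
    id=${entry%%:*};        rest=${entry#*:}
    runlevels=${rest%%:*};  rest=${rest#*:}
    action=${rest%%:*};     process=${rest#*:}

    case $id in
        ''|?????*) echo "id must be 1-4 characters"; return 1 ;;
    esac
    # Simplification: only run levels 0-6 and S are accepted here.
    case $runlevels in
        *[!0-6Ss]*) echo "bad runlevels: $runlevels"; return 1 ;;
    esac
    case $action in
        respawn|wait|once|boot|bootwait|off|ondemand|initdefault|sysinit) ;;
        powerwait|powerfail|powerokwait|powerfailnow|ctrlaltdel|kbrequest) ;;
        *) echo "unknown action: $action"; return 1 ;;
    esac
    # initdefault entries legitimately have an empty process field.
    if [ -z "$process" ] && [ "$action" != initdefault ]; then
        echo "empty process field"; return 1
    fi
    echo "ok: id=$id runlevels=$runlevels action=$action"
}

# Check the daqd entry we added on fb.
check_inittab_entry 'daq:345:respawn:/opt/rtcds/caltech/c1/target/fb/daqd -c /opt/rtcds/caltech/c1/target/fb/daqdrc > /opt/rtcds/caltech/c1/target/fb/daqd.log'
```

This only checks the shape of the line; it does not confirm that the binary exists or that init has picked up the change.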
Continuing issues:
When the frame builder dies, the mx_stream processes on the front ends die as well. These need to be restarted manually at the moment by using "sudo /etc/restart_streams" while on c1sus.
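
Until this is fixed properly, the manual restart could be wrapped in a small check. This is only a sketch under assumptions, not something installed on the front ends: the helper name restart_if_missing is hypothetical, and it assumes the process shows up under the exact name "mx_stream" in pgrep.

```shell
#!/bin/sh
# Hypothetical helper: run a restart command only if no process with
# the given exact name is currently running.
# Usage: restart_if_missing <process-name> <restart command...>
restart_if_missing() {
    name=$1
    shift
    if pgrep -x "$name" >/dev/null 2>&1; then
        echo "$name is running"
    else
        echo "$name is not running, restarting"
        "$@"
    fi
}

# On c1sus this would be called as (not executed here):
#   restart_if_missing mx_stream sudo /etc/restart_streams
```

The restart command itself is the same "sudo /etc/restart_streams" as above; the wrapper only avoids restarting streams that are still alive.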
The framebuilder code shouldn't be this flaky.