I've been trying to figure out why daqd keeps crashing, but nothing is fixed yet.
I commented out the line in /etc/inittab that runs daqd automatically, so I could run it manually. Each time I run it ( with ./daqd -c ./daqdrc while in c1/target/fb), it churns along fine for a little while, but eventually spits out something like:
[Thu Oct 16 12:07:23 2014] main profiler warning: 1 empty blocks in the buffer
[Thu Oct 16 12:07:24 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 12:07:25 2014] main profiler warning: 0 empty blocks in the buffer
GPS time jumped from 1097521658 to 1097521660
Segmentation fault
Or:
[Thu Oct 16 11:43:54 2014] main profiler warning: 1 empty blocks in the buffer
[Thu Oct 16 11:43:55 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:43:56 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:43:57 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:43:58 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:43:59 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:44:00 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:44:01 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:44:02 2014] main profiler warning: 0 empty blocks in the buffer
GPS time jumped from 1097520250 to 1097520257
FATAL: exception not rethrown
Aborted
I looked for time disagreements between the FB and the frontends, but they all seem fine. Running ntpdate only corrected things by 5ms. However, looking through /var/log/messages on FB, I found that ntp claims to have corrected the FB's time by ~111600 seconds (~31 hours) when I rebooted it on Monday.
Maybe this has something to do with the timing that the FB is getting? The FE IOPs seem happy with their sync status, but I'm not personally currently aware of how the FB timing is set up.
Addendum:
On Monday, Jamie suggested checking out the situation with FB's RAID. Searching the elog for "empty blocks in the buffer" also brought up posts that mentioned problems with the RAID.
I went to the JetStor RAID web interface at http://192.168.113.119, and it reports everything as healthy; no major errors in the log. Looking at the SMART status of a few of the drives shows nothing out of the ordinary. The RAID is not mounted in read-only mode either, as was the problem mentioned in previous elogs. |