Entry  Thu Oct 16 03:18:48 2014, Jenne, Update, CDS, Daqd segfaulting again 
    Reply  Thu Oct 16 12:22:43 2014, ericq, Update, CDS, Daqd segfaulting again 
       Reply  Fri Oct 17 15:17:31 2014, jamie, Update, CDS, Daqd "fixed"? 
          Reply  Fri Oct 17 16:54:11 2014, jamie, Update, CDS, Daqd "fixed"? 
             Reply  Thu Oct 23 01:39:34 2014, Jenne, Update, CDS, Daqd "fixed"? 
Message ID: 10617     Entry time: Thu Oct 16 12:22:43 2014     In reply to: 10616     Reply to this: 10623
Author: ericq 
Type: Update 
Category: CDS 
Subject: Daqd segfaulting again 

I've been trying to figure out why daqd keeps crashing, but nothing is fixed yet. 

I commented out the line in /etc/inittab that runs daqd automatically, so I could run it manually. Each time I run it (with ./daqd -c ./daqdrc while in c1/target/fb), it churns along fine for a little while, but eventually spits out something like:

[Thu Oct 16 12:07:23 2014] main profiler warning: 1 empty blocks in the buffer
[Thu Oct 16 12:07:24 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 12:07:25 2014] main profiler warning: 0 empty blocks in the buffer
GPS time jumped from 1097521658 to 1097521660
Segmentation fault
 
Or:
 
[Thu Oct 16 11:43:54 2014] main profiler warning: 1 empty blocks in the buffer
[Thu Oct 16 11:43:55 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:43:56 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:43:57 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:43:58 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:43:59 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:44:00 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:44:01 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:44:02 2014] main profiler warning: 0 empty blocks in the buffer
GPS time jumped from 1097520250 to 1097520257
FATAL: exception not rethrown
Aborted
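
For reference, the manual-run procedure was roughly the following. This is a sketch of what is described above, not a transcript: the telinit and pkill steps are my assumptions about how the respawn was stopped, and the cd path is just the relative one quoted above.

# on fb, after commenting out the daqd line in /etc/inittab:
sudo telinit q      # assumed: make init re-read inittab so it stops respawning daqd
pkill daqd          # assumed: stop any instance that is still running
cd c1/target/fb     # relative path as quoted above; full prefix omitted as in the log
./daqd -c ./daqdrc  # run in the foreground and watch for the warnings/segfault above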

I looked for time disagreements between the FB and the frontends, but they all seem fine. Running ntpdate only corrected things by 5ms. However, looking through /var/log/messages on FB, I found that ntp claims to have corrected the FB's time by ~111600 seconds (~31 hours) when I rebooted it on Monday.
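
The time checks amounted to commands of roughly this form (the NTP server name is a placeholder, and this is the generic shape of the check, not a transcript of what was run):

ntpdate -q pool.ntp.org        # query the offset without stepping the clock; fb was only ~5 ms off
grep -i ntp /var/log/messages  # on fb, this is where the ~111600 s correction at Monday's reboot shows up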

Maybe this has something to do with the timing signal that the FB is getting? The FE IOPs seem happy with their sync status, but I'm not currently sure how the FB's timing is set up. 


Addendum:

On Monday, Jamie suggested checking out the situation with FB's RAID. Searching the elog for "empty blocks in the buffer" also brought up posts that mentioned problems with the RAID. 

I went to the JetStor RAID web interface at http://192.168.113.119, and it reports everything as healthy; no major errors in the log. Looking at the SMART status of a few of the drives shows nothing out of the ordinary. The RAID is not mounted in read-only mode either, as was the problem mentioned in previous elogs. 
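
The read-only check on the fb side is a simple mount inspection; a sketch, assuming the RAID's mount point contains "frames" (the SMART status itself was read from the JetStor web interface, not from the host):

grep frames /proc/mounts   # the option list should start with "rw", not "ro"
mount | grep frames        # same information in a friendlier format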
