Message ID: 16853     Entry time: Sat May 14 08:36:03 2022
Author: Chris 
Type: Update 
Category: DAQ 
Subject: DAQ troubleshooting 

I heard a rumor about a DAQ problem at the 40m.

To investigate, I tried retrieving data from some channels under C1:SUS-AS1 on the c1sus2 front end. DQ channels worked fine, but testpoint channels did not. This pointed to an issue involving communication with awgtpman. However, AWG excitations did work, so the issue seemed to be specific to the communication between daqd and awgtpman.

daqd logs were complaining of an error in the tpRequest function: error code -3/couldn't create test point handle. (Confusingly, part of the error message was buffered somewhere, and would only print after a subsequent connection to daqd was made.) This message signifies some kind of failure in setting up the RPC connection to awgtpman. A further error string is available from the system to explain the cause of the failure, but daqd does not provide it. So we have to guess...

One of the reasons an RPC connection can fail is if the server name cannot be resolved. Indeed, address lookup for c1sus2 from fb1 was broken:

$ host c1sus2
Host c1sus2 not found: 3(NXDOMAIN)

In /etc/resolv.conf on fb1 there was the following line:

search martian.113.168.192.in-addr.arpa

Changing this line to "search martian" got address lookup on fb1 working:

$ host c1sus2
c1sus2.martian has address 192.168.113.87
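
To illustrate why the search line mattered, here is a minimal sketch of how a resolver expands a short hostname using the search list, following the behavior described in resolv.conf(5). It is a simplified model, not the actual resolver code: the function names parse_search_domains and candidate_names are hypothetical, and trailing dots on absolute names are omitted.

```python
def parse_search_domains(resolv_conf_text):
    """Return the domains from the last 'search' line of resolv.conf-style text."""
    domains = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(("#", ";")):
            continue  # skip blank lines and comments
        fields = stripped.split()
        if fields[0] == "search":
            domains = fields[1:]  # a later 'search' line replaces earlier ones
    return domains


def candidate_names(name, search_domains, ndots=1):
    """List the names tried for `name`, in order (simplified model)."""
    if name.endswith("."):
        return [name.rstrip(".")]  # already absolute: no search list applied
    searched = [f"{name}.{domain}" for domain in search_domains]
    if name.count(".") >= ndots:
        return [name] + searched  # enough dots: try the name as given first
    return searched + [name]      # short name: search domains come first


# The two search lines quoted in this entry.
old = parse_search_domains("search martian.113.168.192.in-addr.arpa\n")
new = parse_search_domains("search martian\n")
print(candidate_names("c1sus2", old))
print(candidate_names("c1sus2", new))
```

Under this model, the old search line makes the resolver ask for c1sus2 under a reverse-lookup zone name, which presumably has no such record, while the corrected line yields c1sus2.martian, the name the nameserver answered for above.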

But testpoints still could not be retrieved from c1sus2, even after a daqd restart.

In /etc/hosts on fb1 I found the following:

192.168.113.92  c1sus2

Changing the hardcoded address to the value returned by the nameserver (192.168.113.87) fixed the problem.
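
A stale hosts entry like this one could be caught with a small consistency check. The following is only a sketch, not a tool that exists on fb1: the helper names parse_hosts, dns_address, and find_mismatches are hypothetical. It relies on the host utility querying DNS directly rather than consulting /etc/hosts, and on the "has address" output format shown above.

```python
import subprocess


def parse_hosts(text):
    """Return {name: ip} from /etc/hosts-style text."""
    entries = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments
        if len(fields) < 2:
            continue
        ip, names = fields[0], fields[1:]
        for name in names:
            entries.setdefault(name, ip)  # keep the first entry for each name
    return entries


def dns_address(name):
    """Ask DNS for an IPv4 address via `host`; return None on any failure."""
    try:
        result = subprocess.run(
            ["host", "-t", "A", name],
            capture_output=True, text=True, timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    for line in result.stdout.splitlines():
        if " has address " in line:
            return line.rsplit(" ", 1)[1]
    return None


def find_mismatches(hosts_entries, lookup=dns_address):
    """Return (name, hosts_ip, dns_ip) for names where DNS disagrees."""
    mismatches = []
    for name, hosts_ip in sorted(hosts_entries.items()):
        dns_ip = lookup(name)
        if dns_ip is not None and dns_ip != hosts_ip:
            mismatches.append((name, hosts_ip, dns_ip))
    return mismatches


# Demo with the values from this entry; a dict stands in for the nameserver.
hosts = parse_hosts("192.168.113.92  c1sus2\n")
fake_dns = {"c1sus2": "192.168.113.87"}
print(find_mismatches(hosts, fake_dns.get))
```

On a real machine, one would presumably read /etc/hosts and call find_mismatches(parse_hosts(text)) with the default lookup. Names that DNS does not know (localhost, for example) are skipped rather than reported.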

It might be even better to remove the hardcoded addresses of front ends from the hosts file, letting DNS function as the sole source of truth. But a full system restart should be performed after such a change, to ensure nothing else is broken by it. I leave that for another time.
