40m QIL Cryo_Lab CTN SUS_Lab TCS_Lab OMC_Lab CRIME_Lab FEA ENG_Labs OptContFac Mariner WBEEShop
  40m Log  Not logged in ELOG logo
Entry  Tue Jun 14 19:37:40 2016, jamie, Update, CDS, Overnight daqd test underway 
    Reply  Wed Jun 15 09:52:02 2016, jamie, Update, CDS, Very encouraging results from overnight split daqd test 
       Reply  Wed Jun 15 11:21:51 2016, jamie, Update, CDS, still work to do to transition to new configuration/code 
          Reply  Thu Jun 16 16:11:11 2016, jamie, Update, CDS, upgrade aborted for now 
Message ID: 12181     Entry time: Wed Jun 15 09:52:02 2016     In reply to: 12179     Reply to this: 12183
Author: jamie 
Type: Update 
Category: CDS 
Subject: Very encouraging results from overnight split daqd test 

laughVery encouraging results from the test last night.  The new configuration did not crash once overnight, and seemed to write out full, second trend, and minute trend frames without issueyes.  However, full validity of all the written out frames has not been confirmed.

overview

The configuration under test involves two separate daqd binaries instead of one.  We usually run with what is referred to as a "framebuilder" (fb) configuration:

  • fb: a single daqd binary that:
    • collect the data from the front ends
    • coallate full data into frame file format
    • calculates trend data
    • writes frame files to disk.

The current configuration separates the tasks into multiple separate binaries: a "data concentrator" (dc) and a "frame writer" (fw):

  • dc:
    • collect data from front ends
    • coallate full data into frame file format
    • broadcasts frame files over local network
  • fw:
    • receives frame files from broadcast
    • calculates trend data
    • writes frame files to disk

This configuration is more like what is run at the sites, where all the various components are separate and run on separate hardware.  In our case, I tried just running the two binaries on the same machine, with the broadcast going over the loopback interface.  None of the systems that use separated daqd tasks see the failures that we've been seeing with the all-in-one fb configuration (and other sites like AEI have also seen).

My guess frown is that there's some busted semaphore somewhere in daqd that's being shared between the concentrator and writer components.  The writer component probably aquires the lock while it's writing out the frame, which prevents the concentrator for doing what it needs to be doing while the frame is being written out.  That causes the concentrator to lock up and die if the frame writing takes too long (which it seems to almost necessarily do, especially when trend frames are also being written out).

results

The current configuration hasn't been tweaked or optimized at all.  There is of course basically no documentation on the meaning of the various daqdrc directives.  Hopefully I can get Keith Thorne to help me figure out a well optimized configuration.

There is at least one problem whereby the fw component is issuing an excessively large number of re-transmission requests:

2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 6 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 8 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 3 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 5 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 5 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 5 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 5 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 6 packets; port 7097
2016-06-15_09:46:23 [Wed Jun 15 09:46:23 2016] Ask for retransmission of 1 packets; port 7097

It's unclear why.  Presumably the retransmissions requests are being honored, and the fw eventually gets the data it needs.  Otherwise I would hope that there would be the appropriate errors.

The data is being written out as expected:

 full/11500: total 182G
drwxr-xr-x  2 controls controls 132K Jun 15 09:37 .
-rw-r--r--  1 controls controls  69M Jun 15 09:37 C-R-1150043856-16.gwf
-rw-r--r--  1 controls controls  68M Jun 15 09:37 C-R-1150043840-16.gwf
-rw-r--r--  1 controls controls  68M Jun 15 09:37 C-R-1150043824-16.gwf
-rw-r--r--  1 controls controls  69M Jun 15 09:36 C-R-1150043808-16.gwf
-rw-r--r--  1 controls controls  69M Jun 15 09:36 C-R-1150043792-16.gwf
-rw-r--r--  1 controls controls  68M Jun 15 09:36 C-R-1150043776-16.gwf
-rw-r--r--  1 controls controls  68M Jun 15 09:36 C-R-1150043760-16.gwf
-rw-r--r--  1 controls controls  69M Jun 15 09:35 C-R-1150043744-16.gwf

 trend/second/11500: total 11G
drwxr-xr-x  2 controls controls 4.0K Jun 15 09:29 .
-rw-r--r--  1 controls controls 148M Jun 15 09:29 C-T-1150042800-600.gwf
-rw-r--r--  1 controls controls 148M Jun 15 09:19 C-T-1150042200-600.gwf
-rw-r--r--  1 controls controls 148M Jun 15 09:09 C-T-1150041600-600.gwf
-rw-r--r--  1 controls controls 148M Jun 15 08:59 C-T-1150041000-600.gwf
-rw-r--r--  1 controls controls 148M Jun 15 08:49 C-T-1150040400-600.gwf
-rw-r--r--  1 controls controls 148M Jun 15 08:39 C-T-1150039800-600.gwf
-rw-r--r--  1 controls controls 148M Jun 15 08:29 C-T-1150039200-600.gwf
-rw-r--r--  1 controls controls 148M Jun 15 08:19 C-T-1150038600-600.gwf

 trend/minute/11500: total 152M
drwxr-xr-x 2 controls controls 4.0K Jun 15 07:27 .
-rw-r--r-- 1 controls controls  51M Jun 15 07:27 C-M-1150023600-7200.gwf
-rw-r--r-- 1 controls controls  51M Jun 15 04:31 C-M-1150012800-7200.gwf
-rw-r--r-- 1 controls controls  51M Jun 15 01:27 C-M-1150002000-7200.gwf

The frame sizes look more or less as expected, and they seem to be valid as determined with some quick checks with the framecpp command line utilities.

ELOG V3.1.3-