Very encouraging results from the test last night. The new configuration did not crash once overnight, and it appears to have written out full, second-trend, and minute-trend frames without issue. However, the validity of all the written frames has not yet been fully confirmed.
overview
The configuration under test involves two separate daqd binaries instead of one. We usually run with what is referred to as a "framebuilder" (fb) configuration:
- fb: a single daqd binary that:
  - collects the data from the front ends
  - collates full data into frame file format
  - calculates trend data
  - writes frame files to disk
The current configuration splits these tasks across two separate binaries: a "data concentrator" (dc) and a "frame writer" (fw):
- dc:
  - collects data from front ends
  - collates full data into frame file format
  - broadcasts frame files over local network
- fw:
  - receives frame files from broadcast
  - calculates trend data
  - writes frame files to disk
This configuration is more like what is run at the sites, where the various components are separate and run on separate hardware. In our case, I tried just running the two binaries on the same machine, with the broadcast going over the loopback interface. None of the systems that run separated daqd tasks see the failures that we, and other sites such as AEI, have been seeing with the all-in-one fb configuration.
My guess is that there's some busted semaphore somewhere in daqd that's being shared between the concentrator and writer components. The writer component probably acquires the lock while it's writing out the frame, which prevents the concentrator from doing what it needs to do in the meantime. That causes the concentrator to lock up and die if the frame writing takes too long (which it seems to almost necessarily do, especially when trend frames are also being written out).
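To make the guess above concrete, here is a toy sketch of that kind of lock contention. It is not daqd code and says nothing about how daqd is actually implemented. The function name `simulate` and its `write_time` and `deadline` parameters are made up for illustration. A "writer" thread holds a shared lock for the duration of a simulated frame write, and the "concentrator" gives up if it cannot get the lock within its deadline.

```python
import threading
import time


def simulate(write_time, deadline):
    """Return True if the 'concentrator' obtained the shared lock in time.

    write_time: seconds the hypothetical writer holds the lock (frame write).
    deadline:   seconds the hypothetical concentrator is willing to wait.
    """
    lock = threading.Lock()
    writer_has_lock = threading.Event()

    def writer():
        with lock:
            writer_has_lock.set()
            time.sleep(write_time)  # stand-in for writing a frame to disk

    thread = threading.Thread(target=writer)
    thread.start()
    writer_has_lock.wait()  # make sure the writer really holds the lock first

    # The concentrator now needs the same lock to keep collecting data.
    acquired = lock.acquire(timeout=deadline)
    if acquired:
        lock.release()
    thread.join()
    return acquired


if __name__ == "__main__":
    print("short write:", simulate(write_time=0.05, deadline=1.0))
    print("long write: ", simulate(write_time=0.5, deadline=0.05))
```

If the hypothesis is right, moving the writer into its own process (as in the dc/fw split) removes the shared lock altogether, which would be consistent with the split configuration not crashing.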
results
The current configuration hasn't been tweaked or optimized at all. There is of course basically no documentation on the meaning of the various daqdrc directives. Hopefully I can get Keith Thorne to help me figure out a well-optimized configuration.
There is at least one problem: the fw component is issuing an excessively large number of retransmission requests:
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 6 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 8 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 3 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 5 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 5 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 5 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 5 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 6 packets; port 7097
2016-06-15_09:46:23 [Wed Jun 15 09:46:23 2016] Ask for retransmission of 1 packets; port 7097
It's unclear why. Presumably the retransmission requests are being honored, and the fw eventually gets the data it needs. If not, I would hope to see the appropriate errors.
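To put a number on "excessively large", the requests could be tallied per second from the fw log. The following is a minimal sketch that assumes every relevant log line has exactly the format of the excerpt above. The name `tally_retransmissions` is hypothetical and not part of any daqd tooling.

```python
import re
from collections import Counter

# Matches lines such as:
# 2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 6 packets; port 7097
RETRANS_RE = re.compile(
    r"^(?P<stamp>\d{4}-\d{2}-\d{2}_\d{2}:\d{2}:\d{2}) .*"
    r"Ask for retransmission of (?P<count>\d+) packets; port (?P<port>\d+)"
)


def tally_retransmissions(lines):
    """Return (requests, packets): two Counters keyed by the leading timestamp."""
    requests = Counter()
    packets = Counter()
    for line in lines:
        match = RETRANS_RE.search(line)
        if match is None:
            continue  # not a retransmission line
        requests[match["stamp"]] += 1
        packets[match["stamp"]] += int(match["count"])
    return requests, packets


if __name__ == "__main__":
    import sys

    # Usage: python tally.py < fw.log   (the log file name is an assumption)
    reqs, pkts = tally_retransmissions(sys.stdin)
    for stamp in sorted(reqs):
        print(stamp, reqs[stamp], "requests,", pkts[stamp], "packets")
```

For the nine lines quoted above, this gives 8 requests totalling 43 packets at 09:46:22, and one request for 1 packet at 09:46:23.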
The data is being written out as expected:
full/11500: total 182G
drwxr-xr-x 2 controls controls 132K Jun 15 09:37 .
-rw-r--r-- 1 controls controls 69M Jun 15 09:37 C-R-1150043856-16.gwf
-rw-r--r-- 1 controls controls 68M Jun 15 09:37 C-R-1150043840-16.gwf
-rw-r--r-- 1 controls controls 68M Jun 15 09:37 C-R-1150043824-16.gwf
-rw-r--r-- 1 controls controls 69M Jun 15 09:36 C-R-1150043808-16.gwf
-rw-r--r-- 1 controls controls 69M Jun 15 09:36 C-R-1150043792-16.gwf
-rw-r--r-- 1 controls controls 68M Jun 15 09:36 C-R-1150043776-16.gwf
-rw-r--r-- 1 controls controls 68M Jun 15 09:36 C-R-1150043760-16.gwf
-rw-r--r-- 1 controls controls 69M Jun 15 09:35 C-R-1150043744-16.gwf
trend/second/11500: total 11G
drwxr-xr-x 2 controls controls 4.0K Jun 15 09:29 .
-rw-r--r-- 1 controls controls 148M Jun 15 09:29 C-T-1150042800-600.gwf
-rw-r--r-- 1 controls controls 148M Jun 15 09:19 C-T-1150042200-600.gwf
-rw-r--r-- 1 controls controls 148M Jun 15 09:09 C-T-1150041600-600.gwf
-rw-r--r-- 1 controls controls 148M Jun 15 08:59 C-T-1150041000-600.gwf
-rw-r--r-- 1 controls controls 148M Jun 15 08:49 C-T-1150040400-600.gwf
-rw-r--r-- 1 controls controls 148M Jun 15 08:39 C-T-1150039800-600.gwf
-rw-r--r-- 1 controls controls 148M Jun 15 08:29 C-T-1150039200-600.gwf
-rw-r--r-- 1 controls controls 148M Jun 15 08:19 C-T-1150038600-600.gwf
trend/minute/11500: total 152M
drwxr-xr-x 2 controls controls 4.0K Jun 15 07:27 .
-rw-r--r-- 1 controls controls 51M Jun 15 07:27 C-M-1150023600-7200.gwf
-rw-r--r-- 1 controls controls 51M Jun 15 04:31 C-M-1150012800-7200.gwf
-rw-r--r-- 1 controls controls 51M Jun 15 01:27 C-M-1150002000-7200.gwf
The frame sizes look more or less as expected, and the frames seem to be valid, judging by some quick checks with the framecpp command line utilities.
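One more cheap check that needs only the file names is continuity: each frame should start where the previous one ended. The sketch below assumes the names follow the `<site>-<type>-<GPS start>-<duration>.gwf` pattern seen in the listings above. The function name `find_gaps` is hypothetical. It does not open the files, so it complements the framecpp checks rather than replacing them.

```python
import re

# C-R-1150043856-16.gwf -> site "C", type "R", GPS start 1150043856, duration 16 s
FRAME_RE = re.compile(
    r"^(?P<site>[A-Z]+)-(?P<kind>[A-Z]+)-(?P<start>\d+)-(?P<duration>\d+)\.gwf$"
)


def find_gaps(filenames):
    """Return a list of (expected_start, actual_start) for each break in coverage."""
    spans = []
    for name in filenames:
        match = FRAME_RE.match(name)
        if match:  # silently skip anything that is not a frame file name
            spans.append((int(match["start"]), int(match["duration"])))
    spans.sort()

    gaps = []
    for (start, duration), (next_start, _) in zip(spans, spans[1:]):
        if start + duration != next_start:
            gaps.append((start + duration, next_start))
    return gaps


if __name__ == "__main__":
    import os
    import sys

    # Usage: python find_gaps.py full/11500
    for expected, actual in find_gaps(os.listdir(sys.argv[1])):
        print("expected frame at GPS", expected, "but next starts at", actual)
```

Run on the names listed above, this finds no gaps in the full frames (16 s steps) or the second trends (600 s steps). The three minute-trend files would be flagged, because their start times are 10800 s apart while the names declare a 7200 s duration. Whether that means missing data or is just how these frames are named is not established here.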