Message ID: 10632     Entry time: Wed Oct 22 21:06:33 2014     In reply to: 10507
Author: Chris 
Type: Update 
Category: DAQ 
Subject: 40m frames onto the cluster 

Quote:

 Dan Kozak is rsync transferring /frames from NODUS over to the LDAS grid. He's doing this without a BW limit, but even so it's going to take a couple of weeks. If nodus seems pokey or the net connection to the outside world is too tight, then please let me and him know so that he can throttle the pipe a little.

The recently observed daqd flakiness looks related to this transfer, which appears to still be ongoing:

nodus:~>ps -ef | grep rsync
controls 29089   382  5 13:39:20 pts/1   13:55 rsync -a --inplace --delete --exclude lost+found --exclude .*.gwf /frames/trend
controls 29100   382  2 13:39:43 pts/1    9:15 rsync -a --delete --exclude lost+found --exclude .*.gwf /frames/full/10975 131.
controls 29109   382  3 13:39:43 pts/1    9:10 rsync -a --delete --exclude lost+found --exclude .*.gwf /frames/full/10978 131.
controls 29103   382  3 13:39:43 pts/1    9:14 rsync -a --delete --exclude lost+found --exclude .*.gwf /frames/full/10976 131.
controls 29112   382  3 13:39:43 pts/1    9:18 rsync -a --delete --exclude lost+found --exclude .*.gwf /frames/full/10979 131.
controls 29099   382  2 13:39:43 pts/1    9:14 rsync -a --delete --exclude lost+found --exclude .*.gwf /frames/full/10974 131.
controls 29106   382  3 13:39:43 pts/1    9:13 rsync -a --delete --exclude lost+found --exclude .*.gwf /frames/full/10977 131.
controls 29620 29603  0 20:40:48 pts/3    0:00 grep rsync

Diagnosing the problem:

I logged into fb and ran "top". It showed fb waiting for disk I/O ~60% of the time (the "%wa" number in the header). There were 8 nfsd (NFS server daemon) processes running, several of them in status "D" (uninterruptible sleep, waiting for disk). The daqd logs ended with errors like the following, suggesting that it couldn't keep up with the flow of data:

[Wed Oct 22 18:58:35 2014] main profiler warning: 1 empty blocks in the buffer
[Wed Oct 22 18:58:36 2014] main profiler warning: 0 empty blocks in the buffer
GPS time jumped from 1098064730 to 1098064731

This all pointed to the possibility that the file transfer load was too heavy.
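
For reference, here's a sketch of the checks described above, using standard Linux tools (the exact ps field selection is my choice, not a capture from the original session):

fb:~>top                                      # watch the %wa field in the header for I/O wait
fb:~>ps -eo pid,stat,comm | awk '$2 ~ /^D/'   # list processes stuck in uninterruptible disk sleep ("D")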

Reducing the load:

The following configuration changes were applied on fb.

Edited /etc/conf.d/nfs to reduce the number of nfsd processes from 8 to 1:

OPTS_RPC_NFSD="1"

(was "8")

Ran "ionice" to raise the priority of the framebuilder process (daqd):

controls@fb /opt/rtcds/rtscore/trunk/src/daqd 0$ sudo ionice -c 1 -p 10964

And to lower the nfsd process to the best-effort class (-c 2):

controls@fb /opt/rtcds/rtscore/trunk/src/daqd 0$ sudo ionice -c 2 -p 11198

I also tried punishing nfsd with an even lower priority (the idle class, "-c 3"), but that caused the workstations to lag noticeably, since they depend on fb's NFS exports.
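
To confirm that the new classes stuck, ionice can also be run in query mode against the same PIDs (these were the daqd and nfsd PIDs at the time; pgrep daqd / pgrep nfsd would find them again after a restart):

fb:~>sudo ionice -p 10964    # should report "realtime"
fb:~>sudo ionice -p 11198    # should report "best-effort"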

After these changes, the %wa value dropped from ~60% to ~20%, and daqd seems to die less often, but some further throttling may still be in order.
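
If it comes to that, rsync's --bwlimit option (in KB/s) would be the natural knob on the sending side; e.g., a variant of the trend transfer above with a cap of roughly 10 MB/s (the limit value here is purely illustrative, not a recommendation):

rsync -a --inplace --delete --bwlimit=10000 --exclude lost+found --exclude .*.gwf /frames/trend ...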
