Quote: |
Dan Kozak is rsync transferring /frames from NODUS over to the LDAS grid. He's doing this without a BW limit, but even so its going to take a couple weeks. If nodus seems pokey or the net connection to the outside world is too tight, then please let me and him know so that he can throttle the pipe a little.
|
The recently observed daqd flakiness looks related to this transfer. It appears to still be ongoing:
nodus:~>ps -ef | grep rsync
controls 29089 382 5 13:39:20 pts/1 13:55 rsync -a --inplace --delete --exclude lost+found --exclude .*.gwf /frames/trend
controls 29100 382 2 13:39:43 pts/1 9:15 rsync -a --delete --exclude lost+found --exclude .*.gwf /frames/full/10975 131.
controls 29109 382 3 13:39:43 pts/1 9:10 rsync -a --delete --exclude lost+found --exclude .*.gwf /frames/full/10978 131.
controls 29103 382 3 13:39:43 pts/1 9:14 rsync -a --delete --exclude lost+found --exclude .*.gwf /frames/full/10976 131.
controls 29112 382 3 13:39:43 pts/1 9:18 rsync -a --delete --exclude lost+found --exclude .*.gwf /frames/full/10979 131.
controls 29099 382 2 13:39:43 pts/1 9:14 rsync -a --delete --exclude lost+found --exclude .*.gwf /frames/full/10974 131.
controls 29106 382 3 13:39:43 pts/1 9:13 rsync -a --delete --exclude lost+found --exclude .*.gwf /frames/full/10977 131.
controls 29620 29603 0 20:40:48 pts/3 0:00 grep rsync
Diagnosing the problem:
I logged into fb and ran "top". It said that fb was waiting for disk I/O ~60% of the time (according to the "%wa" number in the header). There were 8 nfsd (network file server) processes running with several of them listed in status "D" (waiting for disk). The daqd logs were ending with errors like the following suggesting that it couldn't keep up with the flow of data:
[Wed Oct 22 18:58:35 2014] main profiler warning: 1 empty blocks in the buffer
[Wed Oct 22 18:58:36 2014] main profiler warning: 0 empty blocks in the buffer
GPS time jumped from 1098064730 to 1098064731
This all pointed to the possibility that the file transfer load was too heavy.
Reducing the load:
The following configuration changes were applied on fb.
Edited /etc/conf.d/nfs to reduce the number of nfsd processes from 8 to 1:
OPTS_RPC_NFSD="1"
(was "8")
Ran "ionice" to raise the priority of the framebuilder process (daqd):
controls@fb /opt/rtcds/rtscore/trunk/src/daqd 0$ sudo ionice -c 1 -p 10964
And to reduce the priority of the nfsd process:
controls@fb /opt/rtcds/rtscore/trunk/src/daqd 0$ sudo ionice -c 2 -p 11198
I also tried punishing nfsd with an even lower priority ("-c 3"), but that was causing the workstations to lag noticeably.
After these changes the %wa value went from ~60% to ~20%, and daqd seems to die less often, but some further throttling may still be in order. |