40m Log
Entry  Thu Feb 19 15:45:43 2015, ericq, Update, CDS, Bad CDS behavior 
    Reply  Thu Feb 19 23:23:52 2015, Chris, Update, CDS, Bad CDS behavior 
       Reply  Fri Feb 20 12:08:10 2015, ericq, Update, CDS, Bad CDS behavior 
          Reply  Fri Feb 20 12:29:01 2015, ericq, Update, CDS, All optics damped 
       Reply  Fri Feb 20 14:44:47 2015, ericq, Update, CDS, Bad CDS behavior 
Message ID: 11052     Entry time: Thu Feb 19 23:23:52 2015     In reply to: 11051     Reply to this: 11053   11055
Author: Chris 
Type: Update 
Category: CDS 
Subject: Bad CDS behavior 

The frontends have some paths NFS-mounted from fb. fb is on the ragged edge of being I/O bound. I'd suggest moving those mounts to chiara. I tried increasing the number of NFS threads on fb (undoing the configuration change I'd previously made here) and it seems to help with EPICS smoothness -- although there are still occasional temporal anomalies in the time channels. The daqd flakiness (which was what led me to throttle NFS on fb in the first place) may now recur as well.
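For anyone who wants to confirm how many NFS server threads fb is actually running after a change like this, here is a minimal sketch. It assumes a Linux NFS server that exposes its statistics in /proc/net/rpc/nfsd, where the line starting with "th" gives the thread count as its first number. The function name nfsd_thread_count is made up for illustration and is not part of any existing script here.

```python
def nfsd_thread_count(stats_text):
    """Return the nfsd thread count found in /proc/net/rpc/nfsd-style text.

    Looks for the line whose first field is "th" and returns the integer
    that follows it. Returns None if no such line is present.
    """
    for line in stats_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "th":
            return int(fields[1])
    return None
```

On the server itself this could be called as nfsd_thread_count(open("/proc/net/rpc/nfsd").read()). Checking the value before and after the change is a quick way to see whether the new setting took effect.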



At about 10AM, the C1LSC frontend stopped reporting any EPICS information. The arms were locked at the time, and remained so for some hours, until I noticed the totally whited-out MEDM screens. The machine would respond to pings, but did not respond to ssh, so we had to manually reboot.

Soon thereafter, we had a global 15 min EPICS freeze, and have been in a weird state ever since. EPICS has come back (and frozen again), but the fast frontends are still wonky, even when EPICS is not frozen. Intermittently, the status blinkers and GPS time EPICS values freeze for multiple seconds at a time and update only sporadically. A StripTool trace of an IOP's GPS time value shows smooth portions lasting about 30 seconds, about 2 minutes apart. In between, the trace is a totally jagged step function. C1LSC needed to be power cycled again; trying to restart the models is tough, because the EPICS slowdown makes it hard to hit the BURT button, which is needed for the model to start without crashing.
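To put numbers on the freezes instead of eyeballing StripTool, something like the sketch below could be run on logged samples of a GPS time channel. It assumes the samples are already available as (wall-clock seconds, channel value) pairs in time order; how they get recorded is left open. The name find_stalls and the 2 second threshold are illustrative choices, not anything from the existing tools.

```python
def find_stalls(samples, min_stall=2.0):
    """Find intervals where a channel value stopped updating.

    samples: list of (wall_clock_seconds, value) pairs in time order.
    min_stall: shortest unchanged duration, in seconds, reported as a stall.
    Returns a list of (start, end) wall-clock times, where start is when the
    value last changed and end is when it next changed (or the final sample).
    """
    stalls = []
    if not samples:
        return stalls
    last_change_t, last_value = samples[0]
    for t, value in samples[1:]:
        if value != last_value:
            # The value moved; report the preceding flat stretch if it was long.
            if t - last_change_t >= min_stall:
                stalls.append((last_change_t, t))
            last_change_t, last_value = t, value
    # Handle a stall that is still in progress at the end of the record.
    final_t = samples[-1][0]
    if final_t - last_change_t >= min_stall:
        stalls.append((last_change_t, final_t))
    return stalls


# Hypothetical data: a 1 Hz counter sampled every 0.5 s that freezes from t=2 to t=5.
example = [(0.0, 100), (0.5, 100), (1.0, 101), (1.5, 101), (2.0, 102),
           (2.5, 102), (3.0, 102), (3.5, 102), (4.0, 102), (4.5, 102),
           (5.0, 105), (5.5, 105), (6.0, 106)]
print(find_stalls(example))
```

For a healthy channel that updates once per second, the list should come back empty. The length and spacing of the reported intervals would show whether the freezes really follow the roughly 2 minute pattern seen in the trace.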

The DAQ network switch and the martian switch inside were power cycled, to little effect. I'm not sure how to diagnose network issues with the frontends. Using iperf, I can show hundreds of Mbit/s of bandwidth between the control room machines and the frontends, but their EPICS is still totally wonky.

What can we do???

