ID   Date   Author   Type   Category   Subject
  13670   Thu Mar 8 14:41:25 2018   gautam   Update   General   CDS recovery after work at LSC rack

As I had found before, restarting the c1oaf model fixed the DC error. There is however still a pesky red indicator light on the "ADC0" in c1oaf. Trying to open up the ADC MEDM screen to investigate this further leads to the blank screen on the bottom right of Attachment #1. Probably has something to do with the fact that the model has an ADC block (because every model needs one?) but no signals are actually being piped to the model directly from the ADC.

Another observation, though I don't have any hypothesis as to why this was happening: on the c1sus machine, the c1sus model would frequently overclock and then eventually crash. I observed this behaviour at least 3 times between last night and now. The other models seemed fine; in fact, IMC stayed locked. Why should this have been the case? It remains to be seen whether this is somehow connected to the red DC indicator on c1oaf, though why should it be? Isn't the DC just concerned with writing data to frames? Any sort of IPC should be independent. Attachment #2 shows that there has been a definite increase in the maximum c1sus cycle time since yesterday (it's a 10-day minute trend plot of the model cycle time and its maximum). Why? Koji and I did switch off all the Sorensens at the LSC rack for about 30 mins, but why should this affect anything at 1X6? There are no red lights in either the c1lsc or c1sus expansion chassis. Curiously, the PRM also seems to be glitchy: as I'm sitting in the control room, I sporadically see a spot flashing vertically across the REFL CRT monitor. Note that nominally, with PRM misaligned, the REFL CRT should be dark. dmesg on c1sus doesn't shed any light on the issue.

Seems like some high level voodoo.


Edit 3:30pm: The model just crashed again. dmesg rather unhelpfully just says "ADC timeout". Unclear how to debug further. See Attachment #3.

Quote:

This required multiple hard reboots, but seems like all the RT models are back for now. The only indicator I can't explain is the red DC field on c1oaf. Also, the SUS model seems to be overclocking more frequently than usual, though I can't be sure. The "timing" field of this model's state word is RED, while the other models all seem fine. Not sure what could be going on.

Will debug further tomorrow, when I probably will have to do all this again as I'll need to recompile c1lsc for the ALS electronics test with the new ADC card from the differential AA board.

Attachment 1: CDS-recovery.png
Attachment 2: c1sus_timing.png
Attachment 3: c1sus_crashed.png
  13672   Thu Mar 8 18:15:42 2018   gautam   Update   General   CDS recovery after work at LSC rack

I was forced into a simultaneous power-cycle reboot of the three vertex FEs just now. I took the opportunity to completely disconnect the c1sus expansion chassis from all power and then restart it.

Everything is back up right now, and the weird timing issues I noticed in the sus model seem to be gone now (I'll need a longer baseline to be sure and I'll post a trend of the CPU timing tomorrow). It's disconcerting that apparently the only way to get everything back up and running is the nuclear option of power-cycling all FE related electronics. I was considering borrowing an ADC adapter card from the Y end and measuring the calibrated IR ALS noise with the digital system, but if I'm going to have to go through this whole dance each time I do a model recompile on c1lsc (which I'm going to have to in order to get the extra ADC recognized), I'm wondering if it's just better to wait till we get the new adapter cards we ordered. I think I'm going to work on tuning the input coupling into the fiber at EX in the next couple of days instead.

Quote:
 

Seems like some high level voodoo.


Edit 3:30pm: The model just crashed again. dmesg rather unhelpfully just says "ADC timeout". Unclear how to debug further. See Attachment #3.

 

  13477   Thu Dec 14 19:41:00 2017   gautam   Update   CDS   CDS recovery, NFS woes

[Koji, Jamie(remote), gautam]

Summary: The CDS system seems to be back up and functioning. But there seems to be some pending problems with the NFS that should be looked into.

We locked the Y-arm and hand-aligned the transmission to 1. Some pending problems with the ASS model (possibly symptomatic of something more general). Didn't touch the X-arm because we don't know exactly what the status of ETMX is.

Problems raised in the elog threads of 13474 and 13436 seem to be solved.


I would make a detailed post on how the problems were fixed, but unfortunately, most of what we did was not scientific/systematic/repeatable. Instead, I note here some general points (Jamie/Koji can add to / correct me):

  1. There is a "known" problem with unloading models on c1lsc. Sometimes, running rtcds stop <model> will kill the c1lsc frontend.
  2. Sometimes, when one machine on the dolphin network goes down, all 3 go down.
  3. The new FB/RCG means that some of the old commands no longer work. Specifically, telnet fb 8087 followed by shutdown (to fix DC errors) no longer works. Instead, ssh into fb1 and run sudo systemctl restart daqd_* (a consolidated command sketch is given after this list).
  4. Timing error on c1sus machine was linked to the mx_stream processes somehow not being automatically started. The "!mxstream restart" button on the CDS overview MEDM screen should run the necessary commands to restart it. However, today, I had to manually run sudo systemctl start mx_stream on c1sus to fix this error. It is a mystery why the automatic startup of this process was disabled in the first place. Jamie has now rectified this problem, so keep an eye out.
  5. c1oaf persistently reported DC errors (0x2bad) that couldn't be addressed by running mxstream restart or restarting the daqd processes on FB1. Restarting the model itself (i.e. rtcds restart c1oaf) fixed this issue (though of course I took the risk of having to go into the lab and hard-reboot 3 machines).
  6. At some point, we thought we had all the CDS lights green - but at that point, the END FEs crashed, necessitating Koji->EX and Gautam->EY hard reboots. This is a new phenomenon. Note that the vertex machines were unaffected.
  7. At some point, all the DC lights on the CDS overview screen went white - at the same time, we couldn't ssh into FB1, although it was responding to ping. After ~2mins, the green lights came back and we were able to connect to FB1. Not sure what to make of this.
  8. While trying to run the dither alignment scripts for the Y-arm, we noticed some strange behaviour:
    • Even when there was no signal (looking at EPICS channels) at the input of the ASS servos, the output was fluctuating wildly by ~20cts-pp.
    • This is not simply an EPICS artefact, as we could see corresponding motion of the suspension on the CCD.
    • A possible clue is that when I run the "Start Dither" script from the MEDM screen, I get a bunch of error messages (see Attachment #2).
    • Similar error messages show up when running the LSC offset script for example. Seems like there are multiple ports open somehow on the same machine?
    • There are no indicator lights on the CDS overview screen suggesting where the problem lies.
    • Will continue investigating tomorrow.
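
For reference, here is a consolidated sketch of the recovery commands mentioned in points 3-5 above (machine and model names are the examples from this entry; this is a summary, not a tested recipe):

# on fb1: restart the daqd processes (clears most DC / 0x2bad errors)
ssh fb1
sudo systemctl restart daqd_*

# on the affected front end (e.g. c1sus): restart the mx_stream sender,
# either with the "mxstream restart" button on the CDS overview screen or by hand
ssh c1sus
sudo systemctl start mx_stream

# if a single model (e.g. c1oaf) still shows the DC error, restart just that model
rtcds restart c1oaf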

Some other general remarks:

  1. ETMX watchdog remains shutdown.
  2. ITMY and BS oplevs have been hijacked for HeNe RIN / Oplev sensing noise measurement, and so are not enabled.
  3. Y arm trans QPD (Thorlabs) has large 60Hz harmonics. These can be mitigated by turning on a 60Hz comb filter, but we should check if this is some kind of ground loop. The feature is much less evident when looking at the TRANS signal on the QPD.

UPDATE 8:20pm:

Koji suggested trying to simply restart the ASS model to see if that fixes the weird errors shown in Attachment #2. This did the trick. But we are now faced with more confusion - during the restart process, the various indicators on the CDS overview MEDM screen froze up, which is usually symptomatic of the machines being unresponsive and requiring a hard reboot. But we waited for a few minutes, and everything mysteriously came back. Over repeated observations and looking at the dmesg of the frontend, the problem seems to be connected with an unresponsive NFS connection. Jamie had noted some time ago that the NFS seems unusually slow. How can we fix this problem? Is it feasible to have a dedicated machine that is not FB1 do the NFS serving for the FEs?

Attachment 1: CDS_14Dec2017.png
Attachment 2: CDS_errors.png
  13480   Fri Dec 15 01:53:37 2017   jamie   Update   CDS   CDS recovery, NFS woes
Quote:

I would make a detailed post on how the problems were fixed, but unfortunately, most of what we did was not scientific/systematic/repeatable. Instead, I note here some general points (Jamie/Koji can add to / correct me):

  1. There is a "known" problem with unloading models on c1lsc. Sometimes, running rtcds stop <model> will kill the c1lsc frontend.
  2. Sometimes, when one machine on the dolphin network goes down, all 3 go down.
  3. The new FB/RCG means that some of the old commands no longer work. Specifically, telnet fb 8087 followed by shutdown (to fix DC errors) no longer works. Instead, ssh into fb1 and run sudo systemctl restart daqd_*.

This should still work, but the address has changed.  The daqd was split up into three separate binaries to get around the issue with the monolithic build that we could never figure out.  The address of the data concentrator (DC) (which is the thing that needs to be restarted) is now 8083.
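
As a concrete sketch of that updated procedure (host and port as given above; the shutdown command is the same one used with the old fb):

telnet fb1 8083
shutdown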

Quote:

UPDATE 8:20pm:

Koji suggested trying to simply restart the ASS model to see if that fixes the weird errors shown in Attachment #2. This did the trick. But we are now faced with more confusion - during the restart process, the various indicators on the CDS overview MEDM screen froze up, which is usually symptomatic of the machines being unresponsive and requiring a hard reboot. But we waited for a few minutes, and everything mysteriously came back. Over repeated observations and looking at the dmesg of the frontend, the problem seems to be connected with an unresponsive NFS connection. Jamie had noted some time ago that the NFS seems unusually slow. How can we fix this problem? Is it feasible to have a dedicated machine that is not FB1 do the NFS serving for the FEs?

I don't think the problem is fb1.  The fb1 NFS is mostly only used during front end boot.  It's the rtcds mount that's the one that sees all the action, which is being served from chiara.

  13481   Fri Dec 15 11:19:11 2017   gautam   Update   CDS   CDS recovery, NFS woes

Looking at the dmesg on c1iscex for example, at least part of the problem seems to be associated with FB1 (192.168.113.201, see Attachment #1). The "server" can be unresponsive for O(100) seconds, which is consistent with the duration for which we see the MEDM status lights go blank, and the EPICS records get frozen. Note that the error timestamped ~4000 was from last night, which means there have been at least 2 more instances of this kind of freeze-up overnight.
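
For future reference, a couple of quick checks one could run on a front end during such a freeze (a sketch, not commands from the original log; the IP is fb1's address quoted above):

# look for kernel messages of the form "nfs: server 192.168.113.201 not responding"
dmesg | grep -i nfs

# see whether fb1 answers at all while the MEDM lights are blank
ping -c 5 192.168.113.201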

I don't know if this is symptomatic of some more widespread problem with the 40m networking infrastructure. In any case, all the CDS overview screen lights were green today morning, and MC autolocker seems to have worked fine overnight.

I have also updated the wiki page with the updated daqd restart commands.

Unrelated to this work - Koji fixed up the MC overview screen such that the MC autolocker button is now visible again. The problem seems to have to do with my migrating some of the c1ioo EPICS channels from the slow machine to the fast system, as a result of which the EPICS variable type changed from "ENUM" to something that was not "ENUM". In any case, the button exists now, and the MC autolocker blinky light is responsive to its state.

Quote:

I don't think the problem is fb1.  The fb1 NFS is mostly only used during front end boot.  It's the rtcds mount that's the one that sees all the action, which is being served from chiara.

 

Attachment 1: NFS.png
Attachment 2: MCautolocker.png
  9449   Fri Dec 6 21:38:27 2013   Koji   Update   LSC   CDS related activities for LSC

I worked on the CDS-related stuff for LSC yesterday and today.


1. Slow machines:

I checked the database files for c1iscaux and c1iscaux2 (slow machines). They are mainly
used for the control of the LSC whitening filters. The channel names were totally random, since we
had reconfigured the RF PDs while leaving the channel names unchanged.

- The database has now been modified so that the PD names and the channels correspond.
- saverestore.req and autoBurt.req were also changed accordingly.

- PD interface channels are completely random. Don't use them.
- I found the whitening of the DCPDs is not effective.

- We need to clean up /cvs/cds/caltech/target directory. The autoBurt requests in the old targets
are making unnecessary burt files.

2. LSC screens

- The channel names on the LSC OVERVIEW screen were modified. (Attachment 1)
- A new LSC Whitening screen was made. (Attachment 2)

3. LSC screen generator

Touching the main LSC screen directly is very tough, so the screen was split into several sub-screens
that are combined with a command.

/opt/rtcds/caltech/c1/medm/c1lsc/master/generateLSCscreen/generateLSCscreen.py

This command combines the multiple adl files into a single file with x & y offsets.
This way, you can work on each section of the screen separately.
Also, moving the blocks around is easy.
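
A hypothetical invocation (assuming the script takes no arguments and writes the combined screen next to the sub-screen files; check the script itself before relying on this):

cd /opt/rtcds/caltech/c1/medm/c1lsc/master/generateLSCscreen
./generateLSCscreen.py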

4. LSC Code Bug?

During the screen making, I found that a couple of the whitening switches are not
working properly.
e.g. When AS165 (either I or Q) FM1 is activated through the whitening trigger,
the MSB bit (bit15) of the binary I/O (C1:LSC-BIO_0_0) does not toggle.

Similarly, ASDC FM1 does not toggle bit15 of C1:LSC-BIO_0_1.

The other channels seem OK.

At first, I thought this was a bug in the "Bit2Word" block. But an individual test of the block showed that
the block is not guilty. So why is only bit15 malfunctioning???

 

Attachment 1: LSC1.png
Attachment 2: LSC2.png
  3034   Wed Jun 2 11:25:16 2010   josephb, alex   Update   CDS   CDS saga (aka the bad code saga)

Alex updated the awg.par file to handle all the testpoints.  Basically it's very similar to testpoint.par, but the prognum lines have to be 1 higher than the corresponding prognum in testpoint.par.  An entry looks like:

[C1-awg0]
hostname=192.168.1.2
prognum=0x31001002

After running "diag -i" and seeing some RPC number conflicts, we went into /cvs/cds/caltech/cds/target/gds/param/diag_C.conf and changed the line from

&chn * *  192.168.1.2 822087685 1

to

&chn * *  192.168.1.2 822087700 1

The number represents an RPC number.  This was conflicting with the RPC number associated with the awgtpman processes.  We then had to update the /etc/rpc file as well.  At the end we changed chnconf 822087685 to chnconf 822087700.  We then ran /usr/sbin/xinetd reload.

Lastly we edited the /etc/xinetd.d/chnconf file line

server_args             = /cvs/cds/caltech/target/gds/param/tpchn_C4.par /cvs/cds/caltech/target/gds/param/tpchn_C5.par

to

server_args             = /cvs/cds/caltech/target/gds/param/tpchn_C1.par /cvs/cds/caltech/target/gds/param/tpchn_C2.par /cvs/cds/caltech/target/gds/param/tpchn_C3.par /cvs/cds/caltech/target/gds/param/tpchn_C4.par /cvs/cds/caltech/target/gds/param/tpchn_C5.par /cvs/cds/caltech/target/gds/param/tpchn_C6.par /cvs/cds/caltech/target/gds/param/tpchn_C7.par /cvs/cds/caltech/target/gds/param/tpchn_C8.par /cvs/cds/caltech/target/gds/param/tpchn_C9.par
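
A sketch of how one might verify the change took effect (these particular checks are assumptions, not commands from the original entry):

# confirm the chnconf RPC number now matches the new value
grep chnconf /etc/rpc

# re-run the GDS diagnostic that originally showed the RPC conflicts
diag -i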

 

Alex also recompiled the frame builder code to be able to handle more than 7 front ends.  This involved tracking down a newer version of libtestpoint.so on c1iscex and moving it over to megatron, then going in and by hand adding the ability to have up to 10 front ends connected.

Alex has said he doesn't like this code and would like it to dynamically allocate properly for any number of servers rather than having a dumb hard coded limit.

Other changes he needs to make:

1) Get rid of set dcu_rate ## = 16384 type lines in the daqrc file.  That information is available from the /caltech/chans/C1LSC.ini type files which are automatically generated when you compile a model.  This means not having to go in by hand to update these in daqrc.

2) Get some awg.par and testpoint.par rules, so that these are automatically updated when you build a model.  Make it so it automatically assigns a prognum when read in rather than having to hard code them in by hand.

3) Slave the awgtpmans to a single clock running from the IO processor x00. This ensures they are all in sync.

 

 

 

  14149   Thu Aug 9 12:31:13 2018   gautam   Update   CDS   CDS status update

The model seems to have run without issues overnight. Not completely related, but the MC1 shadow sensor signals also don't show any abnormal excursions to negative values in the last 48 hours. I'm thinking about re-connecting the satellite box (but preserving the breakout setup at 1X6 for a while longer) and re-locking the IMC. I'll also start c1ass on the c1lsc frontend. I would say that the other models on c1lsc (i.e. c1oaf, c1cal, c1daf) aren't really necessary for basic IFO operation.

Quote:

As part of this slow but systematic debugging, I am turning on the c1lsc model overnight to see if the model crashes return.

  14166   Wed Aug 15 21:27:47 2018   gautam   Update   CDS   CDS status update

Starting c1cal now; let's see if the other c1lsc FE models are affected at all... Moreover, since MC1 seems to be well-behaved, I'm going to restore the nominal eurocrate configuration (sans extender board) tomorrow.

  14192   Tue Sep 4 10:14:11 2018   gautam   Update   CDS   CDS status update

c1lsc crashed again. I've contacted Rolf/JHanks for help since I'm out of ideas on what can be done to fix this problem.

Quote:

Starting c1cal now; let's see if the other c1lsc FE models are affected at all... Moreover, since MC1 seems to be well-behaved, I'm going to restore the nominal eurocrate configuration (sans extender board) tomorrow.

  14193   Wed Sep 5 10:59:23 2018   gautam   Update   CDS   CDS status update

Rolf came by today morning. For now, we've restarted the FE machine and the expansion chassis (note that the correct order in which to do this is: turn off computer--->turn off expansion chassis--->turn on expansion chassis--->turn on computer). The debugging measures Rolf suggested are (i) to replace the old generation ADC card in the expansion chassis which has a red indicator light always on and (ii) to replace the PCIe fiber (2010 make) running from the c1lsc front-end machine in 1X6 to the expansion chassis in 1Y3, as the manufacturer has suggested that pre-2012 versions of the fiber are prone to failure. We will do these opportunistically and see if there is any improvement in the situation.

Another tip from Rolf: if the c1lsc FE is responsive but the models have crashed, then doing sudo reboot by ssh-ing into c1lsc should suffice* (i.e. it shouldn't take down the models on the other vertex FEs, although if the FE is unresponsive and you hard reboot it, this may still be a problem). I've modified the c1lsc reboot script accordingly.

* Seems like this can still lead to the other vertex FEs crashing, so I'm leaving the reboot script as is (so all vertex machines are softly rebooted when c1lsc models crash).
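
A minimal sketch of that soft-reboot path (model names are examples from this thread; the IOP-first start order is assumed):

# only if the c1lsc FE is still responsive
ssh c1lsc
sudo reboot

# once it is back up, start the IOP first, then the user models
rtcds start c1x04
rtcds start c1lsc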

Quote:

c1lsc crashed again. I've contacted Rolf/JHanks for help since I'm out of ideas on what can be done to fix this problem.

  9077   Wed Aug 28 00:41:23 2013   Jenne   Update   CDS   CDS svn commits not happening

svn status update. asx, als and ioo were found not committed. Not sure who modified ioo last after Jenne.

//edit Manasa - edited the elog instead of replying //

  9079   Wed Aug 28 05:21:58 2013   manasa   Update   CDS   CDS svn commits not happening

I am responsible for missed svn commits with als and asx. I have committed them.

But I have not modified anything with ioo in the last few weeks.

 

  13166   Fri Aug 4 09:07:28 2017   rana   Update   CDS   CDS system essentially NOT fully recovered

Tried getting trends with dataviewer just now since Jamie re-enabled the minute_raw frame writing yesterday. Unable to get trends still:

Connecting to NDS Server fb1 (TCP port 8088)
Connecting.... done
Server error 18: trend data is not available
datasrv: DataWriteTrend failed in daq_send().
unknown error returned from daq_send()T0=17-08-04-08-02-22; Length=28800 (s)
No data output.

  13153   Mon Jul 31 18:44:40 2017   Jamie   Update   CDS   CDS system essentially fully recovered

The CDS system is mostly recovered at this point.  The mx_streams are all flowing from all front ends, and from all models, and the daqd processes are receiving them and writing the data to frames:

Remaining unresolved issues:

  • IFO needs to be fully locked to make sure ALL components of all models are working.
  • The remaining red status lights are from the "FB NET" diagnostics, which are reflecting a missing status bit from the front end processes due to the fact that they were compiled with an earlier RCG version (3.0.3) than the mx_streams were (3.3+/trunk).  There will be a new release of the RTS soon, at which point we'll compile everything from the same version, which should get us all green again.
  • The entire system has been fully modernized, to the target CDS reference OS (Debian jessie) and more recent RCG versions.  The management of the various RTS components, both on the front ends and on fb, have as much as possible been updated to use the modern management tools (e.g. systemd, udev, etc.).  These changes need to be documented.  In particular...
  • The fb daqd process has been split into three separate components, a configuration that mirrors what is done at the sites and appears to be more stable:
    • daqd_dc: data concentrator (receives data from front ends)
    • daqd_fw: receives frames from dc and writes out full frames and second/minute trends
    • daqd_rcv: NDS1 server (raises test points and receives archive data from frames from 'nds' process)
    The "target" directory for all of these new components is:
    • /opt/rtcds/caltech/c1/target/daqd
    All of these processes are now managed under systemd supervision on fb, meaning the daqd restart procedure has changed.  This needs to be simplified and clarified (a status-check sketch is given at the end of this entry).
  • Second trend frames are being written, but for some reason they're not accessible over NDS.
  • Have not had a chance to verify minute trend and raw minute trend writing yet.  Needs to be confirmed.
  • Get wiper script working on new fb.
  • Front end RTS kernel will occasionally crash when the RTS modules are unloaded.  Keith Thorne apparently has a kernel version with a different set of patches from Gerrit Kuhn that does not have this problem.  Keith's kernel needs to be packaged and installed in the front end diskless root.
  • The models accessing the dolphin shared memory will ALL crash when one of the front end hosts on the dolphin network goes away.  This results in a boot fest of all the dolphin-enabled hosts.  Need to figure out what's going on there.
  • The RCG settings snapshotting has changed significantly in later RCG versions.  We need to make sure that all burt backup type stuff is still working correctly.
  • Restoration of /frames from old fb SCSI RAID?
  • Backup of entirety of fb1, including fb1 root (/) and front end diskless root (/diskless)
  • Full documentation of rebuild procedure from Jamie's notes.
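
A sketch of how the new daqd components can be inspected under systemd (unit names are assumed to match the component names listed above):

# on fb: check the state of the three daqd components
sudo systemctl status daqd_dc daqd_fw daqd_rcv

# look at recent log output from one of them
sudo journalctl -u daqd_dc --since "1 hour ago"

# restart all of them at once (as noted elsewhere in this thread)
sudo systemctl restart daqd_*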
  11687   Tue Oct 13 17:04:54 2015   ericq   Update   CDS   CDS things

After some discussion at last week's 40m meeting, I changed the interval at which daqd tries to write out minute trends from hourly to every two hours.

This has eliminated the hourly crashes. daqd still crashes sometimes, but only a few times per day.

However, looking at the oplev summary pages that actually use the minute trends, it looks like they're only sporadically getting successfully written out.


Also, I was having a lot of problems with the frontends' EPICS processes dying when I would try to update the SDF table. I rebuilt all of the frontends with RCG 2.9.6, which differs from the 2.9.4 that we had been running by SDF bugfixes and an RMS calculation bugfix. The SDF procedures are much more stable now. 

I have not yet discovered anything broken by this change, and the tests I made for the last upgrade were all fine; last week's tiny DRFPMI lock was achieved after this change.

  3829   Sat Oct 30 05:27:53 2010   yuta   Summary   CDS   CDS time delay measurement

Motivation:
  We want to know the time delay of CDS in the IOP scheme.

Setup:
[Setup diagram: delaysetup.png]

What I did:
1. Unplugged the SCSI cables from ADC card 2 and DAC card 0 on the C1SUS machine.
   ADC card 2 is ADC 0
   DAC card 0 is DAC 0

2. Measured the transfer function between ADC and DAC with the SR785 and compared it with the downsampling filter in the IOP with 65536Hz (=4x16384Hz) sampling frequency.

  As ADC_0_0 corresponds to the PRM ULSEN input and DAC_0_0 corresponds to the ULCOIL output, we turned all the filters off and set the gains to 0 or 1 so that the TF from ULSEN to ULCOIL would ideally be 1. (see this wiki page for channel assignments)

  The filter coefficients for the down sampling filter was found in;
    /cvs/cds/rtcds/caltech/c1/core/advLigoRTS/src/fe/controller.c
  It was named feCoeff4x.

static double feCoeff4x[9] =
        {0.014805052402446,
        -1.71662585474518,    0.78495484219691,   -1.41346289716898,   0.99893884152400,
        -1.68385964238855,    0.93734519457266,    0.00000127375260,   0.99819981588176};


3. Calculated the time delay dt using the following formula;
  dt = [pm - pc]/f/360deg    (pm: measured phase, pc: calculated phase from feCoeff4x, f: frequency)

4. Measured TF between the SCSI cables to estimate the effect of the cables and others.
  Disconnected the SCSI cables from the ADC and DAC, and connected A and B (see setup).
  I measured with the SR785 input coupling set to both DC and AC to see what happens.

Result:
  [time delay of the CDS]  (left, middle)
    The time delay gets larger with frequency. The time delay seems to be -175 usec at DC.
    However, the gain seems a little different from my expectation(feCoeff4x). So, there are maybe other filters I don't know.
    I neglected TF of upsampling this time.

  [cable and other effect]  (right)
    The effect on the time delay measurement was tiny, smaller by a factor of 10^3 to 10^4 (a few nsec).
    But the total cable length was about 5 m; assuming a signal speed of 0.6c, the delay should be about 30 nsec.
    I don't know what's happening.

[Plot: CDSdelay.png]

Plan:
  - make a model that does not go through IOP and see the delay caused by IOP

By the way:
  fb daqd has now been running for hours!
  All the FEs are running (c1sus, rms, mcs).

  3830   Sat Oct 30 14:35:43 2010   Koji   Summary   CDS   CDS time delay measurement

Unsatisfactory.

Neglecting the digital anti-imaging filter causes the discrepancy. You must take your digital filter into account twice.

I attached the slides I made during my visit for March LVC '09. P.5 would be useful.

Quote:

Result:
  [time delay of the CDS]  (left, middle)
    The time delay gets larger with frequency. The time delay seems to be -175 usec at DC.
    However, the gain seems a little different from my expectation(feCoeff4x). So, there are maybe other filters I don't know.
    I neglected TF of upsampling this time.

 

Attachment 1: CDS_system_investigation_090323.pdf
  3838   Mon Nov 1 15:47:15 2010   yuta   Summary   CDS   CDS time delay measurement

Background:
  I measured the CDS time delay last week, but because of my lack of understanding of the system, it was incorrect.
  The IOP has an anti-aliasing filter before downsampling from 64kHz (65536Hz) to 16kHz (16384Hz), and also an anti-imaging filter before upsampling from 16kHz to 64kHz.
  So, I should have taken feCoeff4x into account twice.
[Figure: downupsampling.png]

Result:
  The TF agreed well with feCoeff4x applied twice, and the CDS time delay was -123.5 usec.
[Plot: CDSdelay2.png]


Plan:
 - make AWG (diaggui TF measurement, tdssine) work
 - check input/output filter switching (using tdssine & tdsdmd)
 - measure openloop TF of MC suspension damping
 - divide it by my expectation and see if there are any filters I don't know about

Quote:

Unsatisfactory.

Neglecting the digital anti-imaging filter causes the discrepancy. You must take your digital filter into account twice.

I attached the slides I made during my visit for March LVC '09. P.5 would be useful.

 

  3839   Mon Nov 1 16:43:24 2010   Koji   Summary   CDS   CDS time delay measurement

Um, Beautiful.

Actually, 123.5 usec is almost exactly twice 1/16384Hz.
Because of the loop, we expect a 1/16384Hz delay. I wonder where the other delay comes from.

In order to understand the behaviour of the system, can I ask you to test the following things?

1) What is the delay without the IOP at sampling frequencies of 16k, 32k, 64k?

2) What is the delay with the IOP at sampling frequencies of 32k, 64k?

Quote:

Result:
  The TF agreed well with feCoeff4x applied twice, and the CDS time delay was -123.5 usec.
[Plot: CDSdelay2.png]

 

  3961   Sat Nov 20 03:37:11 2010   yuta   Summary   CDS   CDS time delay measurement - the ripple

(Koji, Joe, Yuta)

Motivation:
  We wanted to know more about CDS.

Setup:
  Same as in elog #3829.

What we did:

  1. Made test RT models c1tst and c1nio for c1iscex.
     c1tst has only 2 filter modules (the minimum for a model), 2 inputs, 2 outputs, and it runs with the IOP c1x01.
     c1nio is the same as c1tst except that it runs (or should run) without an IOP.

  2. Measured the time delay from ADC through DAC on different machines and at different sampling rates by measuring transfer functions.

  3. c1nio (without IOP) didn't seem to be running correctly and we couldn't measure the TF.
     A "1 PPS" error appeared in the GDS screen (C1:FEC-39_TIME_ERR).
     It looks like c1nio is receiving the signal, as we can see in the MEDM screen, but the signal doesn't come out of the DAC.

TF we expected:
  All the filters and gains are set to 1.

  We have the DAC's TF when putting the 64K signal out to the analog world.
    D(f)=exp(-i*pi*f*Ts)*sin(pi*f*Ts)/(pi*f*Ts)  (Ts: sample time)

  We have an AA filter when downsampling and an AI filter when upsampling.
    A(f)=G*(1+b11/z+b12/z/z)/(1+a11/z+a12/z/z)*(1+b21/z+b22/z/z)/(1+a21/z+a22/z/z)       z=exp(i*2*pi*f*Ts)
  Coefficients can be found in /cvs/cds/rtcds/caltech/c1/core/advLigoRTS/src/fe/controller.c.

/* Coeffs for the 2x downsampling (32K system) filter */
static double feCoeff2x[9] =
        {0.053628649721183,
        -1.25687596603711,    0.57946661417301,    0.00000415782507,    1.00000000000000,
        -0.79382359542546,    0.88797791037820,    1.29081406322442,    1.00000000000000};
/* Coeffs for the 4x downsampling (16K system) filter */
static double feCoeff4x[9] =
    {0.014805052402446, 
    -1.71662585474518,    0.78495484219691,   -1.41346289716898,   0.99893884152400,
    -1.68385964238855,    0.93734519457266,    0.00000127375260,   0.99819981588176};


  For the 64K system, we expect H=1.

  We also have a delay.
    S(f)=exp(-i*2*pi*f*dt)   (dt: delay time)

  So, total TF we expect is;
    H(f)=a*A(f)^2*D(f)*S(f)
  a is a constant depending on the range of the ADC and DAC (I think). Currently, a=1/4.

  We may need to think about the TF when upsampling. (D(f) is the TF of the 64K-to-analog output.)

Result:

  Example plot is attached.
  For other plots and the raw data, see /cvs/cds/caltech/users/yuta/scripts/CDSdelay2/ directory.
  As you can see, the TFs are slightly different from what we expect.
  They show a ripple we don't understand near the cutoff frequency.

  If we ignore the ripple, here are the delay times for each condition:

data file    host    FE    IOP        rate    sample time    delay        delay/Ts
c1rms16K.dat    c1sus      c1rms    adcSlave    16K    61.0usec    110.4usec    1.8
c1scx16K.dat    c1iscex    c1scx    adcSlave    16K    61.0usec     85.5usec    1.4
c1tst16K.dat    c1iscex    c1tst    adcSlave    16K    61.0usec     84.3usec    1.4
c1tst32K.dat    c1iscex    c1tst    adcSlave    32K    30.5usec     53.7usec    1.8
c1tst64K.dat    c1iscex    c1tst    adcSlave    64K    15.3usec     38.4usec    2.5

  The delay time shown above does not include the delay of the DAC. To include it, add 7.6 usec (Ts/2).

  - the delay time is different for different machines
  - the number of filters (c1scx is full of filters for the ETMX suspension, c1tst has only 2) doesn't seem to affect the delay time much
  - the higher the sampling rate, the larger the (delay time)/(sample time) ratio

Plan:

 - figure out how to run a model without IOP
 - where do the ripples come from?
 - why didn't we see a significant ripple in the previous measurement on c1sus?

Attachment 1: c1tst16Kdelay.png
  4302   Tue Feb 15 15:06:25 2011   josephb   Update   CDS   CDS todo list for tomorrow morning

Currently, there is a test directory called /opt/rtcds/caltech/c1/new_core where we have the latest svn checkout.  Tomorrow (after everything works), it will become the core directory.

1) Modify the /diskless/root/etc/ld.so.cache file on the fb machine.  This is done by logging into fb, going to /etc/ld.so.conf.d/, modifying epics-x86_64.conf to only have .10 stuff, and running sudo /sbin/ldconfig.  Copy the newly generated /etc/ld.so.cache file to /diskless/root/etc/ (see the sketch at the end of this list).

2) Modify the rc.local file on the fb machine in /diskless/root/etc/ to take advantage of the new subscripts and init.d/ start scripts.

3) Add the no_rfm_dma to all the iop models (c1x01,c1x02,c1x03,c1x04,c1x05).

4) Rebuild all front end models with new code.  Install.

5) Build awgtpman and mx_streams with new code.

6) Rerun activateDaq.py (to fix channel names from all the rebuilt code).

7) Double check Burt request files have the switch fix.

8) Restart the front ends.

9) Restart the frame builder.

10) Check channels, excitations, RFM connections.

11) Check that Monit is working.
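
A sketch of step 1 written out as shell commands (paths from the item above; the edit to epics-x86_64.conf is of course done by hand):

# on fb
cd /etc/ld.so.conf.d/
sudo vi epics-x86_64.conf        # keep only the .10 entries
sudo /sbin/ldconfig
sudo cp /etc/ld.so.cache /diskless/root/etc/ld.so.cache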

  3036   Wed Jun 2 17:34:33 2010   josephb, alex, valera   Update   CDS   CDS updates

From what I understand, Alex rewrote portions of the framebuilder and testpoint codes and then recompiled them in order to get more than 1 testpoint per front end working.   I've tested up to 5 testpoints at once so far, and it worked.

We also have a new noise component added to the RCG code.  This piece of code uses the random number generator from chapter 7.1 of Numerical Recipes, Third Edition, to generate uniform numbers from 0 to 1.  Placing a filter bank after it should give us sufficient flexibility in generating the necessary noise types.  We did a coherence test between two instances of this noise piece, and they looked pretty incoherent.  Valera will add a picture of it to this elog when it finishes 1000 averages.

I'm in the process of propagating the old suspension control filters to the new RCG filter banks to give us a starting point.  Tomorrow Valera and I are planning to choose a subset of the plant filters  and put them in, and then work out some initial control filters to correspond to the plant.  I also need to think about adding the anti-aliasing filters and whitening/dewhitening filters.

 

  4445   Mon Mar 28 15:18:04 2011   josephb   Update   CDS   CDS updates on Friday

Last Friday, we discovered a bug in the RCG where the delay part was not actually delaying.  We reported this to Alex, who promptly put in a fix the same day.  This allowed Matt's newly proposed frequency discriminator to work properly.

It also required a checkout of the latest RCG code (revision 2328) and a rebuild of the various codes.  We first backed up all the kernel modules and executables, such as mbuf.ko and awgtpman.

We did the following:

1) Log into the fb machine.

2) Go to /opt/rtcds/caltech/c1/core/advLigoRTS/src/drv/mbuf and run make.

3) Use "sudo cp" to copy the newly built mbuf.ko file to /diskless/root/modules/2.6.34.1/kernel/drivers/mbuf/

4) Go to /cvs/cds/rtcds/caltech/c1/core/advLigoRTS/src/gds and run make.

5) Copy the newly built awgtpman executable to /opt/rtcds/caltech/c1/target/gds/bin/

6) Go to /opt/rtcds/caltech/c1/core/advLigoRTS/src/mx_stream/ and run make.

7) Copy the newly built mx_stream executable to /opt/rtcds/caltech/c1/target/fb/

  17074   Wed Aug 10 20:51:14 2022   Tega   Update   Computers   CDS upgrade Front-end machine setup

Here is a summary of what needs doing following the chat with Jamie today.

 

Jamie brought over the KVM switch shown in the attachment and I tested all 16 ports and 7 cables and can confirm that they all work as expected.

 

TODO

1. Do a rack space budget to get a clear picture of how many front-ends we can fit into the new rack

2. Look into what needs doing and how much effort would be needed to clear rack 1X7 and use that instead of the new rack. The power down on Friday would present a good opportunity to do this work on Monday, so get the info ready before then. 

3. Start mounting front-ends, KVM and dolphin network switch

4. Add the BOX rack layout to the CDS upgrade page.

Attachment 1: IMG_20220810_171002928.jpg
Attachment 2: IMG_20220810_171019633.jpg
  6540   Tue Apr 17 11:05:04 2012   Jamie   Update   CDS   CDS upgrade in progress

I am continuing to attempt to upgrade the CDS system to RTS 2.5.  Systems will continue to be up and down for the rest of the day.

  6541   Tue Apr 17 19:03:09 2012   Jamie   Update   CDS   CDS upgrade in progress

Upgrade progresses, but not complete.  There are some relatively minor issues, and one potentially big issue.

All new software has been installed, including the new epics that supports long channel names.

I've been doing a LOT of cleanup.  It was REALLY messy in there.

The new framebuilder/daqd code is running on fb.

Models are compiling with the new RCG and I am able to get them running.  Some of them are not compiling for relatively minor reasons (the simulink models need updating).  I'm also running into compile problems with IOPs that are using the dolphin drivers.

The major issue is that the framebuilder and the models are not syncing their timing, so there's no data collection.  I've spoken to Alex and he and Rolf are going to come over tomorrow to sort it out.  It's possible that we're missing timing hardware that the new code is expecting.

There are still some stability issues I haven't sorted out yet, and I have a lot more cleanup to do.

At this rate I'm going to shoot for being done Thursday.

  11390   Wed Jul 1 19:16:21 2015   Jamie   Summary   CDS   CDS upgrade in progress

The CDS upgrade is now underway

Here's what's happened so far:

  • Installed and linked in all the RTS supporting software packages in /opt/rtapps (only on front end machines and fb):
    controls@c1lsc ~ 2$ find /opt/rtapps/ -mindepth 1 -maxdepth 1 -type l -ls
    12582916    0 lrwxrwxrwx   1 controls 1001           12 Jul  1 13:16 /opt/rtapps/gds -> gds-2.16.3.2
    12603452    0 lrwxrwxrwx   1 controls 1001           10 Jul  1 13:17 /opt/rtapps/fftw -> fftw-3.3.2
    12603451    0 lrwxrwxrwx   1 controls 1001           15 Jul  1 13:16 /opt/rtapps/libframe -> libframe-8.17.2
    12603450    0 lrwxrwxrwx   1 controls 1001           13 Jul  1 13:16 /opt/rtapps/libmetaio -> libmetaio-8.2
    12582915    0 lrwxrwxrwx   1 controls 1001           34 Jul  1 15:24 /opt/rtapps/framecpp -> ldas-tools-1.19.32-p1/linux-x86_64
    12582914    0 lrwxrwxrwx   1 controls 1001           20 Jul  1 13:15 /opt/rtapps/epics -> epics-3.14.12.2_long
  • Checked out the RTS source for the version we'll be using: 2.9.4

/opt/rtcds/rtscore/tags/advLigoRTS-2.9.4

  • built and installed all of the RTS components:
    • mbuf
    • mx_stream
    • daqd
    • nds
    • awgtpman
       
  • mx_stream is not working. Unknown why. It won't start on the front end machines (only tested on c1lsc so far) with the following error:
    controls@c1lsc ~ 1$ /opt/rtcds/caltech/c1/target/fb/mx_stream -s c1x04 c1lsc c1ass c1oaf c1cal -d fb:0
    mmapped address is 0x7ff7b71a0000
    send len = 263596
    mx_connect failed Remote Endpoint is Closed
    controls@c1lsc ~ 1$
    
    Have contacted Keith T. and Rolf B. for backup.  This is a blocker, since this is what ferries the data from the front ends.
     
  • Rebuilt almost all models.  This was good.  Initially nothing would compile because of IPC creation errors, so I moved the old chans/ipc/C1.ipc file out of the way and generated a new one and then everything compiled (of course senders have to be compiled before receivers).
    I only had to fix a couple of things in the models themselves:
    • c1ioo - unterminated FiltCtrl inputs
    • C1_SUS_SINGLE_CONTROL - unterminated FiltCtrl inputs
    • c1oaf - bad part named "STATIC". There is some hacky namespace stuff going on in the RCG. I was able to just explode that part and it now works.
    • c1lsc - unterminated FiltCtrl inputs
    Haven't installed or tried to run anything yet, but the fact they compile is good.
    Some models are not compiling because they have C code in src blocks that are throwing errors:
    • c1lsc
    • c1cal
    It shouldn't be too hard to fix whatever is causing those compile errors.

That's it for today.  Will pick up again first thing tomorrow

  6552   Fri Apr 20 19:54:57 2012   Jamie   Update   CDS   CDS upgrade problems

I ran into a couple of snags today.

A big one is that the framebuilder daqd started going haywire when I told it to start writing frames.  After restarting it, the logs started showing this:

[Fri Apr 20 17:23:40 2012] main profiler warning: 0 empty blocks in the buffer
[Fri Apr 20 17:23:41 2012] main profiler warning: 0 empty blocks in the buffer
[Fri Apr 20 17:23:42 2012] main profiler warning: 0 empty blocks in the buffer
[Fri Apr 20 17:23:43 2012] main profiler warning: 0 empty blocks in the buffer
[Fri Apr 20 17:23:44 2012] main profiler warning: 0 empty blocks in the buffer
[Fri Apr 20 17:23:45 2012] main profiler warning: 0 empty blocks in the buffer
GPS time jumped from 1019002442 to 1019003041
FATAL: exception not rethrown
FATAL: exception not rethrown
FATAL: exception not rethrown

and the network seemed like it started to get really slow.  I wasn't able to figure out what was going on, so I shut the frame writing off again.  I'll have to work with Rolf on that next week.

Another big problem is the workstation application upgrades.  The NDS protocol version has been incremented, which means that all the NDS client applications have to be upgraded.  The new dataviewer is working fine (on pianosa), but dtt is not:

controls@pianosa:~ 0$ diaggui
diaggui: symbol lookup error: /ligo/apps/linux-x86_64/gds-2.15.1/lib/libligogui.so.0: undefined symbol: _ZN18TGScrollBarElement11ShowMembersER16TMemberInspector
controls@pianosa:~ 127$ 

I don't know what's going on here.  All the library paths are ok.  Hopefully I'll be able to figure this out soon.  The old version of dtt definitely does not work with the new setup.
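
A couple of generic checks that might narrow this down (a sketch; these were not run in the original log):

# demangle the missing symbol to see which class it belongs to
echo '_ZN18TGScrollBarElement11ShowMembersER16TMemberInspector' | c++filt

# check whether any of diaggui's shared libraries are unresolved
# (if diaggui is a wrapper script, run ldd on the actual binary instead)
ldd $(which diaggui) | grep 'not found'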

I might go ahead and upgrade some more of the workstations to Ubuntu in the next couple of days as well, so everything is more on the same page.

I also tried to clean up the front-end boot process, which has its own problems (models won't auto-start).  I haven't figured that out yet either.  It really needs to just be completely overhauled.

  6546   Wed Apr 18 19:59:48 2012   Jamie   Update   CDS   CDS upgrade success

The upgrade is nearly complete:

  • new daqd code is running on fb
  • the fe/daqd timing issue was resolved by adjusting the GPS offset in the daqdrc.  I will document this more later.
  • the power outage conveniently rebooted all the front-end machines, so they're all now running new caRepeater
  • all models have been successfully recompiled with RCG 2.5 (with only a couple small glitches)
  • all new models are running on all front-end machines (with a couple exceptions)
  • all suspension models seem to be damping under local control (PRM is having troubles that are likely unrelated to the upgrade).
  • a lot of cleanup has been done

Remaining tasks/issues:

  • more testing OF EVERYTHING needs to be done
  • I did not yet update the DIS dolphin code, so we're running with the old code.  I don't think this is a problem, but it would be nice to get us running what they're running at the sites
  • I tried to cleanup/simplify how front-end initialization is done.  However, there is a problem and models are not auto-starting after reboot.  This needs to be fixed.
  • the userapps directory is in a new place (/opt/rtcds/userapps).  Not everything in the old location was checked into the repository, so we need to check to make sure everything that needs to be is checked in, and that all the models are running the right code.
  • the c1oaf model seems to be having a dolphin issue that needs to be sorted
  • the c1gfd model causes c1ioo to crash immediately upon being loaded.  I have removed it from the rtsystab.  That model needs to be fixed.
  • general model cleanup is in order.
  • more front-end cleanup is needed, particularly in regards to boot-up procedure.
  • document the entire upgrade procedure.

I'll finish up these remaining tasks tomorrow.

  16881   Fri May 27 17:46:48 2022   Paco   Summary   Computers   CDS upgrade visit, downfall and rise of c1lsc models

[Paco, Anchal-remote, Yuta, JC]

Sometime around noon today, right after the cds upgrade planning tour, the c1lsc FE went down. We thought this was ok because c1sus was still up anyway, but somehow the IFO alignment was compromised (this is in fact how we first noticed the loss). Yuta couldn't see REFL on the camera, nor on the AP table (!!), so somehow some or all of TT1, TT2, PRM got affected by this model stopping. We even tried kicking PRM slightly to try and see if the beam was nearby, with no success.

We decided to restart the models. To do this we first ssh'd into c1lsc, c1ioo and c1sus and stopped all models. During this step, c1ioo and c1sus dropped their connections, so we had to physically restart them. We then noticed a DC 0x4000 error in c1x04 (the c1lsc IOP) and, after checking, found that the gpstimes differed by 1 second. We then stopped the model again and, from fb1, restarted all daqd_* services, ran modprobe -r gpstime and modprobe gpstime, restarted c1lsc, and started the c1x04 model. This fixed the issue, so we finished restarting all FE models and burt-restored all the relevant snap files to today 02:19 AM PDT.
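
The recovery sequence written out as a sketch (commands as described above; the exact order and the need for sudo are assumptions):

# on fb1: restart the daqd services and reload the gpstime kernel module
sudo systemctl restart daqd_*
sudo modprobe -r gpstime
sudo modprobe gpstime

# then reboot c1lsc and bring its IOP back first
ssh c1lsc
sudo reboot
# once c1lsc is back up:
rtcds start c1x04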

This made the IFO recover its nominal alignment, minus the usual drift.

* The OAF model failed to start, but we left it like that for now.

  11397   Wed Jul 8 21:02:02 2015   Jamie   Summary   CDS   CDS upgrade: another step forward, so we're back to where we started (plus a bit?)

Koji did a bit of googling to determine that the 'Wrong Network' status message could be explained by the fb Myrinet card operating in the wrong mode:
(This was the useful link to track down the issue (KA))
 

    Network:    Myrinet 10G

I didn't notice it before, but we should in fact be operating in "Ethernet" mode, since that's the fabric we're using for the DC network.  Digging a bit deeper we found that the new version of mx (1.2.16) had indeed been configured with a different compile option than the 1.2.15 version had:

controls@fb ~ 0$ grep '$ ./configure' /opt/src/mx-1.2.15/config.log          
  $ ./configure --enable-ether-mode --prefix=/opt/mx
controls@fb ~ 0$ grep '$ ./configure' /opt/src/mx-1.2.16/config.log
  $ ./configure --enable-mx-wire --prefix=/opt/mx-1.2.16
controls@fb ~ 0$

So that would entirely explain the problem.  I re-linked mx to the older version (1.2.15), reloaded the mx drivers, and everything showed up correctly:

controls@fb ~ 0$ /opt/mx/bin/mx_info
MX Version: 1.2.12
MX Build: root@fb:/root/mx-1.2.12 Mon Nov  1 13:34:38 PDT 2010
1 Myrinet board installed.
The MX driver is configured to support a maximum of:
    8 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
===================================================================
Instance #0:  299.8 MHz LANai, PCI-E x8, 2 MB SRAM, on NUMA node 0
    Status:        Running, P0: Link Up
    Network:    Ethernet 10G

    MAC Address:    00:60:dd:46:ea:ec
    Product code:    10G-PCIE-8AL-S
    Part number:    09-03916
    Serial number:    352143
    Mapper:        00:60:dd:46:ea:ec, version = 0x00000000, configured
    Mapped hosts:    6

                                                        ROUTE COUNT
INDEX    MAC ADDRESS     HOST NAME                        P0
-----    -----------     ---------                        ---
   0) 00:60:dd:46:ea:ec fb:0                              1,0
   1) 00:25:90:0d:75:bb c1sus:0                           1,0
   2) 00:30:48:be:11:5d c1iscex:0                         1,0
   3) 00:30:48:d6:11:17 c1iscey:0                         1,0
   4) 00:30:48:bf:69:4f c1lsc:0                           1,0
   5) 00:14:4f:40:64:25 c1ioo:0                           1,0
controls@fb ~ 0$

The front end hosts are also showing good omx info (as they had been previously):

controls@c1lsc ~ 0$ /opt/open-mx/bin/omx_info
Open-MX version 1.5.2
 build: controls@fb:/opt/src/open-mx-1.5.2 Tue May 21 11:03:54 PDT 2013

Found 1 boards (32 max) supporting 32 endpoints each:
 c1lsc:0 (board #0 name eth1 addr 00:30:48:bf:69:4f)
   managed by driver 'igb'

Peer table is ready, mapper is 00:30:48:d6:11:17
================================================
  0) 00:30:48:bf:69:4f c1lsc:0
  1) 00:60:dd:46:ea:ec fb:0
  2) 00:25:90:0d:75:bb c1sus:0
  3) 00:30:48:be:11:5d c1iscex:0
  4) 00:30:48:d6:11:17 c1iscey:0
  5) 00:14:4f:40:64:25 c1ioo:0
controls@c1lsc ~ 0$

This got all the mx_stream connections back up and running.

Unfortunately, daqd is back to being a bit flaky.  With all frame writing enabled we saw daqd crash again.  I then shut off all trend frame writing and we're back to a marginally stable state: we have data flowing from all front ends, and full frames are being written, but not trends.

I'll pick up on this again tomorrow, and maybe try to rebuild the new version of mx with the proper flags.
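
If we do rebuild it, a sketch of what that would look like (the configure flag is taken from the 1.2.15 build shown above; everything else is an assumption):

cd /opt/src/mx-1.2.16
./configure --enable-ether-mode --prefix=/opt/mx-1.2.16
make && sudo make install
# then re-point the /opt/mx symlink at the new build and reload the mx drivers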

  11402   Mon Jul 13 01:11:14 2015   Jamie   Summary   CDS   CDS upgrade: current assessment

daqd is still behaving unstably.  It's still unclear what the issue is.

The current failures look like disk IO contention.  However, it's hard to see any evidence that daqd is suffering from large IO wait while it's failing.

The frame size itself is currently smaller than it was before the upgrade:

controls@fb /frames/full 0$ ls -alth 11190 | head
total 369G
drwxr-xr-x 321 controls controls  36K Jul 12 22:20 ..
drwxr-xr-x   2 controls controls 268K Jun 23 06:06 .
-rw-r--r--   1 controls controls  67M Jun 23 06:06 C-R-1119099984-16.gwf
-rw-r--r--   1 controls controls  68M Jun 23 06:06 C-R-1119099968-16.gwf
-rw-r--r--   1 controls controls  69M Jun 23 06:05 C-R-1119099952-16.gwf
-rw-r--r--   1 controls controls  69M Jun 23 06:05 C-R-1119099936-16.gwf
-rw-r--r--   1 controls controls  67M Jun 23 06:05 C-R-1119099920-16.gwf
-rw-r--r--   1 controls controls  68M Jun 23 06:05 C-R-1119099904-16.gwf
-rw-r--r--   1 controls controls  68M Jun 23 06:04 C-R-1119099888-16.gwf
controls@fb /frames/full 0$ ls -alth 11208 | head
total 17G
drwxr-xr-x   2 controls controls  20K Jul 13 01:00 .
-rw-r--r--   1 controls controls  45M Jul 13 01:00 C-R-1120809632-16.gwf
-rw-r--r--   1 controls controls  50M Jul 13 01:00 C-R-1120809408-16.gwf
-rw-r--r--   1 controls controls  50M Jul 13 00:56 C-R-1120809392-16.gwf
-rw-r--r--   1 controls controls  50M Jul 13 00:56 C-R-1120809376-16.gwf
-rw-r--r--   1 controls controls  50M Jul 13 00:56 C-R-1120809360-16.gwf
-rw-r--r--   1 controls controls  50M Jul 13 00:55 C-R-1120809344-16.gwf
-rw-r--r--   1 controls controls  50M Jul 13 00:55 C-R-1120809328-16.gwf
controls@fb /frames/full 0$

This would seem to indicate that it's not an increase in frame size that's to blame.
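
For the record, the kind of one-liner used to get the average frame size in a given directory (a sketch; directory name from the listing above):

ls -l /frames/full/11208/C-R-*.gwf | awk '{s+=$5; n++} END {print s/n/1e6, "MB average"}'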

Because slow data is now transported to daqd over the MX data concentrator network rather than via EPICS (RTS 2.8), there is more traffic on the MX network.   I note also that the channel lists have increased in size:

controls@fb /opt/rtcds/caltech/c1/chans/daq 0$ ls -alt archive/C1LSC* | head -20
-rw-r--r-- 1 4294967294 4294967294 262554 Jul  6 18:21 archive/C1LSC_150706_182146.ini
-rw-r--r-- 1 4294967294 4294967294 262554 Jul  6 18:16 archive/C1LSC_150706_181603.ini
-rw-r--r-- 1 4294967294 4294967294 262554 Jul  6 16:09 archive/C1LSC_150706_160946.ini
-rw-r--r-- 1 4294967294 4294967294  43366 Jul  1 16:05 archive/C1LSC_150701_160519.ini
-rw-r--r-- 1 4294967294 4294967294  43366 Jun 25 15:47 archive/C1LSC_150625_154739.ini
...

I would have thought, though, that data transmission errors would show up in the daqd status bits.

  11427   Sat Jul 18 15:37:19 2015   Jamie   Summary   CDS   CDS upgrade: current status

So it appears we have found a semi-stable configuration for the DAQ system post upgrade:

Here are the issues:

daqd

daqd is running mostly stably for the moment, although it still crashes at the top of every hour (see below).  Here are some relevant points about the current configuration:

  • recording data from only a subset of front-ends, to reduce the overall load:
    • c1x01
    • c1scx
    • c1x02
    • c1sus
    • c1mcs
    • c1pem
    • c1x04
    • c1lsc
    • c1ass
    • c1x05
    • c1scy
  • 16 second main buffer:
    start main 16;
  • trend lengths: second: 600, minute: 60
    start trender 600 60;
  • writing to frames:
    • full
    • second
    • minute
    • (NOT raw minute trends)
  • frame compression ON

This eliminates most of the random daqd crashing.  However, daqd still crashes at the top of every hour after writing out the minute trend frame. It is still unclear what the issue is, but Keith is investigating.  In some sense this is no worse than where we were before the upgrade, since daqd was also crashing hourly then.  It's still crappy, though, so hopefully we'll figure something out.

The inittab on fb automatically restarts daqd after it crashes, and monit on all of the front ends automatically restarts the mx_stream processes.

front ends

The front end modules are mostly running fine.

One issue is that the execution times seem to have increased a bit, which is problematic for models that were already on the hairy edge.  For instance, the rough average for c1sus has gone from ~48us to ~50us.  This is most problematic for c1cal, which is now running at ~66us out of 60us, which is obviously untenable.  We'll need to reduce the load in c1cal somehow.

All other front end models seem to be working fine, but a full test is still needed.

There was an issue with the DACs on c1sus, but I rebooted and everything came up fine; the optics are now damped.

  11400   Thu Jul 9 16:50:13 2015   Jamie   Summary   CDS   CDS upgrade: if all else fails try throwing metal at the problem

I roped Rolf into coming over and adding his eyes to the problem.  After much discussion we couldn't come up with any reasonable explanation for the problems we've been seeing, other than daqd just needing a lot more resources than it did before.  He said he had some old Sun SunFire X4600s from which we could pilfer memory.  I went over to Downs and ripped all the CPU/memory cards out of one of his machines and stuffed them into fb:

fb now has 8 CPU and 16G of RAM

Unfortunately, this is still not enough.  Or at least it didn't solve the problem; daqd is showing the same instabilities, falling over a couple of minutes after I turn on trend frame writing.  As always, before daqd fails it starts spitting out the following to the logs:

[Thu Jul  9 16:37:09 2015] main profiler warning: 0 empty blocks in the buffer

followed by lines like:

[Thu Jul  9 16:37:27 2015] GPS MISS dcu 44 (ASX); dcu_gps=1120520264 gps=1120519812

right before it dies.

I'm no longer convinced that this is a resource issue, though, judging by the resource usage right before the crash:

top - 16:47:32 up 48 min,  5 users,  load average: 0.91, 0.62, 0.61
Tasks:   2 total,   0 running,   2 sleeping,   0 stopped,   0 zombie
Cpu(s):  8.9%us,  0.9%sy,  0.0%ni, 89.1%id,  0.9%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:  15952104k total, 13063468k used,  2888636k free,   138648k buffers
Swap:  1023996k total,        0k used,  1023996k free,  7672292k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12016 controls  20   0 8098m 4.4g 104m S  106 29.1   6:45.79 daqd
 4953 controls  20   0 53580 6092 5096 S    0  0.0   0:00.04 nds

Load average less than 1 per CPU, plenty of free memory (~3G free, 0 swap), no waiting for IO (0.9%wa), etc.  daqd is utilizing lots of threads, which should be spread across many CPUs, so even the >100% CPU figure should be OK.  I'm at a loss...
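
One quick way to confirm that daqd's threads really are spreading across cores, rather than one thread pegging a single CPU, is a per-thread view with the standard procps tools (nothing 40m-specific here):

# interactive per-thread view of daqd
top -H -p $(pgrep -x daqd)

# one-shot listing of each daqd thread, the core it last ran on, and its CPU usage
ps -L -o pid,tid,psr,pcpu,comm -p $(pgrep -x daqd)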

  11404   Mon Jul 13 18:12:50 2015 JamieSummaryCDSCDS upgrade: left running in semi-stable configuration

I have been watching daqd all day and I don't feel particularly closer to understanding what the issues are.  However, things are at least semi-stable at the moment.

Interestingly, though, the stability appears highly variable at the moment.  This morning, daqd was very unstable and was crashing within a couple of minutes of starting.  However, this afternoon things seemed much more stable.  As of this moment, daqd has been running for 25 minutes now, writing full frames as well as minute and second trends (no minute_raw), without any issues.  What has changed?

To reiterate, I have been closely watching disk IO to /frames.  I see no indication that there is any disk contention while daqd is failing.  It's still possible, though, that there are disk IO issues affecting daqd at a level that is not readily visible.  From dstat, the frame writes are visible, but nothing else.

I have made one change that could be positively affecting things right now: I un-exported /frames from NFS.  This eliminates anything external from reading /frames over the network.  In particular, it also shuts off the transfer of frames to LDAS.  Since I've done this, daqd has appeared to be more stable.  It's NOT totally stable, though, as the instance that I described above did eventually just die after 43 minutes, as I was writing this.
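
For the record, un-exporting /frames amounts to something like the following on fb (a sketch; the actual /etc/exports entry isn't reproduced here):

# comment out the /frames line in /etc/exports, then re-read the export table
sudo exportfs -ra

# confirm that /frames is no longer being exported
sudo exportfs -v | grep frames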

In any event, as things are currently as stable as I've seen them, I'm leaving it running in this configuration for the moment, with the following relevant daqdrc parameters:

start main 16;
start frame-saver;
sync frame-saver;
start trender 60 60;
start trend-frame-saver;
sync trend-frame-saver;
start minute-trend-frame-saver;
sync minute-trend-frame-saver;
start profiler;
start trend profiler;
  11406   Tue Jul 14 09:08:37 2015 JamieSummaryCDSCDS upgrade: left running in semi-stable configuration

Overnight daqd restarted itself only about twice an hour, which is an improvement:

controls@fb /opt/rtcds/caltech/c1/target/fb 0$ tail logs/restart.log
daqd: Tue Jul 14 03:13:50 PDT 2015
daqd: Tue Jul 14 04:01:39 PDT 2015
daqd: Tue Jul 14 04:09:57 PDT 2015
daqd: Tue Jul 14 05:02:46 PDT 2015
daqd: Tue Jul 14 06:01:57 PDT 2015
daqd: Tue Jul 14 06:43:18 PDT 2015
daqd: Tue Jul 14 07:02:19 PDT 2015
daqd: Tue Jul 14 07:58:16 PDT 2015
daqd: Tue Jul 14 08:02:44 PDT 2015
daqd: Tue Jul 14 09:02:24 PDT 2015

Un-exporting /frames might have helped a bit.  However, the problem is obviously still not fixed.
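
For future reference, the restart rate per hour can be tallied straight from restart.log with a one-liner (a sketch assuming the timestamp format shown above):

# count daqd restarts per hour
awk '{split($5, t, ":"); print $2, $3, $4, t[1]":00"}' logs/restart.log | sort | uniq -c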

  11408   Tue Jul 14 10:28:02 2015 ericqSummaryCDSCDS upgrade: left running in semi-stable configuration

There remains a pattern to some of the restarts, the following times are all reported as restart times. (There are others in between, however.)

daqd: Tue Jul 14 00:02:48 PDT 2015
daqd: Tue Jul 14 01:02:32 PDT 2015
daqd: Tue Jul 14 03:02:33 PDT 2015
daqd: Tue Jul 14 05:02:46 PDT 2015
daqd: Tue Jul 14 06:01:57 PDT 2015
daqd: Tue Jul 14 07:02:19 PDT 2015
daqd: Tue Jul 14 08:02:44 PDT 2015
daqd: Tue Jul 14 09:02:24 PDT 2015
daqd: Tue Jul 14 10:02:03 PDT 2015

Before the upgrade, we suffered from hourly crashes too:

daqd_start Sun Jun 21 00:01:06 PDT 2015
daqd_start Sun Jun 21 01:03:47 PDT 2015
daqd_start Sun Jun 21 02:04:04 PDT 2015
daqd_start Sun Jun 21 03:04:35 PDT 2015
daqd_start Sun Jun 21 04:04:04 PDT 2015
daqd_start Sun Jun 21 05:03:45 PDT 2015
daqd_start Sun Jun 21 06:02:43 PDT 2015
daqd_start Sun Jun 21 07:04:42 PDT 2015
daqd_start Sun Jun 21 08:04:34 PDT 2015
daqd_start Sun Jun 21 09:03:30 PDT 2015
daqd_start Sun Jun 21 10:04:11 PDT 2015

So, this isn't necessarily new behavior, just something that remains unfixed. 

  11409   Tue Jul 14 11:57:27 2015 jamieSummaryCDSCDS upgrade: left running in semi-stable configuration
Quote:

There remains a pattern to some of the restarts, the following times are all reported as restart times. (There are others in between, however.)

daqd: Tue Jul 14 00:02:48 PDT 2015
daqd: Tue Jul 14 01:02:32 PDT 2015
daqd: Tue Jul 14 03:02:33 PDT 2015
daqd: Tue Jul 14 05:02:46 PDT 2015
daqd: Tue Jul 14 06:01:57 PDT 2015
daqd: Tue Jul 14 07:02:19 PDT 2015
daqd: Tue Jul 14 08:02:44 PDT 2015
daqd: Tue Jul 14 09:02:24 PDT 2015
daqd: Tue Jul 14 10:02:03 PDT 2015

Before the upgrade, we suffered from hourly crashes too:

daqd_start Sun Jun 21 00:01:06 PDT 2015
daqd_start Sun Jun 21 01:03:47 PDT 2015
daqd_start Sun Jun 21 02:04:04 PDT 2015
daqd_start Sun Jun 21 03:04:35 PDT 2015
daqd_start Sun Jun 21 04:04:04 PDT 2015
daqd_start Sun Jun 21 05:03:45 PDT 2015
daqd_start Sun Jun 21 06:02:43 PDT 2015
daqd_start Sun Jun 21 07:04:42 PDT 2015
daqd_start Sun Jun 21 08:04:34 PDT 2015
daqd_start Sun Jun 21 09:03:30 PDT 2015
daqd_start Sun Jun 21 10:04:11 PDT 2015

So, this isn't necessarily new behavior, just something that remains unfixed. 

It's interesting that we're still seeing those hourly crashes.

We're not writing out the full set of channels, though, and we're getting more failures than just those at the hour, so we're still suffering.

  11398   Thu Jul 9 13:26:47 2015 JamieSummaryCDSCDS upgrade: new mx 1.2.16 installed

I rebuilt/installed mx 1.2.16 to use "ether-mode", instead of the default MX-10G:

controls@fb /opt/src/mx-1.2.16 0$ ./configure --enable-ether-mode --prefix=/opt/mx-1.2.16
...
controls@fb /opt/src/mx-1.2.16 0$ make
..
controls@fb /opt/src/mx-1.2.16 0$ make install
...

I then rebuilt/installed daqd so that it properly linked against the updated mx install:

controls@fb /opt/rtcds/rtscore/release/src/daqd 0$ ./configure --enable-debug --disable-broadcast --without-myrinet --with-mx --with-epics=/opt/rtapps/epics/base --with-framecpp=/opt/rtapps/framecpp --enable-local-timing
...
controls@fb /opt/rtcds/rtscore/release/src/daqd 0$ make
...
controls@fb /opt/rtcds/rtscore/release/src/daqd 0$ install daqd /opt/rtcds/caltech/c1/target/fb/

It's now back to running and receiving data from the front ends (still not stable yet, though).
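
One sanity check worth noting: if daqd links the MX library dynamically, you can confirm that the rebuilt binary actually picked up the new install (a sketch; if MX is linked statically this shows nothing):

ldd /opt/rtcds/caltech/c1/target/fb/daqd | grep -i mx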

  11396   Wed Jul 8 20:37:02 2015 JamieSummaryCDSCDS upgrade: one step forward, two steps back

After determining yesterday that all the daqd issues were coming from the frame writing, I started to dig into it more today.  I also spoke to Keith Thorne, and got some good suggestions from Gerrit Kuhn at GEO.

I realized that it probably wasn't the trend writing per se, but that turning on more writing to disk was causing increased load on daqd, and consequently on the system itself.  With more frame writing turned on, the memory consumption increased to the point of maxing out the physical RAM.  The system then probably started swapping, which certainly would have choked daqd.

I noticed that fb only had 4G of RAM, which Keith suggested was just not enough.  Even if the memory consumption of daqd has increased significantly, it still seems like 4G would not be enough.  I opened up fb only to find that fb actually had 8G of RAM installed!  Not sure what happened to the other 4G, but somehow it was not visible to the system.  Koji and I eventually determined, via some frankenstein operations with megatron, that the RAM was just dead.  We then pulled 4G of RAM from megatron and replaced the bad RAM in fb, so that fb now has a full 8G of RAM.
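
For anyone repeating this kind of diagnosis, the installed-vs-visible RAM mismatch can be spotted without opening the box (a sketch using standard tools):

# what the OS actually sees
free -g

# what the DMI/BIOS tables report per DIMM slot (empty slots typically read "No Module Installed")
sudo dmidecode --type memory | grep -i -E 'size|locator'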

Unfortunately, when we got fb fully back up and running, we found that fb is not able to see any of the other hosts on the data concentrator network.  mx_info, which displays the card and network status for the Myricom Myrinet fiber card, shows:

MX Version: 1.2.16
MX Build: controls@fb:/opt/src/mx-1.2.16 Tue May 21 10:58:40 PDT 2013
1 Myrinet board installed.
The MX driver is configured to support a maximum of:
    8 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
===================================================================
Instance #0:  299.8 MHz LANai, PCI-E x8, 2 MB SRAM, on NUMA node 0
    Status:        Running, P0: Wrong Network
    Network:    Myrinet 10G

    MAC Address:    00:60:dd:46:ea:ec
    Product code:    10G-PCIE-8AL-S
    Part number:    09-03916
    Serial number:    352143
    Mapper:        00:60:dd:46:ea:ec, version = 0x63e745ee, configured
    Mapped hosts:    1

                                                        ROUTE COUNT
INDEX    MAC ADDRESS     HOST NAME                        P0
-----    -----------     ---------                        ---
   0) 00:60:dd:46:ea:ec fb:0                            D 0,0

Note that all front end machines should be listed in the table at the bottom, and they're not.  Also note the "Wrong Network" flag in the Status line above.  It appears that the card may have been initialized in a bad state?  Or perhaps Koji and I somehow disturbed the network when we were cleaning up things in the rack.  "sudo /etc/init.d/mx restart" on fb doesn't solve the problem.  We even rebooted fb and it didn't seem to help.

In any event, we're back to no data flow.  I'll pick up again tomorrow.

  11412   Tue Jul 14 16:51:01 2015 JamieSummaryCDSCDS upgrade: problem is not disk access

I think I have now determined once and for all that the daqd problems are NOT due to disk IO contention.

I have mounted a tmpfs at /frames/tmp and have told daqd to write frames there.  The tmpfs exists entirely in RAM.  There is essentially zero IO wait for such a filesystem, so daqd should never have trouble writing out the frames.
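
The tmpfs setup itself is just a couple of commands (a sketch; the size is illustrative, and the daqdrc frame-directory parameters that were repointed at /frames/tmp aren't reproduced here):

sudo mkdir -p /frames/tmp
sudo mount -t tmpfs -o size=8g tmpfs /frames/tmp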

Yet daqd continues to fail with the "0 empty blocks in the buffer" warnings.  I've been down a rabbit hole.

  11393   Tue Jul 7 18:27:54 2015 JamieSummaryCDSCDS upgrade: progress!

After a couple of days of struggle, I made some progress on the CDS upgrade today:

Front end status:

  • RTS upgraded to 2.9.4, and linked in as "release":

/opt/rtcds/rtscore/release -> tags/advLigoRTS-2.9.4

  • mbuf kernel module built and installed
  • All front ends have been rebooted with the latest patched kernel (from 2.6 upgrade)
  • All models have been rebuilt, installed, restarted.  Only minor model issues had to be corrected (unterminated unused inputs mostly).
  • awgtpman rebuilt, and installed/running on all front-ends
  • open-mx upgraded to 1.5.2:

/opt/open-mx -> open-mx-1.5.2

  • All front ends running latest version of mx_stream, built against 2.9.4 and open-mx-1.5.2.

We have new GDS overview screens for the front end models.

It's possible that our current lack of IRIG-B GPS distribution means that the 'TIM' status bit will always be red on the IOP models.  Will consult with Rolf.

There are other new features in the front ends that I can get into later.

DAQ (fb) status:

  • daqd and nds rebuilt against 2.9.4, both now running on fb

40m daqd compile flags:

cd src/daqd
./configure --enable-debug --disable-broadcast --without-myrinet --with-mx --enable-local-timing --with-epics=/opt/rtapps/epics/base --with-framecpp=/opt/rtapps/framecpp
make
make clean
install daqd /opt/rtcds/caltech/c1/target/fb/

However, daqd has unfortunately been very unstable, and I've been trying to figure out why.  I originally thought it was some sort of timing issue, but now I'm not so sure.

I had to make the following changes to the daqdrc:

set gps_leaps = 820108813 914803214 1119744016;

That enumerates some list of leap seconds since some time.  Not sure if that actually does anything, but I added the latest leap seconds anyway:

set symm_gps_offset=315964803;

This updates the silly, arbitrary GPS offset, which is required to be correct when not using an external GPS reference.
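
For what it's worth, that number is essentially the Unix timestamp of the GPS epoch (1980-01-06 00:00:00 UTC).  The nominal value can be sanity-checked with GNU date; why the daqdrc value is a few seconds larger than 315964800 I won't try to explain here:

date -u -d @315964800
# -> Sun Jan  6 00:00:00 UTC 1980, i.e. the GPS epoch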

Finally, the thing that actually got it running stably was to turn off all trend frame writing:

# start trender;
# start trend-frame-saver;
# sync trend-frame-saver;
# start minute-trend-frame-saver;
# sync minute-trend-frame-saver;
# start raw_minute_trend_saver;

For whatever reason, it's the trend frame writing that was causing daqd to fall over after a short amount of time.  I'll continue investigating tomorrow.

 

We still have a lot of cleanup, burt restores, testing, etc. to do, but we're getting there.

  11415   Wed Jul 15 13:19:14 2015 JamieSummaryCDSCDS upgrade: reducing mx end-points as last ditch effort

I tried one last thing, suggested by Keith and Gerrit: reducing the number of mx end-points on fb to one, which should reduce the total number of fb threads, in the hope that the extra threads were causing the chokes.

On Tue, Jul 14 2015, Keith Thorne <kthorne@ligo-la.caltech.edu> wrote:
> Assumptions
>  1) Before the upgrade (from RCG 2.6?), the DAQ had been working, reading out front-ends, writing frames trends
>  2) In upgrading to RCG 2.9, the mx start-up on the frame builder was modified to use multiple end-points
> (i.e. /etc/init.d/mx has a line like
> # 1 10G card - X2
> MX_MODULE_PARAMS="mx_max_instance=1 mx_max_endpoints=16 $MX_MODULE_PARAMS"
>  (This can be confirmed by the daqd log file with lines at the top like
> 263596
> MX has 16 maximum end-points configured
> 2 MX NICs available
> [Fri Jul 10 16:12:50 2015] ->4: set thread_stack_size=10240
> [Fri Jul 10 16:12:50 2015] new threads will be created with the stack of size 10240K
>
>
> If this is the case, the problem may be that the additional thread on the frame-builder (one per end-point) take up so many slots on the 8-core
> frame-builder that they interrupt the frame-writing thread, thus preventing the main buffer from being emptied.  
>
> One could go back to a single end-point. This only helps keep restart of front-end A from hiccuping DAQ for front-end B.
>
> You would have to remove code on front-ends (/etc/init.d/mx_stream) that chooses endpoints. i.e.
> # find line number in rtsystab. Use that to mx_stream slot on card (0-15)
> line_num=`grep -v ^# /etc/rtsystab | grep --perl-regexp -n "^${hostname}\s" | sed 's/^\([0-9]*\):.*/\1/g'`
> line_off=$(expr $line_num - 1)
> epnum=$(expr $line_off % 2)
> cnum=$(expr $line_off / 2)
>
>     start-stop-daemon --start --quiet -b -m --pidfile /var/log/mx_stream0.pid --exec /opt/rtcds/tst/x2/target/x2daqdc0/mx_stream -- -e 0 -r "$epnum" -W 0 -w 0 -s "$sys" -d x2daqdc0:$cnum -l /opt/rtcds/tst/x2/target/x2daqdc0/mx_stream_logs/$hostname.log

As per Keith's suggestion, I modified the mx startup script to initialize only a single endpoint, and I modified the mx_stream startup to point them all to endpoint 0.  I verified that daqd was indeed using a single MX end-point:

MX has 1 maximum end-points configured
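
Concretely, the modifications were roughly of the following form (a sketch based on Keith's snippet above, not the exact diffs that were applied):

# /etc/init.d/mx on fb: allow only a single endpoint
MX_MODULE_PARAMS="mx_max_instance=1 mx_max_endpoints=1 $MX_MODULE_PARAMS"

# /etc/init.d/mx_stream on each front end: skip the per-host endpoint arithmetic
# and hand every mx_stream endpoint 0
epnum=0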

It didn't help.  After 5-10 minutes daqd crashes with the same "0 empty blocks" messages.

I should also mention that the onset of these messages does not appear to be coincident with any frame writing to disk; further evidence that it's not a disk IO issue.

Keith is looking at the system now, so we'll see if he can spot anything obvious.  If not, I will start reverting to 2.5.

  11417   Wed Jul 15 18:19:12 2015 JamieSummaryCDSCDS upgrade: tentative stabilty?

Keith Thorne provided his eyes on the situation today and had some suggestions that might have helped things.

Reorder ini file list in master file.  Apparently the EDCU.ini file (C0EDCU.ini in our case), which describes EPICS subscriptions to be recorded by the daq, now has to be specified *after* all other front end ini files.  It's unclear why, but it has something to do with RTS 2.8 which changed all slow channels to be transported over the mx network.  This alone did not fix the problem, though.
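
In practice this just means moving the EDCU entry to the end of the daqd master file; something like the following (illustrative paths, not the verbatim 40m master file):

# /opt/rtcds/caltech/c1/target/fb/master -- order matters
/opt/rtcds/caltech/c1/chans/daq/C1X01.ini
/opt/rtcds/caltech/c1/chans/daq/C1SUS.ini
/opt/rtcds/caltech/c1/chans/daq/C1LSC.ini
...
# EDCU file listed last
/opt/rtcds/caltech/c1/chans/daq/C0EDCU.ini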

Increase second trend frame size.  Interestingly, this might have been the key.  The second trend frame size was increased to 600 seconds:

start trender 600 60;

The two numbers are the lengths in seconds for the second and minute trends respectively.  They had been set to "60 60", but Keith suggested that longer second trend frames are better, for whatever reason.  It seems he may be right, given that daqd has been running and writing full and trend frames for 1.5 hours now without issue. 


As I'm writing this, though, daqd just crashed again.  I note that it died right after the hour, immediately following the write-out of a one-hour minute trend file.  We've been seeing these on-the-hour crashes of daqd for quite a while now, so maybe this is nothing new.  I've actually been wondering if the hourly daqd crashes are associated with writing out the minute trend frames, and I think we now have more evidence pointing to that.

If increasing the size of the second trend frames from 60 seconds (35M) to 600 seconds (70M) made a difference in stability, could there be an issue with writing out files that are smaller than some threshold?  The full frames are 60M, and the minute trends are 35M.

  8547   Tue May 7 23:03:12 2013 KojiConfigurationCDSCDS work

Summary:

c1rfm / c1lsc / c1ass / c1sus were modified, recompiled, and installed.  They are running fine,
and I confirmed PRMI locking (attempted), arm locking, and Y-arm ASS with the new code.

Motivation:

1a. The SQRTing switch for POP110 was inverted: 0 enabled sqrting and 1 disabled it. I wanted to fix this.
1b. Sqrting for POP22 was not implemented.

2. Preparation for the shadow sensor control with POPDC.

3. ASS had only one input. I want to run two ASS paths, one for each of the X and Y arms.

SQRTing for POP110/22:

- Flipped the input of the bypass switch. The corresponding MEDM indicators are fixed on the power normalization screen.
- Copied the sqrting structure from POP110 to POP22. A corresponding MEDM button was added on the power normalization screen.

- The function of the sqrting buttons was confirmed.

Additional ASS output:

- The output path "NPRO" was removed. The corresponding RFM channels have also been removed.
- The previous NPRO path was turned into the "ASS1" path. The previous "ASS" path was turned into "ASS2".
- The corresponding shared memory channels were created/renamed.
- c1ass was modified to receive the new ASS shared memory channels. ASS1 is assigned to the X arm; ASS2 is assigned to the Y arm.
- The output matrix screen and the lockin screen were modified accordingly.
- Only script/ASS/Arm_ASS_Setup.py was affected. The corresponding lines (matrix assignment) were fixed.

- The function of Den's version of  ASS was confirmed.

LSC->PRM ASC path

- We want to connect POPDC to PRM ASC. POPDC is acquired on c1lsc.
- So, for now we use the LSC input matrix to assign POPDC to one of the servo bank.
- The last row of the LSC output matrix was assigned to the PCIE connection to c1sus.
- This PCIE connection was connected to the PRM ASC YAW input.

- The connection between LSC and SUS was confirmed.

- During this process I found that there are a bunch of channels transferred from LSC to SUS via RFM.
  These channels are transferred via PCIe (dolphin) and then via RFM.  But LSC and SUS are connected
  directly with dolphin, so this just adds an additional sampling delay with no benefit.  I think we should remove the RFM part.
  Note that we do need RFM for the end mirrors, but those channels should also use only the RFM connection.


Rebuilding the codes

- Prior to the tests of the new functionality, the code was rebuilt/installed as usual (see the sketch after this list).
- The suspensions were shut down with the watchdogs before the restart of the realtime code.
- Once the realtime code was restarted successfully, the watchdogs were reloaded.
- As we removed/added the channels, fb was restarted.
- c1rfm / c1lsc / c1ass / c1sus codes were checked-in to svn
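
For reference, the rebuild/install cycle for one of these models looks roughly like this (a sketch using the standard RCG make targets; the build directory path is illustrative and the restart step depends on the RCG vintage):

cd /opt/rtcds/caltech/c1/rtbuild
make c1lsc
make install-c1lsc
# then, with the watchdogs shut down, restart the model on its front end,
# reload the watchdogs, and restart fb if channels were added or removed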
 

  11221   Wed Apr 15 20:54:18 2015 JenneUpdateComputer Scripts / ProgramsCDSutils upgrade bad

The SUS align/misalign scripts don't work after the new CDS utils upgrade. 

I don't know if it's looking for the _SWSTAT channel to confirm that the offset has been turned on/off, or if it is trying to set that channel, to do the switching, but either way, the script is failing.  Recall that our version of the RCG still has _SW1R and _SW2R, rather than the newer _SWSTAT for the filter banks. 

ezca.ezca.EzcaConnectError: Could not connect to channel (timeout=2s): C1:SUS-PRM_OL_PIT_SWSTAT
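
A quick way to confirm which switch channels the IOC actually serves, using the standard EPICS command-line tools and the channel names from the error above (a sketch):

# fails on our RCG version (no _SWSTAT record)
caget C1:SUS-PRM_OL_PIT_SWSTAT

# the older-style switch channels respond
caget C1:SUS-PRM_OL_PIT_SW1R C1:SUS-PRM_OL_PIT_SW2R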

Q, can you please (please, please, pretty please) undo this upgrade, and then hold off on any further changes to the system for a few weeks?

  11223   Wed Apr 15 23:29:08 2015 JenneUpdateComputer Scripts / ProgramsCDSutils upgrade undone

Q remotely reverted this change.  Scripts seem to work again.

Quote:

The SUS align/misalign scripts don't work after the new CDS utils upgrade. 

I don't know if it's looking for the _SWSTAT channel to confirm that the offset has been turned on/off, or if it is trying to set that channel, to do the switching, but either way, the script is failing.  Recall that our version of the RCG still has _SW1R and _SW2R, rather than the newer _SWSTAT for the filter banks. 

ezca.ezca.EzcaConnectError: Could not connect to channel (timeout=2s): C1:SUS-PRM_OL_PIT_SWSTAT

Q, can you please (please, please, pretty please) undo this upgrade, and then hold off on any further changes to the system for a few weeks?

 

  11240   Thu Apr 23 21:05:23 2015 ranaUpdateComputer Scripts / ProgramsCDSutils upgrade undone

Q: please update this Wiki page with the go-back procedure:

https://wiki-40m.ligo.caltech.edu/CDSutils_Upgrade_Procedure

  11220   Wed Apr 15 15:14:18 2015 ericqUpdateComputer Scripts / ProgramsCDSutils upgraded to v474

CDSutils has been updated to the newest version, 474; there are some matrix interface methods that will make our locking scripts easier to read, modify, and maintain.

I've tested the ALS and CARM down scripts, and the LSC offsets script, and they all work fine. 
