40m QIL Cryo_Lab CTN SUS_Lab TCS_Lab OMC_Lab CRIME_Lab FEA ENG_Labs OptContFac Mariner WBEEShop
  40m Log, Page 44 of 348  Not logged in ELOG logo
ID Date Author Type Categoryup Subject
  5896   Tue Nov 15 15:56:23 2011 jamieUpdateCDSdataviewer doesn't run

Quote:

Dataviewer is not able to access to fb somehow.

I restarted daqd on fb but it didn't help.

Also the status screen is showing a blank while form in all the realtime model. Something bad is happening.

 So something very strange was happening to the framebuilder (fb).  I logged on the fb and found this being spewed to the logs once a second:

[Tue Nov 15 15:28:51 2011] going down on signal 11
sh: /bin/gcore: No such file or directory
[Tue Nov 15 15:28:51 2011] going down on signal 11
sh: /bin/gcore: No such file or directory
[Tue Nov 15 15:28:51 2011] going down on signal 11
sh: /bin/gcore: No such file or directory
[Tue Nov 15 15:28:51 2011] going down on signal 11
sh: /bin/gcore: No such file or directory
[Tue Nov 15 15:28:51 2011] going down on signal 11
sh: /bin/gcore: No such file or directory
[Tue Nov 15 15:28:51 2011] going down on signal 11
sh: /bin/gcore: No such file or directory
[Tue Nov 15 15:28:51 2011] going down on signal 11
sh: /bin/gcore: No such file or directory
[Tue Nov 15 15:28:51 2011] going down on signal 11
sh: /bin/gcore: No such file or directory
[Tue Nov 15 15:28:51 2011] going down on signal 11
sh: /bin/gcore: No such file or directory

Apparently /bin/gcore was trying to be called by some daqd subprocess or thread, and was failing since that file doesn't exist.  This apparently started at around 5:52 AM last night:

[Tue Nov 15 05:46:52 2011] main profiler warning: 1 empty blocks in the buffer
[Tue Nov 15 05:46:53 2011] main profiler warning: 0 empty blocks in the buffer
[Tue Nov 15 05:46:54 2011] main profiler warning: 0 empty blocks in the buffer
[Tue Nov 15 05:46:55 2011] main profiler warning: 0 empty blocks in the buffer
[Tue Nov 15 05:46:56 2011] main profiler warning: 0 empty blocks in the buffer
...
[Tue Nov 15 05:52:43 2011] main profiler warning: 0 empty blocks in the buffer
[Tue Nov 15 05:52:44 2011] main profiler warning: 0 empty blocks in the buffer
[Tue Nov 15 05:52:45 2011] main profiler warning: 0 empty blocks in the buffer
GPS time jumped from 1005400026 to 1005400379
[Tue Nov 15 05:52:46 2011] going down on signal 11
sh: /bin/gcore: No such file or directory
[Tue Nov 15 05:52:46 2011] going down on signal 11
sh: /bin/gcore: No such file or directory

The gcore I believe it's looking for is a debugging tool that is able to retrieve images of running processes.  I'm guessing that something caused something int the fb to eat crap, and it was stuck trying to debug itself.  I can't tell what exactly happend, though.  I'll ping the CDS guys about it.  The daqd process was continuing to run, but it was not responding to anything, which is why it could not be restarted via the normal means, and maybe why the various FB0_*_STATUS channels were seemingly dead.

I manually killed the daqd process, and monit seemed to bring up a new process with no problem.  I'll keep an eye on it.

  5901   Tue Nov 15 23:44:44 2011 MirkoUpdateCDSC1:LSC & C1:SUS restarted

Earlier this evening C1:LSC died then I hit the DAQ reload after adding an OAF channel to be recorded. No change to any model. Had to restart C1:SUS too. Reloaded burts from this morning 5am, except for C1:IOO, which I loaded from 16:07.

  5938   Fri Nov 18 01:12:14 2011 SureshUpdateCDSMC1 LR dead for > 1 month; now revived temporarily

[Den, Mirko, Suresh]

    We were investigating why there is no correlation between MC1 osem signals and seismic motion.   During this we noticed a recurrence of this old problem of MC1_LR sensor being dead.  I went and pressed down the chip holders where the AA filters used to sit and which now hold the jumper wire.  The board is large and flexible it is quite likely some solder joint is broken on the MC1_LR path on this board.

   The signal came back to life and is okay now. But it can break off again any time.

 

 

Quote:

 Since the MC1 LRSEN channel is not wasn't working, my input matrix diagonalization hasn't worked today wasn't working. So I decided to fix it somehow.

I went to the rack and traced the signal: first at the LEMO monitor on the whitening card, secondly at the 4-pin LEMO cable which goes into the AA chassis.

The signal existed at the input to the AA chassis but not in the screen. So I pressed the jumper wire (used to be AA filter) down for the channel corresponding to the MC1 LRSEN channel.

It now has come back and looks like the other sensors. As you can see from this plot and Joe's entry from a couple weeks ago, this channel has been dead since May 17th.

The ELOG reveals that Kiwamu caught Steve doing some (un-elogged) fooling around there. Burnt Toast -> Steve.

bt.jpg

993190663   =      free swinging ringdown restarted again

 

  5971   Mon Nov 21 17:07:34 2011 MirkoUpdateCDSc1pem model dead

For some reason C1PEM doesn't seem to work anymore after a recompilation. It did recompile fine. We just changed some channel / subsystem names.

Tried reverting to the svn version. Doesn't work. Reboot C1SUS also no good.

  5973   Mon Nov 21 22:51:55 2011 MirkoUpdateCDSc1pem model dead

Quote:

For some reason C1PEM doesn't seem to work anymore after a recompilation. It did recompile fine. We just changed some channel / subsystem names.

Tried reverting to the svn version. Doesn't work. Reboot C1SUS also no good.

 It is fine again. Thanks Jamie.

  5979   Tue Nov 22 18:15:39 2011 jamieUpdateCDSc1iscex ADC found dead. Replaced, c1iscex working again

c1iscex has not been running for a couple of days (since the power shutdown at least). I was assuming that the problem was recurrence of the c1iscex IO chassis issue from a couple weeks ago (5854).  However, upon investigation I found that the timing signals were all fine.  Instead, the IOP was reporting that it was finding now ADC, even though there is one in the chassis.

Since I had a spare ADC that was going to be used for the CyMAC, I decided to try swapping it out to see if that helped.  Sure enough, the system came up fine with the new ADC.  The IOP (c1x01) and c1scx are now both running fine.

I assume the issue before might have been caused by a failing and flaky ADC, which has now failed.  We'll need to get a new ADC for me to give back to the CyMAC.

  6002   Thu Nov 24 15:27:15 2011 kiwamuUpdateCDSc1iscey hardware rebooted
The c1iscey machine crashed around 1:00 AM last night and I did a hard-ware reboot by pressing a button on the front panel of the machine.
After the reboot its been running okay so far.
The crash happened after I pressed the "Diag Reset" button on the CDS status screen.
  6011   Fri Nov 25 22:11:12 2011 MirkoUpdateCDSBeware of fancy filter modules

[Rana, Den, Mirko]

It seems you can shoot yourself in the foot if your filter modules are too complex.

Den discovered this when looking into the C1:SUS-MC?_SUSPOS filter module named Cheby, consisting of cheby1("LowPass",6,1,12)cheby1("LowPass",2,0.1,3)gain(1.13501) by noticing that the coherence between input and output of the filter is low.

Cheby filter:

Cheby.png

CoherenceCheby.pdf

This is most likely due to the filter spanning more than the 16 orders of precision that the double data type spans.

The coherence is fine when one splits the filter in two, giving every cheby1 filter its own module. The coherence is also fine when you use the Cheby filter in a 2kHz system, although the freq. response looks very odd

Black: 16kHz, Red 2kHz (yes the filter was converted correctly, no text file editing there)

ChebyAt16kHzBlackand2kHzRed.png

The problem occurs on c1lsc as well as c1sus computer.

 

Looking into the foton files actually points to a precision problem, with the huge range of scale covered in there:

C1:MCS 16kHz (Cheby: Original filter with low coherence. CHbyTST & ChebyTST: Original filter split amongst two filter modules)
################################################################################
### SUS_MC3_LSC                                                              ###
################################################################################
# DESIGN   SUS_MC3_LSC 0 zpk([0],[30],0.333333,"n")
# DESIGN   SUS_MC3_LSC 1 cheby1("LowPass",6,1,12)
# DESIGN   SUS_MC3_LSC 2 cheby1("LowPass",2,0.1,3)gain(1.13501) \
#                       
# DESIGN   SUS_MC3_LSC 3 cheby1("LowPass",2,0.1,3)gain(1.13501)cheby1("LowPass",6,1,12)
# DESIGN   SUS_MC3_LSC 4 ellip("BandStop",4,1,40,16.1,16.9)ellip("BandStop",4,1,40,23.7,24.5)gain(1.25871)
###                                                                          ###
SUS_MC3_LSC 0 12 1  32768      0 30:0.0          9.942903833923793  -0.9885608209680459   0.0000000000000000  -1.0000000000000000   0.0000000000000000
SUS_MC3_LSC 1 21 3      0      0 CHbyTST     9.095012702673064e-18  -1.9978637592754149   0.9978663974923444   2.0000000000000000   1.0000000000000000
                                                                 -1.9984258494490537   0.9984376515442090   2.0000000000000000   1.0000000000000000
                                                                 -1.9994068831713223   0.9994278587363880   2.0000000000000000   1.0000000000000000
SUS_MC3_LSC 2 12 1  32768      0 ChebyTST    1.228759186937126e-06  -1.9972699801052749   0.9972743606395355   2.0000000000000000   1.0000000000000000
SUS_MC3_LSC 3 12 4  32768      0 Cheby       1.117558041371939e-23  -1.9972699801052749   0.9972743606395355   2.0000000000000000   1.0000000000000000
                                                                 -1.9978637592754149   0.9978663974923444   2.0000000000000000   1.0000000000000000
                                                                 -1.9984258494490537   0.9984376515442090   2.0000000000000000   1.0000000000000000
                                                                 -1.9994068831713223   0.9994278587363880   2.0000000000000000   1.0000000000000000
SUS_MC3_LSC 4 12 8  32768      0 BounceRoll     0.9991466189294013  -1.9996634951844035   0.9997010181703262  -1.9999611719719754   0.9999999999999997
                                                                 -1.9999303040590390   0.9999684339228864  -1.9999605309876360   0.9999999999999999
                                                                 -1.9999248796830529   0.9999668732412945  -1.9999594299327190   1.0000000000000002
                                                                 -1.9996385459838455   0.9996812069238987  -1.9999587601905868   1.0000000000000000
                                                                 -1.9996161812709703   0.9996978939989944  -1.9999163485656493   0.9999999999999999
                                                                 -1.9998855694973159   0.9999681878303275  -1.9999154056705493   0.9999999999999998
                                                                 -1.9998788577090287   0.9999671193335300  -1.9999137972442669   1.0000000000000000
                                                                 -1.9995951159123118   0.9996843310430819  -1.9999128255920269   1.0000000000000000

C1:OAF 2kHz
###############################################################################
### YARM_IN                                                                  ###
################################################################################
# DESIGN   YARM_IN 0 zpk([0],[30],0.333333,"n")
# DESIGN   YARM_IN 3 cheby1("LowPass",6,1,12)cheby1("LowPass",2,0.1,3)gain(1.13501)
# DESIGN   YARM_IN 4 ellip("BandStop",4,1,40,16.1,16.9)ellip("BandStop",4,1,40,23.7,24.5)gain(1.25871)
# DESIGN   YARM_IN 8 cheby1("LowPass",6,1,12)cheby1("LowPass",2,0.1,3)gain(1.13501)zpk([],[10],1,"n")
###                                                                          ###
YARM_IN  0 12 1   4096      0 30:0.0           9.56649943398763  -0.9119509539166185   0.0000000000000000  -1.0000000000000000   0.0000000000000000
YARM_IN  3 12 4   4096      0 Cheby       1.829878084970283e-16  -1.9828889048300398   0.9830565293861987   2.0000000000000000   1.0000000000000000
                                                                 -1.9868188576622443   0.9875701115261976   2.0000000000000000   1.0000000000000000
                                                                 -1.9940934073784453   0.9954330165532327   2.0000000000000000   1.0000000000000000
                                                                 -1.9781245722853238   0.9784022621062476   2.0000000000000000   1.0000000000000000

Attachment 1: ChebyTST3.png
ChebyTST3.png
  6013   Sat Nov 26 02:05:43 2011 MirkoUpdateCDSBeware of fancy filter modules

 

We replaced the complicated Cheby filter module with three separate filter modules. Probably the filter doesn't need to be so complicated, but rather not change too many things at once. The new filter modules are called:
Ch1, Ch2, Ch3 and are in filter module 3,9, and 10 of the C1:SUS-MC?_SUSPOS filters. The coherence with these filters is fine. Someone should look into the possibility of simplifying these filters.

It would be good to check for numerical problems in other filters!

  6017   Sat Nov 26 10:55:40 2011 ranaUpdateCDSBeware of fancy filter modules

 

 Could be that what we're seeing is the noise floor of the Direct Form II filter structure (see Matt's 2008 elog) which shows an example (also see G0900928-v1 ).

 

  6020   Mon Nov 28 06:53:30 2011 kiwamuUpdateCDSc1sus shutdown

I have restarted the c1sus machine around 9:00 PM yesterday and then shut it down around 4:00 AM this morning after a little bit of taking care of the interferomter.

Quote from #6016

c1sus has been shutdown so that the optics dont bang around.  This is because the watch dogs are not working.

  6026   Mon Nov 28 16:46:55 2011 kiwamuUpdateCDSc1sus is now up

I have restarted the c1sus machine and burt-restored c1sus and c1mcs to the day before Thank giving, namely 23rd of November.

Quote from #6020

I have restarted the c1sus machine around 9:00 PM yesterday and then shut it down around 4:00 AM this morning after a little bit of taking care of the interferometer.

  6030   Mon Nov 28 19:24:51 2011 JenneUpdateCDSBeware of fancy filter modules

[Rana, Jenne]

Some of the funniness is some kind of mysterious interaction between 2 filter modules in the filter banks.  Just FM1 (30:0.0) or just FM4 (Cheby, which is 2 cheby1's) has reasonable coherence.  Both FM1 and FM4 together doesn't do so well - the coherence goes way down.

Just FM1 (30:0.0)

SUSPOS_ETMY_30and0_measured_vs_idealTF.pdf

Just FM4 (Cheby)

SUSPOS_ETMY_Cheby_measured_vs_idealTF.pdf

 Both FM1 and FM4

SUSPOS_ETMY_30and0andCheby_measured_vs_idealTF.pdf

 All the coherences plotted together

SUSPOS_ETMY_30and0andCheby_compareCoherence.pdf

You'd think that the signal encounters FM1, gets filtered, and that result is the signal sent to the next active filter module, FM4, so the 2 filter modules shouldn't interact.  But clearly there's some funny business here since engaging both makes things crappy. 

Matlab investigations to replicate this behavior offline are in progress.

Attachment 4: SUSPOS_ETMY_30and0andCheby_compareCoherence.pdf
SUSPOS_ETMY_30and0andCheby_compareCoherence.pdf
  6031   Mon Nov 28 22:09:24 2011 ranaUpdateCDSBeware of fancy filter modules

To see what might be causing the problem, I used a version of the filter noise test matlab code that Matt had in the elog.

To see if it was a single precision problem, I just recast the input data:   x = single(x)

This is not strictly correct, since some of the rest of the operations are as double precision, but I think that attached plot shows that a casting from double to single is close to the right amount of noise to explain our excess noise problem in the 0.1-1 Hz region.

Den is going to interview Alex to find out if we have some kind of issue like this. My understanding was that all of our filter module calculations were being done in double precision (64 bit), but its possible that some single stuff has crept back in. Currently the FIR filtering code IS single precision and in the past, the SUS code which didn't carry the LSC signals (meaning ASC and damping) were done in single precision.

Attachment 1: noise.pdf
noise.pdf
  6033   Tue Nov 29 04:47:49 2011 kiwamuUpdateCDSc1sus shut down again

I have shut down the c1sus machine at 3:30 AM.

  6037   Tue Nov 29 15:30:01 2011 jamieUpdateCDSlocation of currently used filter function

So I tracked down where the currently-used filter function code is defined (the following is all relative to /opt/rtcds/caltech/c1/core/release):

Looking at one of the generated front-end C source codes (src/fe/c1lsc/c1lsc.c) it looks like the relevant filter function is:

filterModuleD()

which is defined in:

src/include/drv/fm10Gen.c

and an associated header file is:

src/include/fm10Gen.h
  6038   Tue Nov 29 15:57:43 2011 DenUpdateCDSlocation of currently used filter function

 

We are interested in the following question : Can the structures defined in fm10Gen.h (or some other *.c *.h files with defined as FLOAT variables) create single precision instead of double in the filter calculations?

 

typedef struct FM_OP_IN{
  UINT32 opSwitchE;     /* Epics Switch Control Register; 28/32 bits used*/
  UINT32 opSwitchP;     /* PIII Switch Control Register; 28/32 bits used*/
  UINT32 rset;          /* reset switches */
  float offset;         /* signal offset */
  float outgain;        /* module gain */
  float limiter;        /* used to limit the filter output to +/- limit val */
  int rmpcmp[FILTERS];  /* ramp counts: ramps on a filter for type 2 output*/
                        /* comparison limit: compare limit for type 3 output*/
                        /* not used for type 1 output filter */
  int timeout[FILTERS]; /* used to timeout wait in type 3 output filter */
  int cnt[FILTERS];     /* used to keep track of up and down cnt of rmpcmp */
                        /* should be initialized to zero */
  float gain_ramp_time; /* gain change ramping time in seconds */
} FM_OP_IN;  

 

  6042   Tue Nov 29 18:54:29 2011 kiwamuUpdateCDSc1sus machine up

[Zach / Kiwamu]

 Woke up the c1sus machine in order to lock PSL to MC so that we can observe the effect of not having the EOM heater.

  6048   Wed Nov 30 01:35:49 2011 JenneUpdateCDSOSEM noise / nullstream and what does it mean for satellites

I'm picking points off of this no-magnet OSEM plot, and I thought I'd write them down somewhere so I don't have to do it again when I lose my sticky note...

1e-2 Hz        1.05e-2 um/rtHz

1e-1 Hz        3.4e-3 um/rtHz

1 Hz            1.3e-3 um/rtHz

10 Hz          2.5e-4 um/rtHz

60 Hz          7.5e-5 um/rtHz

100 Hz        7e-5 um/rtHz

400 Hz        7e-5 um/rtHz

  6049   Wed Nov 30 02:04:26 2011 rana, den, jenne, kiwamu, jzweizigUpdateCDSFiltering Noise issue tracked down ???

 You can read through all of our past tests to see what didn't work in tracking things down. As Den mentions, there was actually a lot of evidence that there was some double->single precision action in the filter calculation causing the noise we saw.

However, it turns out that this is NOT the case.

This afternoon I was so confused that I enlisted JZ to help us out. He came over and I tried to replicate the error. When looking at the time series, we noticed that it wasn't random noise; the signals seem to be getting clipped as they crossed zero. Sort of like a stiction problem. JZ left to go replicate the error on an offline system.

This turned out to be the important clue. As we examine the code we find this inside of fm10Gen.c:

if((new_hist < 1e-20) && (new_hist > -1e-20)) new_hist = new_hist<0 ? -1e-20: 1e-20;

this is line is basically trapping the filter history at 1e-20, to prevent some kind of numerical underflow problem (?). Seems reasonable, except that some filters which have higher order low passing in them actually have an overall scale factor which can be small (even as small as 1e-23, as Den pointed out).

So the reason we saw such weird behavior is that the first filter in SUSPOS is an AC coupling filter. This takes the OSEM signal and remove the large mean value. Then the next filter multiplies it by 1e-23 before doing the filtering and you end up with this noise in the filter history.

I looked and this line is commented out in the new BiQuad code, but as far as I can tell this issue has been around in aLIGO, eLIGO, iLIGO, etc. for a long time and could have been causing many cases of excess noise whenever we ended up a tiny gain factor in an IIR filter. At the 40m, there are easily a hundred such cases.

For now, I suppose we can just change this number to 1e-40 or so. I don't know how to calculate what the right number should be. Not sure why this underflow is not an issue for the BiQuad, however.

  6051   Wed Nov 30 11:04:26 2011 josephbUpdateCDSFiltering Noise issue tracked down ???

Quote:

For now, I suppose we can just change this number to 1e-40 or so. I don't know how to calculate what the right number should be. Not sure why this underflow is not an issue for the BiQuad, however.

According to the RCG SVN logs, the reason it was removed was a more general change done to the compiled code, not specific to just the biquad.  Basically, the ability to have an underflow number (subnormal) has been turned off completely by having any number that underflows set to zero. I'm not positive, but from a quick search looks that the smallest number before hitting is an underflow as a double is 2.2250738585072014e-308.

Alex's entry from the SVN log for 2663:

Added new fz_daz() function to turn on two bits in the FPU SSE control register.
Bits FZ (flush underflows to zero) and DOZ (denorms are zeros) are set to
avoid runaway code on float/double denorms (really small numbers).
Ref: http://software.intel.com/en-us/articles/how-to-avoid-performance-penalties-for-gradual-underflow-behavior/

SVN log 2664:

Removed +- 1e-20 limiting code, this is taken care of by setting FZ/DOZ bits
in the CPU SEE control register (see mathInline.h)

SVN log 2665:

Kill the underflows and roll down float denorms to zero,
see fz_doz() in mathInline.h.

  6052   Wed Nov 30 11:36:12 2011 DenUpdateCDSFiltering Noise issue tracked down ???

Quote:

 if((new_hist < 1e-20) && (new_hist > -1e-20)) new_hist = new_hist<0 ? -1e-20: 1e-20;

20 is indeed a random number. We can change it to 300. Alex said that during that iir filter calculations sometimes numbers are very small and if they are less then 1e-308 then a very slow code in the processor is executed and this will crash the online system. For single precision this number is 1e-38 and may be 10 years ago it was not decided for sure what to use - float or double. 20 will be "OK" for both but as we can see causes other problems.

Anyway, Alex removed this line from the code and added another code that sets the two proper bits in the MXCSR register and prohibits to the CPU to run the slow code. As far as I understand if the numbers are less then 1e-308 they become 0. Roughly, this is equivalent to 

 if((new_hist < 1e-308) && (new_hist > -1e-308)) new_hist = 0;

This is in 2.4 release. It is in the svn. I think we can install it and figure out if the problem is gone.

 

  6091   Thu Dec 8 19:48:23 2011 kiwamuUpdateCDSrestarted c1lsc machine and daqd

Since the c1lsc machine became frozen I restarted the c1lsc machined and daqd.

Then I burtrestored c1lsc, c1ass and c1oaf to this evening. They seem running okay.

  6095   Fri Dec 9 15:14:41 2011 DenUpdateCDSrelease 2.4

Alex has created a 2.4 branch of the RCD. Jamie, we can try to compile and install it. As a test a did it for c1oaf, it compiles, installs and runs once variables SITE, IFO, RCD_LIBRARY_PATH are properly defined. As we do not want to run one model at 2.4 code and others at 2.1, I recompiled c1oaf back to 2.1. Jamie, please, let me know when you are ready to upgrade to 2.4 release.

  6106   Mon Dec 12 13:02:08 2011 kiwamuUpdateCDSdaqd restarted

I have restarted the daqd process at 1:01 PM since I have added some new ALS's daq channels.

  6124   Thu Dec 15 11:47:43 2011 jamieUpdateCDSRTS UPGRADE IN PROGRESS

I'm now in the middle of upgrading the RTS to version 2.4.

All RTS systems will be down until futher notice...

  6125   Thu Dec 15 22:22:18 2011 jamieUpdateCDSRTS upgrade aborted; restored to previous settings

Unfortunately, after working on it all day, I had to abort the upgrade and revert the system back to yesterday's state.

I think I got most of the upgrade working, but for some reason I could never get the new models to talk to the framebuilder.  Unfortunately, since the upgrade procedure isn't document anywhere, it was really a fly by the seat of my pants thing.  I got some help from Joe, which got me through one road block, but I ultimately got stumped.

I'll try to post a longer log later about what exactly I went through.

In any event, the system is back to the state is was in yesterday, and everything seems to be working.

  6149   Mon Dec 26 12:04:41 2011 kiwamuUpdateCDSc1gcy.ini hand edited

I have edited c1scx.ini by hand in order to acquire some green locking related channels.

Somehow c1sus.ini, c1mcs.ini, c1scx.ini and c1scy.ini are not accessible via the daqconfig script.

As far as I remember it had been accessible via daqconfig a week ago when I edited c1scy.ini.

Anyway I had to edit it by hand. They need to be fixed at some point

  6173   Thu Jan 5 09:59:27 2012 JamieUpdateCDSRTS/RCG/DAQ UPGRADE TO COMMENCE

RTS/RCG/DAQ UPGRADE TO COMMENCE

I will be attempting (again) to upgrade the RTS, including the RCG and the daqd, to version 2.4 today.  The RTS will be offline until further notice.

  6174   Thu Jan 5 20:40:21 2012 JamieUpdateCDSRTS upgrade aborted; restored to previous settings; fb symmetricom card failing?

After running into more problems with the upgrade, I eventually decided to abort todays upgrade attempt, and revert back to where we were this morning (RTS 2.1).  I'll try to follow this with a fuller report explaining what problems I encountered when attempting the upgrade.

However, when Alex and I were trying to figure out what was going wrong in the upgrade, it appears that the fb symmetricom card lost the ability to sync with the GPS receiver.  When the symmeticom module is loaded, dmesg shows the following:

[  285.591880] Symmetricom GPS card on bus 6; device 0
[  285.591887] PIC BASE 2 address = fc1ff800
[  285.591924] Remapped 0x17e2800
[  285.591932] Current time 947125171s 94264us 800ns 
[  285.591940] Current time 947125171s 94272us 600ns 
[  285.591947] Current time 947125171s 94280us 200ns 
[  285.591955] Current time 947125171s 94287us 700ns 
[  285.591963] Current time 947125171s 94295us 800ns 
[  285.591970] Current time 947125171s 94303us 300ns 
[  285.591978] Current time 947125171s 94310us 800ns 
[  285.591985] Current time 947125171s 94318us 300ns 
[  285.591993] Current time 947125171s 94325us 800ns 
[  285.592001] Current time 947125171s 94333us 900ns 
[  285.592005] Flywheeling, unlocked...

Because of this, the daqd doesn't get the proper timing signal, and consequently is out of sync with the timing from the models.

It's completely unclear what caused this to happen.  The card seemed to be working all day today, then Alex and I were trying to debug some other(maybe?) timing issues and the symmetricom card all of a sudden stopped syncing to the GPS.  We tried rebooting the frame builder and even tried pulling all the power to the machine, but it never came back up.  We checked the GPS signal itself and to the extend that we know what that signal is supposed to look like it looked ok.

I speculate that this is also the cause of the problems were were seeing earlier in the week.  Maybe the symmetricom card has just been acting flaky, and something we did pushed it over the edge.

Anyway, we will try to replace it tomorrow, but Alex is skeptical that we have a replacement of this same card.  There may be a newer Spectracom card we can use, but there may be problems using it on the old sun hardware that the fb is currently running on.  We'll see.

In the mean time, the daqd is running rogue, off of it's own timing.  Surprisingly all of the models are currently showing 0x0 status, which means no problems.  It doesn't seem to be recording any data, though.  Hopefully we'll get it all sorted out tomorrow.

  6175   Fri Jan 6 01:00:56 2012 kiwamuUpdateCDSc1scx out of sync

Both the c1scx and its IOP realtime processes became out of sync.

Initially I found that the c1scx didn't show any ADC signals, though the sync sign was green.

Then I software-rebooted the c1iscex machine and then it became out of sync.

For tonight this is fine because I am concentrating on the central part anyway.

  6176   Fri Jan 6 11:49:13 2012 JamieUpdateCDSframebuilder taken offline to diagnose problem with symmetricom timing card

Alex and I have taken the framebuilder offline to try to see what's wrong with the symmetricom card.  We have removed the card from the chassis and Alex has taken it back to downs to do some more debugging.

We have been formulating some alternate methods to get timing to the fb in case we can't end up getting the card working.

  6177   Fri Jan 6 14:31:54 2012 JamieUpdateCDSframebuilder back online, using NTP time syncronization

The framebuilder is back online now, minus it's symmetricom GPS card.  The card seems to have failed entirely, and was not able to be made to work at downs either.  It has been entirely removed from fb.

As a fall back, the system has been made to work off of the system NTP-based time synchronization.  The latest symmetricom driver, which is part of the RCG 2.4 branch, will fall back to using local time if the GPS synchronization fails.  The new driver was compiled from our local checkout of the 2.4 source in the new to-be-used-in-the-future rtscore directory:

controls@fb ~ 0$ diff {/opt/rtcds/rtscore/branches/branch-2.4/src/drv/symmetricom,/lib/modules/2.6.34.1/kernel/drivers/symmetricom}/symmetricom.ko
controls@fb ~ 0$ 

The driver was reloaded.   daqd was also linked against the last running stable version and restarted:

controls@fb ~ 0$ ls -al $(readlink -f /opt/rtcds/caltech/c1/target/fb/daqd)
-rwxr-xr-x 1 controls controls 6592694 Dec 15 21:09 /opt/rtcds/caltech/c1/target/fb/daqd.20120104
controls@fb ~ 0$ 

We'll have to keep an eye on the system, to see that it continues to record data properly, and that the fb and the front-ends remain in sync.

The question now is what do we do moving forward.  CDS is not supporting the symmetricom cards anymore, and have moved to using Spectracom GPS/IRIG-B cards.  However, Downs has neither at the moment.  Even if we get a new Spectracom card, it might not work in this older Sun hardware, in which case we might need to consider upgrading the framebuilder to a new machine (one supported by CDS).

  6179   Fri Jan 6 20:10:48 2012 ranaUpdateCDSframebuilder back online, using NTP time syncronization

 

 You (Jamie) should talk with Rolf and Alex to find out what framebuilder they will support for > 3 years. Then we should buy that along with the adapter card which allows us to use the same RAID we now have for the frames.

  6182   Mon Jan 9 23:52:15 2012 kiwamuUpdateCDSSUS channels not accessible from dataviewer

[John / Kiwamu]

 We found that some of the suspensions channels (for example C1:SUS-BS_POS_IN1 and etc) were not accessible from dataviewer for some reasons.

So far it seems none of the channels associated with c1sus are accessible from dataviewer.

  6204   Tue Jan 17 02:44:59 2012 kiwamuUpdateCDSawg not working

AWG is not working. This needs to be fixed.

I could set the channel and the parameters in the AWGGUI screen, but it never inject signals to the realtime system.

  6207   Tue Jan 17 16:09:20 2012 kiwamuUpdateCDSawg not working on the c1sus machine

Actually awg works fine without any problems when the excitation channels belong to the c1lsc machine.

It seems that the awg doesn't inject signals on the channels of the c1sus machine, for example C1:SUS-BS_LSC_EXC and so on.

Quote from #6204

AWG is not working. This needs to be fixed.

  6319   Fri Feb 24 23:14:09 2012 kiwamuUpdateCDStdsavg went crazy

I found that the LSCoffset script didn't work today. The script is supposed to null the electrical offsets in all the LSC channels.

I went through the sentences in the script and eventually found that the tdsavg command returns 0 every time.

I thought this was related to the test points, so I ran the following commands to flush all the test point running and the issue was solved.

[term]> diag              
[diag]>open               
[diag]> diag  tp clear *  
 

EDIT, JCD 11June2012:  3rd line there should just be [diag]> tp clear *

  6327   Mon Feb 27 19:04:13 2012 jamieUpdateCDSspontaneous timing glitch in c1lsc IO chassis?

For some reason there appears to have been a spontaneous timing glitch in the c1lsc IO chassis that caused all models running on c1lsc to loose timing sync with the framebuilder.  All the models were reporting "0x4000" ("Timing mismatch between DAQ and FE application") in the DAQ status indicator.  Looking in the front end logs and dmesg on the c1lsc front end machine I could see no obvious indication why this would have happened.  The timing seemed to be hooked up fine, and the indicator lights on the various timing cards were nominal.

I restarted all the models on c1lsc, including and most importantly the c1x04 IOP, and things came back fine.  Below is the restart procedure I used.  Note I killed all the control models first, since the IOP can't be restarted if they're still running.  I then restarted the IOP, followed by all the other control models.

controls@c1lsc ~ 0$ for m in lsc ass oaf; do /opt/rtcds/caltech/c1/scripts/killc1${m}; done
controls@c1lsc ~ 0$ /opt/rtcds/caltech/c1/scripts/startc1x04
c1x04epics C1 IOC Server started
 * Stopping IOP awgtpman ...                                                                      [ ok ]
controls@c1lsc ~ 0$ for m in lsc ass oaf; do /opt/rtcds/caltech/c1/scripts/startc1${m}; done
c1lscepics: no process found
ERROR: Module c1lscfe does not exist in /proc/modules
c1lscepics C1 IOC Server started
 * WARNING:  awgtpman_c1lsc has not yet been started.
c1assepics: no process found
ERROR: Module c1assfe does not exist in /proc/modules
c1assepics C1 IOC Server started
 * WARNING:  awgtpman_c1ass has not yet been started.
c1oafepics: no process found
ERROR: Module c1oaffe does not exist in /proc/modules
c1oafepics C1 IOC Server started
 * WARNING:  awgtpman_c1oaf has not yet been started.
controls@c1lsc ~ 0$ 
  6392   Fri Mar 9 11:59:38 2012 Zweizig the ELOG MavenSummaryCDSNDS2 restart

 Hi Rana,

It looks like the channe list file has a few blank lines that the channel list reader is choking on. I removed the lines and it is working now.. I have made the error message a bit more obvious (gave the file name  and line number) and allowed it to ignore empty lines so this won't cause problems with future versions (when installed). The bottom line is nds2 is now running on mafalda.
Best regards,

John   

  6397   Fri Mar 9 20:44:24 2012 Jim LoughUpdateCDSDAQ restart with new ini file

DAQ reload/restart was performed at about 1315 PST today. The previous ini file was backed up as c1pem20120309.ini in the /chans/daq/working_backups/ directory.

I set the following to record:

The two JIMS channels at 2048:
[C1:PEM-JIMS_CH1_DQ] Persistent version of JIMS channel. When bit drops to zero indicating something bad (BLRMS threshold exceeded) happens the bit stays at zero  for >= the value of the persist EPICS variable.
[C1:PEM-JIMS_CH2_DQ] Non-persistent version of JIMS channel.

And all of the BLRMS channels at 256:
Names are of the form:
[C1:PEM-RMS_ACC1_F0p1_0p3_DQ]
[C1:PEM-RMS_ACC1_F0p3_1_DQ]

On monday I intend to look at the weekend seismic data to establish thresholds on the JIMS channels.

256 was the lowest rate possible according to the RCG manual. The JIMS channels are recorded at 2048 because I couldn't figure out how to disable the decimation filter. I will look into this further.

  6404   Tue Mar 13 13:28:31 2012 Ryan FisherUpdateCDSDAQ restart with new ini file
Extra note: This was the ini file that was edited:
/cvs/cds/rtcds/caltech/c1/chans/daq/C1PEM.ini
  6422   Thu Mar 15 08:48:40 2012 RyanSummaryCDSSummary of Syracuse Visit to 40m Mar 5-9 2012

JIMS Channels in PEM Model

The PEM model has been modified now to include the JIMS(Joint Information Management System) channel processing. Additionally Jim added test points at the outputs of the BLRMS.

For each seismometer channel, five bands are compared to threshold values to produce boolean results.  Bands with RMS below threshold produce bits with value 1, above threshold results in 0.  These bits are combined to produce one output channel that contains all of the results.

A persistent version of the channel is generated by a new library block that called persist which holds the value at 0 for a number of time steps equal to an EPICS variable setting from the time the boolean first drops to zero. The persist allows excursions shorter than the timestep of a downsampled timeseries to be seen reliably.

The EPICS variables for the thresholds are of the form (in order of increasing frequency):
C1:PEM-JIMS_GUR1X_THRES1
C1:PEM-JIMS_GUR1X_THRES2
etc.
 
The EPICS variables for the persist step size are of the form:
C1:PEM-JIMS_GUR1X_PERSIST
C1:PEM-JIMS_GUR1Y_PERSIST
etc.

The JIMS Channels are being recorded and written to frames:

The two JIMS channels at 2048:
[C1:PEM-JIMS_CH1_DQ] Persistent version of JIMS channel. When bit drops to zero indicating something bad (BLRMS threshold exceeded) happens the bit stays at zero  for >= the value of the persist EPICS variable.
[C1:PEM-JIMS_CH2_DQ] Non-persistent version of JIMS channel.

And all of the BLRMS channels at 256:
Names are of the form:
[C1:PEM-RMS_ACC1_F0p1_0p3_DQ]
[C1:PEM-RMS_ACC1_F0p3_1_DQ]

For additional details about the JIMS Channels and the implementation, please see the previous elog entries by Jim.

 

Conlog

I have a working aLIGO Conlog/EPICS Log installed and running on megatron.

Please see this wiki page for the details of use:
https://wiki-40m.ligo.caltech.edu/aLIGO%20EPICs%20log%20%28conlog%29

I also edited this page with restart instructions for megatron:
https://wiki-40m.ligo.caltech.edu/Computer_Restart_Procedures#megatron

Please see Ryan's previous elog entries for installation details.

Future Work

  • Determine useful thresholds for each band
  • Generate MEDM Screens for JIMS Channels
  • Add a decimation option to channels
  • Add EPICS Strings in PEM model to describe bits in JIMS Channels
  • Add additional JIMS Channels: Testing additional characterization methods
  • Implement a State Log on Megatron: Will Provide a 1Hz index into JIMS Channels
  • Generate a single web page that allows access to aLIGO Conlog/EPICS Log and State Log

 

  6436   Thu Mar 22 16:45:06 2012 kiwamuUpdateCDSc1scx and c1scy not properly running

It seems that neither c1scx nor c1scy is working properly as their ADC counts are showing digital-zeros.

However the IOPs, c1gcx and c1gcy look running fine, and also the IOPs seem successfully recognizing the ADCs according to dmesg.

Also there is one more confusing fact : c1scx and c1scy are synchronizing to the timing signal somehow.

I restarted the c1scx front end model to see if this helps, but unfortunately it didn't work.

As this is not the top priority concern for now, I am leaving them as they are now with the watchgods off.

(I may try hardware rebooting them in this evening)

Quote from #6434

The power was turned back on at 4pm It took some time for Suresh to restart the computers. We have damping but things are not perfect yet. Auto BURTH did not work well.

 

  6438   Thu Mar 22 17:41:15 2012 sureshUpdateCDSc1scx and c1scy not properly running

Quote:

It seems that neither c1scx nor c1scy is working properly as their ADC counts are showing digital-zeros.

Quote from #6434

The power was turned back on at 4pm It took some time for Suresh to restart the computers. We have damping but things are not perfect yet. Auto BURTH did not work well.

 When Steve and I restarted the c1iscex and c1iscey computers after the power shutdown, the models within them did not start-up automatically.  I had to start them manually from a terminal in the control room. 

I also tried rebooting the FB a couple of times.  Did not make any difference.

Manually starting the c1x05, c1scy and c1x01, c1scx models (with the Burt Restore button ON) did not resolve the issue of zeros in the epics screens.  though it did re-establish timing. 

  6439   Thu Mar 22 23:43:56 2012 KojiUpdateCDSc1scx and c1scy not properly running

Did you guys checked if the simplant switch is set to "REAL WORLD" mode?

Edit by KI:

Bingo ! The input signals were bypassed to the simplant. I switched the simplant settings to REAL WORLD and now both end suspensions are working fine.

  6540   Tue Apr 17 11:05:04 2012 JamieUpdateCDSCDS upgrade in progress

I am continuing to attempt to upgrade the CDS system to RTS 2.5.  Systems will continue to be up and down for the rest of the day.

  6541   Tue Apr 17 19:03:09 2012 JamieUpdateCDSCDS upgrade in progress

Upgrade progresses, but not complete.  There are some relatively minor issues, and one potentially big issue.

All new software has been installed, including the new epics that supports long channel names.

I've been doing a LOT of cleanup.  It was REALLY messy in there.

The new framebuilder/daqd code is running on fb.

Models are compiling with the new RCG and I am able to get them running.  Some of them are not compiling for relatively minor reasons (the simulink models need updating).  I'm also running into compile problems with IOPs that are using the dolphin drivers.

The major issue is that the framebuilder and the models are not syncing their timing, so there's no data collection.  I've spoken to Alex and he and Rolf are going to come over tomorrow to sort it out.  It's possible that we're missing timing hardware that the new code is expecting.

There are still some stability issues I haven't sorted out yet, and I have a lot more cleanup to do.

At this rate I'm going to shoot for being done Thursday.

  6546   Wed Apr 18 19:59:48 2012 JamieUpdateCDSCDS upgrade success

The upgrade is nearly complete:

  • new daqd code is running on fb
  • the fe/daqd timing issue was resolved by adjusting the GPS offset in the daqdrc.  I will document this more later.
  • the power outage conveniently rebooted all the front-end machines, so they're all now running new caRepeater
  • all models have been successfully recompiled with RCG 2.5 (with only a couple small glitches)
  • all new models are running on all front-end machines (with a couple exceptions)
  • all suspension models seem to be damping under local control (PRM is having troubles that are likely unrelated to the upgrade).
  • a lot of cleanup has been done

Remaining tasks/issues:

  • more testing OF EVERYTHING needs o be done
  • I did not yet update the DIS dolphin code, so we're running with the old code.  I don't think this is a problem, but it would be nice to get us running what they're running at the sites
  • I tried to cleanup/simplify how front-end initialization is done.  However, there is a problem and models are not auto-starting after reboot.  This needs to be fixed.
  • the userapps directory is in a new place (/opt/rtcds/userapps).  Not everything in the old location was checked into the repository, so we need to check to make sure everything that needs to be is checked in, and that all the models are running the right code.
  • the c1oaf model seems to be having a dolphin issue that needs to be sorted
  • the c1gfd model causes c1ioo to crash immediately upon being loaded.  I have removed it from the rtsystab.  That model needs to be fixed.
  • general model cleanup is in order.
  • more front-end cleanup is needed, particularly in regards to boot-up procedure.
  • document the entire upgrade procedure.

I'll finish up these remaining tasks tomorrow.

  6547   Wed Apr 18 23:12:49 2012 DenUpdateCDSoaf

Adaptive filter outputs some non-zero signal in the OFF position

Screenshot.png

I turned "ON" one of them and c1lsc suspended, I've rebooted it and restarted models on c1lsc and c1sus.

Now it also outputs something non-zero though the first line of the adaptive code is "if(OFF) output=0.0; return;" May be another version of the code has been compiled.

Edit: Old version (~september) of the code and oaf model is running now. In the 2.1 code there was a link from src/epics/simLink to oaf code for each DOF. It seems that 2.5 version finds models and c codes in standard directories. I need to move working code to the proper directory.

ELOG V3.1.3-