ID |
Date |
Author |
Type |
Category |
Subject |
5896
|
Tue Nov 15 15:56:23 2011 |
jamie | Update | CDS | dataviewer doesn't run |
Quote: |
Dataviewer is not able to access fb somehow.
I restarted daqd on fb but it didn't help.
Also the status screen is showing a blank white form in all the realtime models. Something bad is happening.
|
So something very strange was happening to the framebuilder (fb). I logged on to the fb and found this being spewed to the logs once a second:
[Tue Nov 15 15:28:51 2011] going down on signal 11
sh: /bin/gcore: No such file or directory
[Tue Nov 15 15:28:51 2011] going down on signal 11
sh: /bin/gcore: No such file or directory
[Tue Nov 15 15:28:51 2011] going down on signal 11
sh: /bin/gcore: No such file or directory
[Tue Nov 15 15:28:51 2011] going down on signal 11
sh: /bin/gcore: No such file or directory
[Tue Nov 15 15:28:51 2011] going down on signal 11
sh: /bin/gcore: No such file or directory
[Tue Nov 15 15:28:51 2011] going down on signal 11
sh: /bin/gcore: No such file or directory
[Tue Nov 15 15:28:51 2011] going down on signal 11
sh: /bin/gcore: No such file or directory
[Tue Nov 15 15:28:51 2011] going down on signal 11
sh: /bin/gcore: No such file or directory
[Tue Nov 15 15:28:51 2011] going down on signal 11
sh: /bin/gcore: No such file or directory
Apparently some daqd subprocess or thread was trying to call /bin/gcore, and failing since that file doesn't exist. This apparently started at around 5:52 AM last night:
[Tue Nov 15 05:46:52 2011] main profiler warning: 1 empty blocks in the buffer
[Tue Nov 15 05:46:53 2011] main profiler warning: 0 empty blocks in the buffer
[Tue Nov 15 05:46:54 2011] main profiler warning: 0 empty blocks in the buffer
[Tue Nov 15 05:46:55 2011] main profiler warning: 0 empty blocks in the buffer
[Tue Nov 15 05:46:56 2011] main profiler warning: 0 empty blocks in the buffer
...
[Tue Nov 15 05:52:43 2011] main profiler warning: 0 empty blocks in the buffer
[Tue Nov 15 05:52:44 2011] main profiler warning: 0 empty blocks in the buffer
[Tue Nov 15 05:52:45 2011] main profiler warning: 0 empty blocks in the buffer
GPS time jumped from 1005400026 to 1005400379
[Tue Nov 15 05:52:46 2011] going down on signal 11
sh: /bin/gcore: No such file or directory
[Tue Nov 15 05:52:46 2011] going down on signal 11
sh: /bin/gcore: No such file or directory
The gcore I believe it's looking for is a debugging tool that is able to retrieve images of running processes. I'm guessing that something caused something in the fb to eat crap, and it was stuck trying to debug itself. I can't tell what exactly happened, though. I'll ping the CDS guys about it. The daqd process was continuing to run, but it was not responding to anything, which is why it could not be restarted via the normal means, and maybe why the various FB0_*_STATUS channels were seemingly dead.
I manually killed the daqd process, and monit seemed to bring up a new process with no problem. I'll keep an eye on it. |
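A minimal sketch of how the log excerpt above could be summarized automatically: it finds the first "going down on signal" line, counts the repeats, and reports the size of any GPS time jump. The function names and the 15 s GPS-UTC offset are assumptions for illustration (the offset only holds for dates between 2009-01-01 and 2012-06-30); nothing here is part of daqd.

```python
import re
from datetime import datetime, timedelta

GPS_EPOCH = datetime(1980, 1, 6)  # GPS time zero, expressed in UTC
GPS_MINUS_UTC = 15  # seconds; assumption, valid 2009-01-01 through 2012-06-30

STAMPED = re.compile(r"^\[(?P<stamp>[^\]]+)\] (?P<msg>.*)$")
GPS_JUMP = re.compile(r"^GPS time jumped from (\d+) to (\d+)$")


def gps_to_utc(gps_seconds):
    """Convert GPS seconds to a naive UTC datetime using the fixed offset above."""
    return GPS_EPOCH + timedelta(seconds=gps_seconds - GPS_MINUS_UTC)


def summarize(lines):
    """Return (first crash time, number of crash lines, list of GPS jumps)."""
    first_crash = None
    crash_count = 0
    jumps = []
    for line in lines:
        line = line.strip()
        jump = GPS_JUMP.match(line)
        if jump:
            old, new = int(jump.group(1)), int(jump.group(2))
            jumps.append((old, new, new - old))
            continue
        stamped = STAMPED.match(line)
        if stamped and stamped.group("msg").startswith("going down on signal"):
            crash_count += 1
            if first_crash is None:
                first_crash = datetime.strptime(
                    stamped.group("stamp"), "%a %b %d %H:%M:%S %Y"
                )
    return first_crash, crash_count, jumps


# A few lines copied from the log excerpt in this entry.
SAMPLE_LOG = """\
[Tue Nov 15 05:52:45 2011] main profiler warning: 0 empty blocks in the buffer
GPS time jumped from 1005400026 to 1005400379
[Tue Nov 15 05:52:46 2011] going down on signal 11
sh: /bin/gcore: No such file or directory
[Tue Nov 15 05:52:46 2011] going down on signal 11
sh: /bin/gcore: No such file or directory
"""

first_crash, crash_count, gps_jumps = summarize(SAMPLE_LOG.splitlines())
print(first_crash, crash_count, gps_jumps)
print(gps_to_utc(gps_jumps[0][1]))
```

Under the stated leap-second assumption, the post-jump GPS value decodes to 13:52:44 UTC, i.e. 05:52:44 Pacific Standard Time, which lines up with the log timestamps around the first crash; the jump itself is 353 seconds.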
5901
|
Tue Nov 15 23:44:44 2011 |
Mirko | Update | CDS | C1:LSC & C1:SUS restarted |
Earlier this evening C1:LSC died when I hit the DAQ reload after adding an OAF channel to be recorded. No change to any model. Had to restart C1:SUS too. Reloaded burts from this morning 5am, except for C1:IOO, which I loaded from 16:07. |
5938
|
Fri Nov 18 01:12:14 2011 |
Suresh | Update | CDS | MC1 LR dead for > 1 month; now revived temporarily |
[Den, Mirko, Suresh]
We were investigating why there is no correlation between MC1 OSEM signals and seismic motion. During this we noticed a recurrence of the old problem of the MC1_LR sensor being dead. I went and pressed down the chip holders where the AA filters used to sit and which now hold the jumper wires. The board is large and flexible, so it is quite likely that some solder joint is broken on the MC1_LR path on this board.
The signal came back to life and is okay now. But it can break off again any time.
Quote: |
Since the MC1 LRSEN channel wasn't working, my input matrix diagonalization wasn't working. So I decided to fix it somehow.
I went to the rack and traced the signal: first at the LEMO monitor on the whitening card, secondly at the 4-pin LEMO cable which goes into the AA chassis.
The signal existed at the input to the AA chassis but not in the screen. So I pressed the jumper wire (used to be AA filter) down for the channel corresponding to the MC1 LRSEN channel.
It now has come back and looks like the other sensors. As you can see from this plot and Joe's entry from a couple weeks ago, this channel has been dead since May 17th.
The ELOG reveals that Kiwamu caught Steve doing some (un-elogged) fooling around there. Burnt Toast -> Steve.

993190663 = free swinging ringdown restarted again
|
|
5971
|
Mon Nov 21 17:07:34 2011 |
Mirko | Update | CDS | c1pem model dead |
For some reason C1PEM doesn't seem to work anymore after a recompilation. It did recompile fine. We just changed some channel / subsystem names.
Tried reverting to the svn version. Doesn't work. Reboot C1SUS also no good. |
5973
|
Mon Nov 21 22:51:55 2011 |
Mirko | Update | CDS | c1pem model dead |
Quote: |
For some reason C1PEM doesn't seem to work anymore after a recompilation. It did recompile fine. We just changed some channel / subsystem names.
Tried reverting to the svn version. Doesn't work. Reboot C1SUS also no good.
|
It is fine again. Thanks Jamie. |
5979
|
Tue Nov 22 18:15:39 2011 |
jamie | Update | CDS | c1iscex ADC found dead. Replaced, c1iscex working again |
c1iscex has not been running for a couple of days (since the power shutdown at least). I was assuming that the problem was a recurrence of the c1iscex IO chassis issue from a couple weeks ago (5854). However, upon investigation I found that the timing signals were all fine. Instead, the IOP was reporting that it was finding no ADC, even though there is one in the chassis.
Since I had a spare ADC that was going to be used for the CyMAC, I decided to try swapping it out to see if that helped. Sure enough, the system came up fine with the new ADC. The IOP (c1x01) and c1scx are now both running fine.
I assume the issue before might have been caused by a failing and flaky ADC, which has now failed. We'll need to get a new ADC for me to give back to the CyMAC. |
6002
|
Thu Nov 24 15:27:15 2011 |
kiwamu | Update | CDS | c1iscey hardware rebooted |
The c1iscey machine crashed around 1:00 AM last night and I did a hardware reboot by pressing a button on the front panel of the machine.
After the reboot it's been running okay so far.
The crash happened after I pressed the "Diag Reset" button on the CDS status screen. |
6011
|
Fri Nov 25 22:11:12 2011 |
Mirko | Update | CDS | Beware of fancy filter modules |
[Rana, Den, Mirko]
It seems you can shoot yourself in the foot if your filter modules are too complex.
Den discovered this when looking into the C1:SUS-MC?_SUSPOS filter module named Cheby, consisting of cheby1("LowPass",6,1,12)cheby1("LowPass",2,0.1,3)gain(1.13501) by noticing that the coherence between input and output of the filter is low.
Cheby filter:


This is most likely due to the filter spanning more than the 16 orders of precision that the double data type spans.
The coherence is fine when one splits the filter in two, giving every cheby1 filter its own module. The coherence is also fine when you use the Cheby filter in a 2kHz system, although the freq. response looks very odd.
Black: 16kHz, Red 2kHz (yes the filter was converted correctly, no text file editing there)

The problem occurs on c1lsc as well as c1sus computer.
Looking into the foton files actually points to a precision problem, with the huge range of scale covered in there:
C1:MCS 16kHz (Cheby: Original filter with low coherence. CHbyTST & ChebyTST: Original filter split amongst two filter modules)
################################################################################
### SUS_MC3_LSC ###
################################################################################
# DESIGN SUS_MC3_LSC 0 zpk([0],[30],0.333333,"n")
# DESIGN SUS_MC3_LSC 1 cheby1("LowPass",6,1,12)
# DESIGN SUS_MC3_LSC 2 cheby1("LowPass",2,0.1,3)gain(1.13501) \
#
# DESIGN SUS_MC3_LSC 3 cheby1("LowPass",2,0.1,3)gain(1.13501)cheby1("LowPass",6,1,12)
# DESIGN SUS_MC3_LSC 4 ellip("BandStop",4,1,40,16.1,16.9)ellip("BandStop",4,1,40,23.7,24.5)gain(1.25871)
### ###
SUS_MC3_LSC 0 12 1 32768 0 30:0.0 9.942903833923793 -0.9885608209680459 0.0000000000000000 -1.0000000000000000 0.0000000000000000
SUS_MC3_LSC 1 21 3 0 0 CHbyTST 9.095012702673064e-18 -1.9978637592754149 0.9978663974923444 2.0000000000000000 1.0000000000000000
-1.9984258494490537 0.9984376515442090 2.0000000000000000 1.0000000000000000
-1.9994068831713223 0.9994278587363880 2.0000000000000000 1.0000000000000000
SUS_MC3_LSC 2 12 1 32768 0 ChebyTST 1.228759186937126e-06 -1.9972699801052749 0.9972743606395355 2.0000000000000000 1.0000000000000000
SUS_MC3_LSC 3 12 4 32768 0 Cheby 1.117558041371939e-23 -1.9972699801052749 0.9972743606395355 2.0000000000000000 1.0000000000000000
-1.9978637592754149 0.9978663974923444 2.0000000000000000 1.0000000000000000
-1.9984258494490537 0.9984376515442090 2.0000000000000000 1.0000000000000000
-1.9994068831713223 0.9994278587363880 2.0000000000000000 1.0000000000000000
SUS_MC3_LSC 4 12 8 32768 0 BounceRoll 0.9991466189294013 -1.9996634951844035 0.9997010181703262 -1.9999611719719754 0.9999999999999997
-1.9999303040590390 0.9999684339228864 -1.9999605309876360 0.9999999999999999
-1.9999248796830529 0.9999668732412945 -1.9999594299327190 1.0000000000000002
-1.9996385459838455 0.9996812069238987 -1.9999587601905868 1.0000000000000000
-1.9996161812709703 0.9996978939989944 -1.9999163485656493 0.9999999999999999
-1.9998855694973159 0.9999681878303275 -1.9999154056705493 0.9999999999999998
-1.9998788577090287 0.9999671193335300 -1.9999137972442669 1.0000000000000000
-1.9995951159123118 0.9996843310430819 -1.9999128255920269 1.0000000000000000
C1:OAF 2kHz
###############################################################################
### YARM_IN ###
################################################################################
# DESIGN YARM_IN 0 zpk([0],[30],0.333333,"n")
# DESIGN YARM_IN 3 cheby1("LowPass",6,1,12)cheby1("LowPass",2,0.1,3)gain(1.13501)
# DESIGN YARM_IN 4 ellip("BandStop",4,1,40,16.1,16.9)ellip("BandStop",4,1,40,23.7,24.5)gain(1.25871)
# DESIGN YARM_IN 8 cheby1("LowPass",6,1,12)cheby1("LowPass",2,0.1,3)gain(1.13501)zpk([],[10],1,"n")
### ###
YARM_IN 0 12 1 4096 0 30:0.0 9.56649943398763 -0.9119509539166185 0.0000000000000000 -1.0000000000000000 0.0000000000000000
YARM_IN 3 12 4 4096 0 Cheby 1.829878084970283e-16 -1.9828889048300398 0.9830565293861987 2.0000000000000000 1.0000000000000000
-1.9868188576622443 0.9875701115261976 2.0000000000000000 1.0000000000000000
-1.9940934073784453 0.9954330165532327 2.0000000000000000 1.0000000000000000
-1.9781245722853238 0.9784022621062476 2.0000000000000000 1.0000000000000000 |
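A minimal sketch of the arithmetic behind the "huge range of scale" remark, using only the overall gain values (first coefficient column) copied from the 16 kHz listing above. It checks that the combined Cheby gain is the product of the two split modules and counts how many decades the scale factor spans. A small scale factor on its own is still representable in a double, so this only quantifies the range; it does not by itself establish where the noise comes from.

```python
import math
import sys

# Overall gain values copied from the SUS_MC3_LSC foton listing (16 kHz).
GAIN_CHBYTST = 9.095012702673064e-18   # cheby1("LowPass",6,1,12)
GAIN_CHEBYTST = 1.228759186937126e-06  # cheby1("LowPass",2,0.1,3)gain(1.13501)
GAIN_CHEBY = 1.117558041371939e-23     # both filters in a single module

# The single-module gain should be the product of the two split gains.
product = GAIN_CHBYTST * GAIN_CHEBYTST

# Number of decades spanned by the combined scale factor.
decades = -math.log10(GAIN_CHEBY)

# Decimal digits carried by a double mantissa, for comparison.
double_digits = -math.log10(sys.float_info.epsilon)

print(f"product of split gains : {product:.15e}")
print(f"combined module gain   : {GAIN_CHEBY:.15e}")
print(f"decades spanned        : {decades:.2f}")
print(f"double mantissa digits : {double_digits:.2f}")
```

The combined gain sits roughly 23 decades below unity, while a double carries a little under 16 significant decimal digits.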
Attachment 1: ChebyTST3.png
|
|
6013
|
Sat Nov 26 02:05:43 2011 |
Mirko | Update | CDS | Beware of fancy filter modules |
We replaced the complicated Cheby filter module with three separate filter modules. Probably the filter doesn't need to be so complicated, but rather not change too many things at once. The new filter modules are called:
Ch1, Ch2, Ch3 and are in filter modules 3, 9, and 10 of the C1:SUS-MC?_SUSPOS filters. The coherence with these filters is fine. Someone should look into the possibility of simplifying these filters.
It would be good to check for numerical problems in other filters! |
6017
|
Sat Nov 26 10:55:40 2011 |
rana | Update | CDS | Beware of fancy filter modules |
Could be that what we're seeing is the noise floor of the Direct Form II filter structure (see Matt's 2008 elog) which shows an example (also see G0900928-v1 ).
|
6020
|
Mon Nov 28 06:53:30 2011 |
kiwamu | Update | CDS | c1sus shutdown |
I have restarted the c1sus machine around 9:00 PM yesterday and then shut it down around 4:00 AM this morning after a little bit of taking care of the interferometer.
Quote from #6016 |
c1sus has been shutdown so that the optics don't bang around. This is because the watch dogs are not working.
|
|
6026
|
Mon Nov 28 16:46:55 2011 |
kiwamu | Update | CDS | c1sus is now up |
I have restarted the c1sus machine and burt-restored c1sus and c1mcs to the day before Thanksgiving, namely the 23rd of November.
Quote from #6020 |
I have restarted the c1sus machine around 9:00 PM yesterday and then shut it down around 4:00 AM this morning after a little bit of taking care of the interferometer.
|
|
6030
|
Mon Nov 28 19:24:51 2011 |
Jenne | Update | CDS | Beware of fancy filter modules |
[Rana, Jenne]
Some of the funniness is some kind of mysterious interaction between 2 filter modules in the filter banks. Just FM1 (30:0.0) or just FM4 (Cheby, which is 2 cheby1's) has reasonable coherence. Both FM1 and FM4 together doesn't do so well - the coherence goes way down.
Just FM1 (30:0.0)

Just FM4 (Cheby)

Both FM1 and FM4

All the coherences plotted together

You'd think that the signal encounters FM1, gets filtered, and that result is the signal sent to the next active filter module, FM4, so the 2 filter modules shouldn't interact. But clearly there's some funny business here since engaging both makes things crappy.
Matlab investigations to replicate this behavior offline are in progress. |
Attachment 4: SUSPOS_ETMY_30and0andCheby_compareCoherence.pdf
|
|
6031
|
Mon Nov 28 22:09:24 2011 |
rana | Update | CDS | Beware of fancy filter modules |
To see what might be causing the problem, I used a version of the filter noise test matlab code that Matt had in the elog.
To see if it was a single precision problem, I just recast the input data: x = single(x)
This is not strictly correct, since some of the rest of the operations are done in double precision, but I think the attached plot shows that a casting from double to single is close to the right amount of noise to explain our excess noise problem in the 0.1-1 Hz region.
Den is going to interview Alex to find out if we have some kind of issue like this. My understanding was that all of our filter module calculations were being done in double precision (64 bit), but it's possible that some single stuff has crept back in. Currently the FIR filtering code IS single precision and in the past, the SUS code which didn't carry the LSC signals (meaning ASC and damping) was done in single precision. |
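A minimal sketch of what the x = single(x) recast does to a number, written in Python rather than Matlab. It rounds a double to the nearest IEEE-754 single and back, then measures the relative rounding error, which is bounded by 2^-24 (about 6e-8). The helper name to_single is made up for this illustration and is not part of Matt's test code.

```python
import math
import struct


def to_single(x):
    """Round a Python float (a double) to the nearest IEEE-754 single and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


x = 0.1
rel_err = abs(to_single(x) - x) / abs(x)

# Worst-case relative rounding error for single and double precision.
single_bound = 2.0 ** -24
double_bound = 2.0 ** -53

print(f"relative error of single(0.1): {rel_err:.3e}")
print(f"single rounding floor: {20 * math.log10(single_bound):.1f} dB re signal")
print(f"double rounding floor: {20 * math.log10(double_bound):.1f} dB re signal")
```

A signal that passes through a single-precision cast therefore picks up rounding noise roughly 144 dB below its own amplitude per sample, compared with roughly 319 dB for double.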
Attachment 1: noise.pdf
|
|
6033
|
Tue Nov 29 04:47:49 2011 |
kiwamu | Update | CDS | c1sus shut down again |
I have shut down the c1sus machine at 3:30 AM. |
6037
|
Tue Nov 29 15:30:01 2011 |
jamie | Update | CDS | location of currently used filter function |
So I tracked down where the currently-used filter function code is defined (the following is all relative to /opt/rtcds/caltech/c1/core/release):
Looking at one of the generated front-end C source codes (src/fe/c1lsc/c1lsc.c) it looks like the relevant filter function is:
filterModuleD()
which is defined in:
src/include/drv/fm10Gen.c
and an associated header file is:
src/include/fm10Gen.h
|
6038
|
Tue Nov 29 15:57:43 2011 |
Den | Update | CDS | location of currently used filter function |
We are interested in the following question: can the structures defined in fm10Gen.h (or in other *.c / *.h files) whose members are declared as float variables cause the filter calculations to be done in single precision instead of double?
typedef struct FM_OP_IN{
  UINT32 opSwitchE;     /* Epics Switch Control Register; 28/32 bits used*/
  UINT32 opSwitchP;     /* PIII Switch Control Register; 28/32 bits used*/
  UINT32 rset;          /* reset switches */
  float offset;         /* signal offset */
  float outgain;        /* module gain */
  float limiter;        /* used to limit the filter output to +/- limit val */
  int rmpcmp[FILTERS];  /* ramp counts: ramps on a filter for type 2 output*/
                        /* comparison limit: compare limit for type 3 output*/
                        /* not used for type 1 output filter */
  int timeout[FILTERS]; /* used to timeout wait in type 3 output filter */
  int cnt[FILTERS];     /* used to keep track of up and down cnt of rmpcmp */
                        /* should be initialized to zero */
  float gain_ramp_time; /* gain change ramping time in seconds */
} FM_OP_IN;
|
6042
|
Tue Nov 29 18:54:29 2011 |
kiwamu | Update | CDS | c1sus machine up |
[Zach / Kiwamu]
Woke up the c1sus machine in order to lock PSL to MC so that we can observe the effect of not having the EOM heater. |
6048
|
Wed Nov 30 01:35:49 2011 |
Jenne | Update | CDS | OSEM noise / nullstream and what does it mean for satellites |
I'm picking points off of this no-magnet OSEM plot, and I thought I'd write them down somewhere so I don't have to do it again when I lose my sticky note...
1e-2 Hz 1.05e-2 um/rtHz
1e-1 Hz 3.4e-3 um/rtHz
1 Hz 1.3e-3 um/rtHz
10 Hz 2.5e-4 um/rtHz
60 Hz 7.5e-5 um/rtHz
100 Hz 7e-5 um/rtHz
400 Hz 7e-5 um/rtHz |
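A minimal sketch for reusing the points above without the sticky note: it stores them as (frequency, noise) pairs and interpolates linearly in log-log space between neighbours. The interpolation scheme and the function name are my assumptions; the numbers are only the values picked off the plot.

```python
import math

# (frequency in Hz, noise in um/rtHz) picked off the no-magnet OSEM plot.
OSEM_NOISE = [
    (1e-2, 1.05e-2),
    (1e-1, 3.4e-3),
    (1.0, 1.3e-3),
    (10.0, 2.5e-4),
    (60.0, 7.5e-5),
    (100.0, 7e-5),
    (400.0, 7e-5),
]


def osem_noise(freq_hz):
    """Log-log interpolate the tabulated noise at freq_hz (um/rtHz)."""
    if not OSEM_NOISE[0][0] <= freq_hz <= OSEM_NOISE[-1][0]:
        raise ValueError("frequency outside the tabulated range")
    for (f0, n0), (f1, n1) in zip(OSEM_NOISE, OSEM_NOISE[1:]):
        if f0 <= freq_hz <= f1:
            # Fractional position of freq_hz between f0 and f1 on a log axis.
            t = (math.log(freq_hz) - math.log(f0)) / (math.log(f1) - math.log(f0))
            return math.exp(math.log(n0) + t * (math.log(n1) - math.log(n0)))
    raise AssertionError("unreachable: range was checked above")


print(osem_noise(3.0))
```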
6049
|
Wed Nov 30 02:04:26 2011 |
rana, den, jenne, kiwamu, jzweizig | Update | CDS | Filtering Noise issue tracked down ??? |
You can read through all of our past tests to see what didn't work in tracking things down. As Den mentions, there was actually a lot of evidence that there was some double->single precision action in the filter calculation causing the noise we saw.
However, it turns out that this is NOT the case.
This afternoon I was so confused that I enlisted JZ to help us out. He came over and I tried to replicate the error. When looking at the time series, we noticed that it wasn't random noise; the signals seem to be getting clipped as they crossed zero. Sort of like a stiction problem. JZ left to go replicate the error on an offline system.
This turned out to be the important clue. As we examine the code we find this inside of fm10Gen.c:
if((new_hist < 1e-20) && (new_hist > -1e-20)) new_hist = new_hist<0 ? -1e-20: 1e-20;
This line is basically trapping the filter history at 1e-20, to prevent some kind of numerical underflow problem (?). Seems reasonable, except that some filters which have higher order low passing in them actually have an overall scale factor which can be small (even as small as 1e-23, as Den pointed out).
So the reason we saw such weird behavior is that the first filter in SUSPOS is an AC coupling filter. This takes the OSEM signal and removes the large mean value. Then the next filter multiplies it by 1e-23 before doing the filtering and you end up with this noise in the filter history.
I looked and this line is commented out in the new BiQuad code, but as far as I can tell this issue has been around in aLIGO, eLIGO, iLIGO, etc. for a long time and could have been causing many cases of excess noise whenever we ended up with a tiny gain factor in an IIR filter. At the 40m, there are easily a hundred such cases.
For now, I suppose we can just change this number to 1e-40 or so. I don't know how to calculate what the right number should be. Not sure why this underflow is not an issue for the BiQuad, however. |
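A minimal sketch of the trapping effect, assuming a Python transcription of the fm10Gen.c line quoted above. As a simplification it applies the clamp directly to the gain-scaled input rather than to the history of a full IIR section, and the 100-count sine amplitude is an arbitrary illustrative choice.

```python
import math

CLAMP = 1e-20  # threshold used in the quoted fm10Gen.c line


def clamp_history(new_hist):
    """Transcription of the quoted C line: values inside (-1e-20, 1e-20)
    are pushed out to -1e-20 or +1e-20 depending on sign (zero goes to +)."""
    if -CLAMP < new_hist < CLAMP:
        return -CLAMP if new_hist < 0 else CLAMP
    return new_hist


# Overall gain of the combined Cheby module, from the foton file.
gain = 1.117558041371939e-23

# One cycle of a small AC-coupled signal (amplitude is illustrative only).
samples = [100.0 * math.sin(2 * math.pi * k / 16) for k in range(16)]
scaled = [gain * s for s in samples]
clamped = [clamp_history(v) for v in scaled]

for s, c in zip(scaled, clamped):
    print(f"{s: .3e} -> {c: .3e}")
```

Every scaled sample is smaller than 1e-20 in magnitude, so the sine comes out as a square wave at +/-1e-20: the amplitude information is gone and only the sign survives, which is consistent with the clipping at zero crossings seen in the time series.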
6051
|
Wed Nov 30 11:04:26 2011 |
josephb | Update | CDS | Filtering Noise issue tracked down ??? |
Quote: |
For now, I suppose we can just change this number to 1e-40 or so. I don't know how to calculate what the right number should be. Not sure why this underflow is not an issue for the BiQuad, however.
|
According to the RCG SVN logs, the reason it was removed was a more general change done to the compiled code, not specific to just the biquad. Basically, the ability to have an underflow number (subnormal) has been turned off completely by having any number that underflows set to zero. I'm not positive, but from a quick search it looks like the smallest double before hitting underflow (i.e. the smallest normal value) is 2.2250738585072014e-308.
Alex's entry from the SVN log for 2663:
Added new fz_daz() function to turn on two bits in the FPU SSE control register.
Bits FZ (flush underflows to zero) and DOZ (denorms are zeros) are set to
avoid runaway code on float/double denorms (really small numbers).
Ref: http://software.intel.com/en-us/articles/how-to-avoid-performance-penalties-for-gradual-underflow-behavior/
SVN log 2664:
Removed +- 1e-20 limiting code, this is taken care of by setting FZ/DOZ bits
in the CPU SEE control register (see mathInline.h)
SVN log 2665:
Kill the underflows and roll down float denorms to zero,
see fz_doz() in mathInline.h. |
6052
|
Wed Nov 30 11:36:12 2011 |
Den | Update | CDS | Filtering Noise issue tracked down ??? |
Quote: |
if((new_hist < 1e-20) && (new_hist > -1e-20)) new_hist = new_hist<0 ? -1e-20: 1e-20;
|
20 is indeed a random number. We can change it to 300. Alex said that during the IIR filter calculations numbers are sometimes very small, and if they are less than 1e-308 then very slow code in the processor is executed, and this will crash the online system. For single precision this number is 1e-38, and maybe 10 years ago it was not decided for sure what to use, float or double. 20 will be "OK" for both but, as we can see, causes other problems.
Anyway, Alex removed this line from the code and added other code that sets the two proper bits in the MXCSR register and prevents the CPU from running the slow code. As far as I understand, if the numbers are less than 1e-308 they become 0. Roughly, this is equivalent to
if((new_hist < 1e-308) && (new_hist > -1e-308)) new_hist = 0;
This is in 2.4 release. It is in the svn. I think we can install it and figure out if the problem is gone.
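A minimal sketch of the thresholds mentioned above, assuming standard IEEE-754 doubles and singles. It prints the smallest normal values for both formats and emulates in software what flushing subnormals to zero does to a double. The function flush_to_zero is purely illustrative; the real behaviour is set by CPU control bits, not by code like this.

```python
import sys

# Smallest positive normal values; anything smaller in magnitude is subnormal.
DOUBLE_MIN_NORMAL = sys.float_info.min  # 2.2250738585072014e-308
SINGLE_MIN_NORMAL = 2.0 ** -126         # about 1.18e-38


def flush_to_zero(x):
    """Software emulation for doubles: subnormal magnitudes become 0.0."""
    return 0.0 if 0.0 < abs(x) < DOUBLE_MIN_NORMAL else x


print(DOUBLE_MIN_NORMAL, SINGLE_MIN_NORMAL)
print(flush_to_zero(1e-310))  # subnormal, flushed
print(flush_to_zero(1e-23))   # ordinary small number, untouched
```

Note that a value like 1e-23 is nowhere near either threshold, so with flushing in place of the 1e-20 clamp a tiny filter gain passes through unchanged.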
|
6091
|
Thu Dec 8 19:48:23 2011 |
kiwamu | Update | CDS | restarted c1lsc machine and daqd |
Since the c1lsc machine became frozen I restarted the c1lsc machine and daqd.
Then I burtrestored c1lsc, c1ass and c1oaf to this evening. They seem running okay. |
6095
|
Fri Dec 9 15:14:41 2011 |
Den | Update | CDS | release 2.4 |
Alex has created a 2.4 branch of the RCG. Jamie, we can try to compile and install it. As a test I did it for c1oaf; it compiles, installs and runs once the variables SITE, IFO, RCD_LIBRARY_PATH are properly defined. As we do not want to run one model on 2.4 code and the others on 2.1, I recompiled c1oaf back to 2.1. Jamie, please let me know when you are ready to upgrade to the 2.4 release. |
6106
|
Mon Dec 12 13:02:08 2011 |
kiwamu | Update | CDS | daqd restarted |
I have restarted the daqd process at 1:01 PM since I have added some new ALS's daq channels. |
6124
|
Thu Dec 15 11:47:43 2011 |
jamie | Update | CDS | RTS UPGRADE IN PROGRESS |
I'm now in the middle of upgrading the RTS to version 2.4.
All RTS systems will be down until further notice... |
6125
|
Thu Dec 15 22:22:18 2011 |
jamie | Update | CDS | RTS upgrade aborted; restored to previous settings |
Unfortunately, after working on it all day, I had to abort the upgrade and revert the system back to yesterday's state.
I think I got most of the upgrade working, but for some reason I could never get the new models to talk to the framebuilder. Unfortunately, since the upgrade procedure isn't documented anywhere, it was really a fly by the seat of my pants thing. I got some help from Joe, which got me through one road block, but I ultimately got stumped.
I'll try to post a longer log later about what exactly I went through.
In any event, the system is back to the state it was in yesterday, and everything seems to be working. |
6149
|
Mon Dec 26 12:04:41 2011 |
kiwamu | Update | CDS | c1gcy.ini hand edited |
I have edited c1scx.ini by hand in order to acquire some green locking related channels.
Somehow c1sus.ini, c1mcs.ini, c1scx.ini and c1scy.ini are not accessible via the daqconfig script.
As far as I remember it had been accessible via daqconfig a week ago when I edited c1scy.ini.
Anyway I had to edit it by hand. They need to be fixed at some point |
6173
|
Thu Jan 5 09:59:27 2012 |
Jamie | Update | CDS | RTS/RCG/DAQ UPGRADE TO COMMENCE |
RTS/RCG/DAQ UPGRADE TO COMMENCE
I will be attempting (again) to upgrade the RTS, including the RCG and the daqd, to version 2.4 today. The RTS will be offline until further notice. |
6174
|
Thu Jan 5 20:40:21 2012 |
Jamie | Update | CDS | RTS upgrade aborted; restored to previous settings; fb symmetricom card failing? |
After running into more problems with the upgrade, I eventually decided to abort today's upgrade attempt, and revert back to where we were this morning (RTS 2.1). I'll try to follow this with a fuller report explaining what problems I encountered when attempting the upgrade.
However, when Alex and I were trying to figure out what was going wrong in the upgrade, it appears that the fb symmetricom card lost the ability to sync with the GPS receiver. When the symmetricom module is loaded, dmesg shows the following:
[ 285.591880] Symmetricom GPS card on bus 6; device 0
[ 285.591887] PIC BASE 2 address = fc1ff800
[ 285.591924] Remapped 0x17e2800
[ 285.591932] Current time 947125171s 94264us 800ns
[ 285.591940] Current time 947125171s 94272us 600ns
[ 285.591947] Current time 947125171s 94280us 200ns
[ 285.591955] Current time 947125171s 94287us 700ns
[ 285.591963] Current time 947125171s 94295us 800ns
[ 285.591970] Current time 947125171s 94303us 300ns
[ 285.591978] Current time 947125171s 94310us 800ns
[ 285.591985] Current time 947125171s 94318us 300ns
[ 285.591993] Current time 947125171s 94325us 800ns
[ 285.592001] Current time 947125171s 94333us 900ns
[ 285.592005] Flywheeling, unlocked...
Because of this, the daqd doesn't get the proper timing signal, and consequently is out of sync with the timing from the models.
It's completely unclear what caused this to happen. The card seemed to be working all day today, then Alex and I were trying to debug some other (maybe?) timing issues and the symmetricom card all of a sudden stopped syncing to the GPS. We tried rebooting the frame builder and even tried pulling all the power to the machine, but it never came back up. We checked the GPS signal itself and, to the extent that we know what that signal is supposed to look like, it looked ok.
I speculate that this is also the cause of the problems we were seeing earlier in the week. Maybe the symmetricom card has just been acting flaky, and something we did pushed it over the edge.
Anyway, we will try to replace it tomorrow, but Alex is skeptical that we have a replacement of this same card. There may be a newer Spectracom card we can use, but there may be problems using it on the old sun hardware that the fb is currently running on. We'll see.
In the meantime, the daqd is running rogue, off of its own timing. Surprisingly all of the models are currently showing 0x0 status, which means no problems. It doesn't seem to be recording any data, though. Hopefully we'll get it all sorted out tomorrow. |
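A minimal sketch of a check that could be run against dmesg output like the excerpt above: it pulls out the "Current time" readings and flags the "Flywheeling, unlocked" message. Treating the three fields as seconds, microseconds within the second, and nanoseconds within the microsecond is an assumption about the driver's print format, and the function name is made up for this example.

```python
import re

TIME_LINE = re.compile(r"Current time (\d+)s (\d+)us (\d+)ns")


def parse_symmetricom(dmesg_lines):
    """Return (list of readings in ns, True if the card reported being unlocked)."""
    times_ns = []
    unlocked = False
    for line in dmesg_lines:
        match = TIME_LINE.search(line)
        if match:
            sec, usec, nsec = (int(g) for g in match.groups())
            times_ns.append(sec * 10**9 + usec * 1000 + nsec)
        elif "Flywheeling, unlocked" in line:
            unlocked = True
    return times_ns, unlocked


# Lines copied from the dmesg excerpt in this entry.
SAMPLE = [
    "[ 285.591932] Current time 947125171s 94264us 800ns",
    "[ 285.591940] Current time 947125171s 94272us 600ns",
    "[ 285.591947] Current time 947125171s 94280us 200ns",
    "[ 285.592005] Flywheeling, unlocked...",
]

times_ns, unlocked = parse_symmetricom(SAMPLE)
deltas = [b - a for a, b in zip(times_ns, times_ns[1:])]
print(deltas, unlocked)
```

The successive readings advance by a few microseconds, so the card's clock is running; the unlocked flag is what indicates it is not disciplined by the GPS.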
6175
|
Fri Jan 6 01:00:56 2012 |
kiwamu | Update | CDS | c1scx out of sync |
Both the c1scx and its IOP realtime processes became out of sync.
Initially I found that the c1scx didn't show any ADC signals, though the sync sign was green.
Then I software-rebooted the c1iscex machine and then it became out of sync.
For tonight this is fine because I am concentrating on the central part anyway. |
6176
|
Fri Jan 6 11:49:13 2012 |
Jamie | Update | CDS | framebuilder taken offline to diagnose problem with symmetricom timing card |
Alex and I have taken the framebuilder offline to try to see what's wrong with the symmetricom card. We have removed the card from the chassis and Alex has taken it back to Downs to do some more debugging.
We have been formulating some alternate methods to get timing to the fb in case we can't end up getting the card working. |
6177
|
Fri Jan 6 14:31:54 2012 |
Jamie | Update | CDS | framebuilder back online, using NTP time syncronization |
The framebuilder is back online now, minus its symmetricom GPS card. The card seems to have failed entirely, and was not able to be made to work at Downs either. It has been entirely removed from fb.
As a fall back, the system has been made to work off of the system NTP-based time synchronization. The latest symmetricom driver, which is part of the RCG 2.4 branch, will fall back to using local time if the GPS synchronization fails. The new driver was compiled from our local checkout of the 2.4 source in the new to-be-used-in-the-future rtscore directory:
controls@fb ~ 0$ diff {/opt/rtcds/rtscore/branches/branch-2.4/src/drv/symmetricom,/lib/modules/2.6.34.1/kernel/drivers/symmetricom}/symmetricom.ko
controls@fb ~ 0$
The driver was reloaded. daqd was also linked against the last running stable version and restarted:
controls@fb ~ 0$ ls -al $(readlink -f /opt/rtcds/caltech/c1/target/fb/daqd)
-rwxr-xr-x 1 controls controls 6592694 Dec 15 21:09 /opt/rtcds/caltech/c1/target/fb/daqd.20120104
controls@fb ~ 0$
We'll have to keep an eye on the system, to see that it continues to record data properly, and that the fb and the front-ends remain in sync.
The question now is what do we do moving forward. CDS is not supporting the symmetricom cards anymore, and have moved to using Spectracom GPS/IRIG-B cards. However, Downs has neither at the moment. Even if we get a new Spectracom card, it might not work in this older Sun hardware, in which case we might need to consider upgrading the framebuilder to a new machine (one supported by CDS). |
6179
|
Fri Jan 6 20:10:48 2012 |
rana | Update | CDS | framebuilder back online, using NTP time syncronization |
You (Jamie) should talk with Rolf and Alex to find out what framebuilder they will support for > 3 years. Then we should buy that along with the adapter card which allows us to use the same RAID we now have for the frames. |
6182
|
Mon Jan 9 23:52:15 2012 |
kiwamu | Update | CDS | SUS channels not accessible from dataviewer |
[John / Kiwamu]
We found that some of the suspension channels (for example C1:SUS-BS_POS_IN1, etc.) were not accessible from dataviewer for some reason.
So far it seems none of the channels associated with c1sus are accessible from dataviewer. |
6204
|
Tue Jan 17 02:44:59 2012 |
kiwamu | Update | CDS | awg not working |
AWG is not working. This needs to be fixed.
I could set the channel and the parameters in the AWGGUI screen, but it never injects signals into the realtime system. |
6207
|
Tue Jan 17 16:09:20 2012 |
kiwamu | Update | CDS | awg not working on the c1sus machine |
Actually awg works fine without any problems when the excitation channels belong to the c1lsc machine.
It seems that the awg doesn't inject signals on the channels of the c1sus machine, for example C1:SUS-BS_LSC_EXC and so on.
Quote from #6204 |
AWG is not working. This needs to be fixed.
|
|
6319
|
Fri Feb 24 23:14:09 2012 |
kiwamu | Update | CDS | tdsavg went crazy |
I found that the LSCoffset script didn't work today. The script is supposed to null the electrical offsets in all the LSC channels.
I went through the sentences in the script and eventually found that the tdsavg command returns 0 every time.
I thought this was related to the test points, so I ran the following commands to flush all the test point running and the issue was solved.
[term]> diag
[diag]>open
[diag]> diag tp clear *
EDIT, JCD 11June2012: 3rd line there should just be [diag]> tp clear * |
6327
|
Mon Feb 27 19:04:13 2012 |
jamie | Update | CDS | spontaneous timing glitch in c1lsc IO chassis? |
For some reason there appears to have been a spontaneous timing glitch in the c1lsc IO chassis that caused all models running on c1lsc to lose timing sync with the framebuilder. All the models were reporting "0x4000" ("Timing mismatch between DAQ and FE application") in the DAQ status indicator. Looking in the front end logs and dmesg on the c1lsc front end machine I could see no obvious indication why this would have happened. The timing seemed to be hooked up fine, and the indicator lights on the various timing cards were nominal.
I restarted all the models on c1lsc, including and most importantly the c1x04 IOP, and things came back fine. Below is the restart procedure I used. Note I killed all the control models first, since the IOP can't be restarted if they're still running. I then restarted the IOP, followed by all the other control models.
controls@c1lsc ~ 0$ for m in lsc ass oaf; do /opt/rtcds/caltech/c1/scripts/killc1${m}; done
controls@c1lsc ~ 0$ /opt/rtcds/caltech/c1/scripts/startc1x04
c1x04epics C1 IOC Server started
* Stopping IOP awgtpman ... [ ok ]
controls@c1lsc ~ 0$ for m in lsc ass oaf; do /opt/rtcds/caltech/c1/scripts/startc1${m}; done
c1lscepics: no process found
ERROR: Module c1lscfe does not exist in /proc/modules
c1lscepics C1 IOC Server started
* WARNING: awgtpman_c1lsc has not yet been started.
c1assepics: no process found
ERROR: Module c1assfe does not exist in /proc/modules
c1assepics C1 IOC Server started
* WARNING: awgtpman_c1ass has not yet been started.
c1oafepics: no process found
ERROR: Module c1oaffe does not exist in /proc/modules
c1oafepics C1 IOC Server started
* WARNING: awgtpman_c1oaf has not yet been started.
controls@c1lsc ~ 0$
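The ordering used above (kill the control models, restart the IOP, then start the control models) can be sketched as a small helper that only builds the command list and does not run anything. This is a minimal sketch, not an existing tool: the function name restart_commands is hypothetical, and the script paths are simply those seen in the transcript.

```python
from typing import List

# Script directory as it appears in the transcript above.
SCRIPTS_DIR = "/opt/rtcds/caltech/c1/scripts"


def restart_commands(iop: str, models: List[str],
                     scripts_dir: str = SCRIPTS_DIR) -> List[str]:
    """Hypothetical helper: list the commands in the order used above.

    1. kill every control model (the IOP can't be restarted while they run)
    2. restart the IOP
    3. start every control model again
    """
    kill = [f"{scripts_dir}/killc1{m}" for m in models]
    start_iop = [f"{scripts_dir}/start{iop}"]
    start = [f"{scripts_dir}/startc1{m}" for m in models]
    return kill + start_iop + start


if __name__ == "__main__":
    # Print the sequence for c1lsc; nothing is executed.
    for cmd in restart_commands("c1x04", ["lsc", "ass", "oaf"]):
        print(cmd)
```

Printing the list before running anything makes it easy to confirm that the IOP start sits between the kill and start loops.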
|
6392
|
Fri Mar 9 11:59:38 2012 |
Zweizig the ELOG Maven | Summary | CDS | NDS2 restart |
Hi Rana,
It looks like the channel list file has a few blank lines that the channel list reader is choking on. I removed the lines and it is working now. I have made the error message a bit more obvious (it gives the file name and line number) and allowed the reader to ignore empty lines, so this won't cause problems with future versions (when installed). The bottom line is nds2 is now running on mafalda.
Best regards,
John  |
6397
|
Fri Mar 9 20:44:24 2012 |
Jim Lough | Update | CDS | DAQ restart with new ini file |
DAQ reload/restart was performed at about 1315 PST today. The previous ini file was backed up as c1pem20120309.ini in the /chans/daq/working_backups/ directory.
I set the following to record:
The two JIMS channels at 2048:
[C1:PEM-JIMS_CH1_DQ] Persistent version of JIMS channel. When a bit drops to zero, indicating something bad has happened (BLRMS threshold exceeded), the bit stays at zero for >= the value of the persist EPICS variable.
[C1:PEM-JIMS_CH2_DQ] Non-persistent version of JIMS channel.
And all of the BLRMS channels at 256:
Names are of the form:
[C1:PEM-RMS_ACC1_F0p1_0p3_DQ]
[C1:PEM-RMS_ACC1_F0p3_1_DQ]
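For anyone scripting against these names, here is a minimal sketch of pulling the sensor and band edges out of a BLRMS channel name. It assumes that "p" in the name stands for a decimal point (so F0p1_0p3 would be the 0.1 to 0.3 band) and that the edges are in Hz; neither is stated explicitly above. The function parse_blrms_name is hypothetical.

```python
import re
from typing import Tuple

# Pattern based only on the two example names above.
# ASSUMPTION: "p" encodes a decimal point in the band edges.
_BLRMS_RE = re.compile(
    r"^C1:PEM-RMS_(?P<sensor>[A-Z0-9]+)_F(?P<lo>[0-9p]+)_(?P<hi>[0-9p]+)_DQ$"
)


def parse_blrms_name(name: str) -> Tuple[str, float, float]:
    """Hypothetical helper: return (sensor, low_edge, high_edge)."""
    m = _BLRMS_RE.match(name)
    if m is None:
        raise ValueError(f"not a BLRMS channel name: {name!r}")
    lo = float(m.group("lo").replace("p", "."))
    hi = float(m.group("hi").replace("p", "."))
    return m.group("sensor"), lo, hi


if __name__ == "__main__":
    for ch in ("C1:PEM-RMS_ACC1_F0p1_0p3_DQ", "C1:PEM-RMS_ACC1_F0p3_1_DQ"):
        print(ch, parse_blrms_name(ch))
```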
On Monday I intend to look at the weekend seismic data to establish thresholds on the JIMS channels.
256 was the lowest rate possible according to the RCG manual. The JIMS channels are recorded at 2048 because I couldn't figure out how to disable the decimation filter. I will look into this further. |
6404
|
Tue Mar 13 13:28:31 2012 |
Ryan Fisher | Update | CDS | DAQ restart with new ini file |
Extra note: This was the ini file that was edited:
/cvs/cds/rtcds/caltech/c1/chans/daq/C1PEM.ini |
6422
|
Thu Mar 15 08:48:40 2012 |
Ryan | Summary | CDS | Summary of Syracuse Visit to 40m Mar 5-9 2012 |
JIMS Channels in PEM Model
The PEM model has now been modified to include the JIMS (Joint Information Management System) channel processing. Additionally, Jim added test points at the outputs of the BLRMS.
For each seismometer channel, five bands are compared to threshold values to produce boolean results. A band whose RMS is below its threshold produces a bit with value 1; a band above threshold produces 0. These bits are combined into one output channel that contains all of the results.
A persistent version of the channel is generated by a new library block called persist, which holds the value at 0 for a number of time steps equal to an EPICS variable setting, counted from the time the boolean first drops to zero. The persist block allows excursions shorter than the time step of a downsampled time series to be seen reliably.
The EPICS variables for the thresholds are of the form (in order of increasing frequency):
C1:PEM-JIMS_GUR1X_THRES1
C1:PEM-JIMS_GUR1X_THRES2
etc.
The EPICS variables for the persist step size are of the form:
C1:PEM-JIMS_GUR1X_PERSIST
C1:PEM-JIMS_GUR1Y_PERSIST
etc.
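The threshold comparison and the persist behaviour described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code in the PEM model: the function names are hypothetical, the bit ordering (band i goes to bit i) is assumed, an RMS exactly at threshold is treated as 0, and the hold is assumed to start at the step where the boolean first drops.

```python
from typing import List, Sequence


def band_bits(rms_values: Sequence[float], thresholds: Sequence[float]) -> int:
    """Pack one bit per band: 1 if RMS is below threshold, else 0.

    ASSUMPTION: band i is stored in bit i of the output word.
    """
    word = 0
    for i, (rms, thr) in enumerate(zip(rms_values, thresholds)):
        if rms < thr:
            word |= 1 << i
    return word


def persist(bits: Sequence[int], n_steps: int) -> List[int]:
    """Hold a boolean stream at 0 for n_steps after it first drops to 0.

    ASSUMPTION: the hold counts from the step of the 1 -> 0 transition;
    afterwards the output follows the input again.
    """
    out: List[int] = []
    hold = 0   # remaining steps during which the output is forced to 0
    prev = 1
    for b in bits:
        if b == 0 and prev == 1:
            hold = n_steps
        if hold > 0:
            out.append(0)
            hold -= 1
        else:
            out.append(b)
        prev = b
    return out


if __name__ == "__main__":
    # Five bands with made-up values; bands 0, 2 and 3 are below threshold.
    print(bin(band_bits([0.1, 0.5, 0.2, 0.9, 0.3], [0.2, 0.4, 0.3, 1.0, 0.2])))
    # A one-sample excursion stretched to three samples.
    print(persist([1, 0, 1, 1, 1, 1], 3))
```

With a persist setting of 3, a single-sample drop stays visible for three samples, which is the property that lets short excursions survive downsampling.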
The JIMS Channels are being recorded and written to frames:
The two JIMS channels at 2048:
[C1:PEM-JIMS_CH1_DQ] Persistent version of JIMS channel. When a bit drops to zero, indicating something bad has happened (BLRMS threshold exceeded), the bit stays at zero for >= the value of the persist EPICS variable.
[C1:PEM-JIMS_CH2_DQ] Non-persistent version of JIMS channel.
And all of the BLRMS channels at 256:
Names are of the form:
[C1:PEM-RMS_ACC1_F0p1_0p3_DQ]
[C1:PEM-RMS_ACC1_F0p3_1_DQ]
For additional details about the JIMS Channels and the implementation, please see the previous elog entries by Jim.
Conlog
I have a working aLIGO Conlog/EPICS Log installed and running on megatron.
Please see this wiki page for the details of use:
https://wiki-40m.ligo.caltech.edu/aLIGO%20EPICs%20log%20%28conlog%29
I also edited this page with restart instructions for megatron:
https://wiki-40m.ligo.caltech.edu/Computer_Restart_Procedures#megatron
Please see Ryan's previous elog entries for installation details.
Future Work
- Determine useful thresholds for each band
- Generate MEDM Screens for JIMS Channels
- Add a decimation option to channels
- Add EPICS Strings in PEM model to describe bits in JIMS Channels
- Add additional JIMS Channels: Testing additional characterization methods
- Implement a State Log on Megatron: Will Provide a 1Hz index into JIMS Channels
- Generate a single web page that allows access to aLIGO Conlog/EPICS Log and State Log
|
6436
|
Thu Mar 22 16:45:06 2012 |
kiwamu | Update | CDS | c1scx and c1scy not properly running |
It seems that neither c1scx nor c1scy is working properly as their ADC counts are showing digital-zeros.
However, the IOPs, c1gcx and c1gcy appear to be running fine, and according to dmesg the IOPs are successfully recognizing the ADCs.
There is one more confusing fact: c1scx and c1scy are somehow synchronizing to the timing signal.
I restarted the c1scx front end model to see if this helps, but unfortunately it didn't work.
As this is not the top priority concern for now, I am leaving them as they are, with the watchdogs off.
(I may try hardware rebooting them this evening)
Quote from #6434 |
The power was turned back on at 4pm. It took some time for Suresh to restart the computers. We have damping but things are not perfect yet. Auto BURT did not work well.
|
|
6438
|
Thu Mar 22 17:41:15 2012 |
suresh | Update | CDS | c1scx and c1scy not properly running |
Quote: |
It seems that neither c1scx nor c1scy is working properly as their ADC counts are showing digital-zeros.
Quote from #6434 |
The power was turned back on at 4pm. It took some time for Suresh to restart the computers. We have damping but things are not perfect yet. Auto BURT did not work well.
|
|
When Steve and I restarted the c1iscex and c1iscey computers after the power shutdown, the models within them did not start up automatically. I had to start them manually from a terminal in the control room.
I also tried rebooting the FB a couple of times. It did not make any difference.
Manually starting the c1x05, c1scy and c1x01, c1scx models (with the Burt Restore button ON) did not resolve the issue of zeros in the EPICS screens, though it did re-establish timing. |
6439
|
Thu Mar 22 23:43:56 2012 |
Koji | Update | CDS | c1scx and c1scy not properly running |
Did you guys check if the simplant switch is set to "REAL WORLD" mode?
Edit by KI:
Bingo ! The input signals were bypassed to the simplant. I switched the simplant settings to REAL WORLD and now both end suspensions are working fine. |
6540
|
Tue Apr 17 11:05:04 2012 |
Jamie | Update | CDS | CDS upgrade in progress |
I am continuing to attempt to upgrade the CDS system to RTS 2.5. Systems will continue to be up and down for the rest of the day. |
6541
|
Tue Apr 17 19:03:09 2012 |
Jamie | Update | CDS | CDS upgrade in progress |
Upgrade progresses, but not complete. There are some relatively minor issues, and one potentially big issue.
All new software has been installed, including the new epics that supports long channel names.
I've been doing a LOT of cleanup. It was REALLY messy in there.
The new framebuilder/daqd code is running on fb.
Models are compiling with the new RCG and I am able to get them running. Some of them are not compiling for relatively minor reasons (the simulink models need updating). I'm also running into compile problems with IOPs that are using the dolphin drivers.
The major issue is that the framebuilder and the models are not syncing their timing, so there's no data collection. I've spoken to Alex and he and Rolf are going to come over tomorrow to sort it out. It's possible that we're missing timing hardware that the new code is expecting.
There are still some stability issues I haven't sorted out yet, and I have a lot more cleanup to do.
At this rate I'm going to shoot for being done Thursday. |
6546
|
Wed Apr 18 19:59:48 2012 |
Jamie | Update | CDS | CDS upgrade success |
The upgrade is nearly complete:
- new daqd code is running on fb
- the fe/daqd timing issue was resolved by adjusting the GPS offset in the daqdrc. I will document this more later.
- the power outage conveniently rebooted all the front-end machines, so they're all now running new caRepeater
- all models have been successfully recompiled with RCG 2.5 (with only a couple small glitches)
- all new models are running on all front-end machines (with a couple exceptions)
- all suspension models seem to be damping under local control (PRM is having troubles that are likely unrelated to the upgrade).
- a lot of cleanup has been done
Remaining tasks/issues:
- more testing OF EVERYTHING needs to be done
- I did not yet update the DIS dolphin code, so we're running with the old code. I don't think this is a problem, but it would be nice to get us running what they're running at the sites
- I tried to cleanup/simplify how front-end initialization is done. However, there is a problem and models are not auto-starting after reboot. This needs to be fixed.
- the userapps directory is in a new place (/opt/rtcds/userapps). Not everything in the old location was checked into the repository, so we need to check to make sure everything that needs to be is checked in, and that all the models are running the right code.
- the c1oaf model seems to be having a dolphin issue that needs to be sorted
- the c1gfd model causes c1ioo to crash immediately upon being loaded. I have removed it from the rtsystab. That model needs to be fixed.
- general model cleanup is in order.
- more front-end cleanup is needed, particularly in regards to boot-up procedure.
- document the entire upgrade procedure.
I'll finish up these remaining tasks tomorrow. |
6547
|
Wed Apr 18 23:12:49 2012 |
Den | Update | CDS | oaf |
Adaptive filter outputs some non-zero signal in the OFF position

I turned "ON" one of them and c1lsc suspended. I rebooted it and restarted the models on c1lsc and c1sus.
Now it also outputs something non-zero, even though the first line of the adaptive code is "if(OFF) output=0.0; return;". Maybe another version of the code has been compiled.
Edit: An old version (~September) of the code and oaf model is running now. In the 2.1 code there was a link from src/epics/simLink to the oaf code for each DOF. It seems that the 2.5 version looks for models and C code in standard directories. I need to move the working code to the proper directory. |