  40m Log, Page 291 of 341
ID  Date  Author  Type  Category  Subject
  6578   Fri Apr 27 09:00:15 2012 JamieUpdateCDSsus watchdogs?

Why are all the suspension watchdogs tripped?  None of the suspension models are running on c1ioo, so they should be completely unaffected.  Steve, did you find them tripped, or did you shut them off?

In either event they should be safely turned back on.

  6579   Fri Apr 27 09:27:48 2012 DenUpdateCDSsus watchdogs?

Quote:

Why are all the suspension watchdogs tripped?  None of the suspension models are running on c1ioo, so they should be completely unaffected.  Steve, did you find them tripped, or did you shut them off?

In either event they should be safely turned back on.

 I've turned off the coils. Though none of them are on c1ioo, who knows what can happen when we try to run the models again.

  6580   Fri Apr 27 12:12:14 2012 DenUpdateCDSc1ioo is back

Rolf came to the 40m today and managed to figure out what the problem was. Reading dmesg alone was not enough to solve it; the useful log was in

>> cat /opt/rtcds/caltech/c1/target/c1x03/c1x03epics/iocC1.log

Starting iocInit
The CA server's beacon address list was empty after initialization?
iocRun: All initialization complete
sh: iniChk.pl: command not found
Failed to load DAQ configuration file

iniChk.pl checks the .ini file of the model.

>> cat /opt/rtcds/rtscore/release/src/drv/param.c


int loadDaqConfigFile(DAQ_INFO_BLOCK *info, char *site, char *ifo, char *sys)
{

  strcpy(perlCommand, "iniChk.pl ");
  .........
  strcat(perlCommand, fname); // fname - name of the .ini file
  ..........
}

So the problem was not in C1X03.ini. The code could not find the perl script even though it was in the /opt/rtcds/caltech/c1/scripts directory: some environment variables were not set. Rolf added /opt/rtcds/caltech/c1/scripts/ to the $PATH variable and the c1ioo models (x03, ioo, gcv) started successfully. He is not sure whether this is the right way to fix it, because the other machines also do not have the "scripts" directory in their PATH variable.
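The failure mode generalizes: a script can exist on disk but be invisible to anything that calls it by bare name. A minimal shell sketch of the same situation, using a hypothetical script demo.pl and a temp directory standing in for iniChk.pl and the scripts directory:

```shell
# Create a stand-in script in a directory that is not on PATH.
dir=$(mktemp -d)
printf '#!/bin/sh\necho checked\n' > "$dir/demo.pl"
chmod +x "$dir/demo.pl"

# Invoking it by bare name fails, just like "sh: iniChk.pl: command not found".
demo.pl 2>/dev/null || echo "demo.pl: command not found"

# Rolf's fix, in miniature: append the directory to PATH.
export PATH="$PATH:$dir"
demo.pl    # now found; prints "checked"
```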

>> cat /opt/rtcds/caltech/c1/target/c1x03/c1x03epics/iocC1.log

Starting iocInit
The CA server's beacon address list was empty after initialization?
iocRun: All initialization complete

Total count of 'acquire=0' is 2
Total count of 'acquire=1' is 0
Total count of 'acquire=0' and 'acquire=1' is 2

Counted 0 entries of datarate=256     for a total of 0
Counted 0 entries of datarate=512     for a total of 0
Counted 0 entries of datarate=1024     for a total of 0
Counted 0 entries of datarate=2048     for a total of 0
Counted 0 entries of datarate=4096     for a total of 0
Counted 0 entries of datarate=8192     for a total of 0
Counted 0 entries of datarate=16384     for a total of 0
Counted 0 entries of datarate=32768     for a total of 0
Counted 2 entries of datarate=65536     for a total of 131072

Total data rate is 524288 bytes - OK

Total error count is 0
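As a sanity check, the totals in this log are self-consistent if each sample is 4 bytes - an assumption here, since the sample size is not stated; the channel count and rate are from the log itself:

```shell
channels=2
datarate=65536
echo $(( channels * datarate ))        # 131072, the "total" for datarate=65536
echo $(( channels * datarate * 4 ))    # 524288, the reported "Total data rate"
```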

Rolf mentioned an automatic setup of these variables - "kiis" or something like that - probably that script is not working correctly. Rolf will add this problem to his list.

  6582   Mon Apr 30 13:00:50 2012 SureshUpdateCDSFrame Builder is down

Frame builder is down.  PRM has tripped its watchdogs.  I have reset the watchdog on PRM and turned on the OPLEV. It has damped down.  Unable to check what happened since FB is not responding.

There was a minor earthquake yesterday morning which people could feel a few blocks away.  It could have caused the PRM to unlock.

Jamie, Rolf, is it okay for us to restart the FB?

  6583   Mon Apr 30 13:58:25 2012 JamieUpdateCDSFrame Builder is down

Quote:

Frame builder is down.  PRM has tripped its watchdogs.  I have reset the watchdog on PRM and turned on the OPLEV. It has damped down.  Unable to check what happened since FB is not responding.

There was a minor earthquake yesterday morning which people could feel a few blocks away.  It could have caused the PRM to unlock.

Jamie, Rolf, is it okay for us to restart the FB?

 If it's down it's always ok to restart it.  If it doesn't respond or immediately crashes again after restart then it might require some investigation, but it should always be ok to restart it.

  6584   Mon Apr 30 16:56:05 2012 SureshUpdateCDSFrame Builder is down

Quote:

Quote:

Frame builder is down.  PRM has tripped its watchdogs.  I have reset the watchdog on PRM and turned on the OPLEV. It has damped down.  Unable to check what happened since FB is not responding.

There was a minor earthquake yesterday morning which people could feel a few blocks away.  It could have caused the PRM to unlock.

Jamie, Rolf, is it okay for us to restart the FB?

 If it's down it's always ok to restart it.  If it doesn't respond or immediately crashes again after restart then it might require some investigation, but it should always be ok to restart it.

I tried restarting the fb in two different ways.  Neither of them re-established the connection to dtt or epics.

1) I restarted the fb from the control room console with the 'shutdown' command.  No change.

2) I halted the machine with 'shutdown -h now' and restarted it with the hardware reset button on its front-panel. No change.

The console connected to the fb showed that the network file systems did not mount.  Could that have caused several services to fail to start, since they could not find the files stored on the network file system?

The fb is otherwise healthy since I am able to ssh into it and browse the directory structure.

  6586   Mon Apr 30 20:43:33 2012 SureshUpdateCDSFrame Builder is down

Quote:

Quote:

Quote:

Frame builder is down.  PRM has tripped its watchdogs.  I have reset the watchdog on PRM and turned on the OPLEV. It has damped down.  Unable to check what happened since FB is not responding.

There was a minor earthquake yesterday morning which people could feel a few blocks away.  It could have caused the PRM to unlock.

Jamie, Rolf, is it okay for us to restart the FB?

 If it's down it's always ok to restart it.  If it doesn't respond or immediately crashes again after restart then it might require some investigation, but it should always be ok to restart it.

I tried restarting the fb in two different ways.  Neither of them re-established the connection to dtt or epics.

1) I restarted the fb from the control room console with the 'shutdown' command.  No change.

2) I halted the machine with 'shutdown -h now' and restarted it with the hardware reset button on its front-panel. No change.

The console connected to the fb showed that the network file systems did not mount.  Could that have caused several services to fail to start, since they could not find the files stored on the network file system?

The fb is otherwise healthy since I am able to ssh into it and browse the directory structure.

[Mike, Rana]

The fb is okay.  Rana found that it works on Pianosa, but not on Allegra or Rossa.  It also works on Rosalba, on which Jamie recently installed Ubuntu.

The white fields on the medm 'Status' screen for fb are an unrelated problem.

  6591   Tue May 1 08:18:50 2012 JamieUpdateCDSFrame Builder is down

Quote:

 

I tried restarting the fb in two different ways.  Neither of them re-established the connection to dtt or epics.

 Please be conscious of what components are doing what.  The problem you were experiencing was not "frame builder down".  It was "dtt not able to connect to frame builder".  Those are potentially completely different things.  If the front end status screens show that the frame builder is fine, then it's probably not the frame builder.

Also "epics" has nothing whatsoever to do with any of this.  That's a completely different set of stuff, unrelated to DTT or the frame builder.

  6599   Thu May 3 19:52:43 2012 JenneUpdateCDSOutput errors from dither alignment (Xarm) script

epicsThreadOnceOsd epicsMutexLock failed.
Segmentation fault
Number found where operator expected at -e line 1, near "0 0"
        (Missing operator before  0?)
syntax error at -e line 1, near "0 0"
Execution of -e aborted due to compilation errors.
Number found where operator expected at (eval 1) line 1, near "*  * 50"
        (Missing operator before  50?)
epicsThreadOnceOsd epicsMutexLock failed.
Segmentation fault
Number found where operator expected at -e line 1, near "0 0"
        (Missing operator before  0?)
syntax error at -e line 1, near "0 0"
Execution of -e aborted due to compilation errors.
syntax error at -e line 1, at EOF
Execution of -e aborted due to compilation errors.
Number found where operator expected at (eval 1) line 1, near "*  * 50"
        (Missing operator before  50?)
epicsThreadOnceOsd epicsMutexLock failed.
status : 0
I am going to execute the following commands
ezcastep -s 0.6 C1:SUS-ETMX_PIT_COMM +-0,50
ezcastep -s 0.6 C1:SUS-ITMX_PIT_COMM +,50
ezcastep -s 0.6 C1:SUS-ETMX_YAW_COMM +,50
ezcastep -s 0.6 C1:SUS-ITMX_YAW_COMM +-0,50
ezcastep -s 0.6 C1:SUS-BS_PIT_COMM +0,50
ezcastep -s 0.6 C1:SUS-BS_YAW_COMM +0,50
hit a key to execute the commands above

  6609   Sun May 6 00:11:00 2012 DenUpdateCDSmx_stream

The c1sus and c1iscex computers could not connect to the framebuilder. I restarted it, which did not help. Then I restarted the mx_stream daemon on each of the computers, and this fixed the problem.

sudo /etc/init.d/mx_stream restart

  6616   Mon May 7 21:05:38 2012 DenUpdateCDSbiquad filter form

I wanted to switch the implementation of IIR_FILTER from DIRECT FORM II to BIQUAD form in the C1IOO and C1SUS models. I modified the RCG file /opt/rtcds/rtscore/release/src/fe/controller.c by adding a #define CORE_BIQUAD line:

#ifdef OVERSAMPLE
#define CORE_BIQUAD      
#if defined(CORE_BIQUAD)

The C1IOO model compiled, installed and is running now. The C1SUS model compiled, but during installation I got an error:

controls@c1sus ~ 0$ rtcds install c1sus


Installing system=c1sus site=caltech ifo=C1,c1
Installing /opt/rtcds/caltech/c1/chans/C1SUS.txt
Installing /opt/rtcds/caltech/c1/target/c1sus/c1susepics
Installing /opt/rtcds/caltech/c1/target/c1sus
Installing start and stop scripts
/opt/rtcds/caltech/c1/scripts/killc1sus
Performing install-daq
Updating testpoint.par config file
/opt/rtcds/caltech/c1/target/gds/param/testpoint.par
/opt/rtcds/rtscore/branches/branch-2.5/src/epics/util/updateTestpointPar.pl -par_file=/opt/rtcds/caltech/c1/target/gds/param/archive/testpoint_120507_205359.par -gds_node=21 -site_letter=C -system=c1sus -host=c1sus
Installing GDS node 21 configuration file
/opt/rtcds/caltech/c1/target/gds/param/tpchn_c1sus.par
Installing auto-generated DAQ configuration file
/opt/rtcds/caltech/c1/chans/daq/C1SUS.ini
Installing EDCU ini file
/opt/rtcds/caltech/c1/chans/daq/C1EDCU_SUS.ini
Installing Epics MEDM screens
Running post-build script

ERROR: Could not find file: test.py
Searched path: :/opt/rtcds/userapps/release/cds/c1/scripts:/opt/rtcds/userapps/release/cds/common/scripts:/opt/rtcds/userapps/release/isc/c1/scripts:/opt/rtcds/userapps/release/isc/common/scripts:/opt/rtcds/userapps/release/sus/c1/scripts:/opt/rtcds/userapps/release/sus/common/scripts:/opt/rtcds/userapps/release/psl/c1/scripts:/opt/rtcds/userapps/release/psl/common/scripts
Exiting
make: *** [install-c1sus] Error 1

Jamie, what is this test.py?

  6618   Mon May 7 21:46:10 2012 DenUpdateCDSguralp signal error

GUR1_XYZ_IN1 and GUR2_XYZ_IN1 are the same and equal to GUR2_XYZ.  This is bad since GUR1_XYZ_IN1 should be equal to GUR1_XYZ.  Note that GUR#_XYZ are copies of GUR#_XYZ_OUT, so there may be (although there isn't right now) filtering between the _IN1's and the _OUT's.  But certainly GUR1 should look like GUR1, not GUR2!!!

Looks like a CDS problem - maybe some channel-hopping going on? I'm trying a restart of the c1sus computer right now, to see if that helps.

Figure:  Green and red should be the same, yellow and blue should be the same.  Note however that green matches yellow and blue, not red.  Bad.

guralps.png

  6619   Mon May 7 22:39:37 2012 DenUpdateCDSc1sus

[Jenne, Den]

We decided to reboot the C1SUS machine in the hope that this would fix the problem with the seismic channels. After the reboot the machine could not connect to the framebuilder. We restarted mx_stream but this did not help. Then we manually executed

/opt/rtcds/caltech/c1/target/fb/mx_stream -s c1x02 c1sus c1mcs c1rfm c1pem -d fb:0 -l /opt/rtcds/caltech/c1/target/fb/mx_stream_logs/c1sus.log

but c1sus still could not connect to fb. This script returned the following error:

controls@c1sus ~ 128$ cat /opt/rtcds/caltech/c1/target/fb/mx_stream_logs/c1sus.log


c1x02
c1sus
c1mcs
c1rfm
c1pem
mmapped address is 0x7fb5ef8cc000
mapped at 0x7fb5ef8cc000
mmapped address is 0x7fb5eb8cc000
mapped at 0x7fb5eb8cc000
mmapped address is 0x7fb5e78cc000
mapped at 0x7fb5e78cc000
mmapped address is 0x7fb5e38cc000
mapped at 0x7fb5e38cc000
mmapped address is 0x7fb5df8cc000
mapped at 0x7fb5df8cc000
send len = 263596
OMX: Failed to find peer index of board 00:00:00:00:00:00 (Peer Not Found in the Table)
mx_connect failed

Looks like a CDS error. We are leaving the WATCHDOGS OFF for the night.

  6622   Tue May 8 09:47:53 2012 JamieUpdateCDSbiquad filter form

Quote:

I wanted to switch the implementation of IIR_FILTER from DIRECT FORM II to BIQUAD form in C1IOO and C1SUS models. I modified RCG file /opt/rtcds/rtscore/release/src/fe/controller.c by adding #define CORE_BIQUAD line:

#ifdef OVERSAMPLE
#define CORE_BIQUAD      
#if defined(CORE_BIQUAD)

 I am really not ok with anyone modifying controller.c.  If we're going to be messing around with that we need to change procedure significantly.  This is the code that runs all the models, and we don't currently have any way to track changes in the code.

Did you change it back?  If not, do so immediately and stop messing with it.  Please consult with us first before embarking on these kinds of severe changes to our code.  This is the kind of shit that other people have done that has bit us in the ass in the past.

Furthermore, there is already a way to enable biquad filters in the new version without modifying the RCG source.  All you need to do is set biquad=1 in the cdsParameters block for your model.

DO NOT MESS WITH CONTROLLER.C!
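For reference, cdsParameters is a plain-text block in the model itself; a sketch of what it might contain (only the biquad=1 line is confirmed by this entry - the other lines are typical examples, not the actual c1sus values):

```
site=caltech
rate=16K
dcuid=36
biquad=1
```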

  6623   Tue May 8 09:58:17 2012 DenUpdateCDSSUS -> FB

 [Alex, Den] 

It was in vain to restart mx_stream yesterday as C1SUS did not see FB

controls@c1sus ~ 0$ /opt/open-mx/bin/omx_info 

Open-MX version 1.3.901
 build: root@fb:/root/open-mx-1.3.901 Wed Feb 23 11:13:17 PST 2011
Found 1 boards (32 max) supporting 32 endpoints each:
 c1sus:0 (board #0 name eth1 addr 00:25:90:06:59:f3)
   managed by driver 'igb'
Peer table is ready, mapper is 00:60:dd:46:ea:ec
================================================
  0) 00:25:90:06:59:f3 c1sus:0
  1) 00:60:dd:46:ea:ec fb:0                           // this line was missing
  2) 00:14:4f:40:64:25 c1ioo:0
  3) 00:30:48:be:11:5d c1iscex:0
  4) 00:30:48:bf:69:4f c1lsc:0
  5) 00:30:48:d6:11:17 c1iscey:0
 
At the same time FB saw C1SUS:
 
controls@fb ~ 0$ /opt/mx/bin/mx_info
 
MX Version: 1.2.12
MX Build: root@fb:/root/mx-1.2.12 Mon Nov  1 13:34:38 PDT 2010
1 Myrinet board installed.
The MX driver is configured to support a maximum of:
8 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
===================================================================
Instance #0:  299.8 MHz LANai, PCI-E x8, 2 MB SRAM, on NUMA node 0
Status: Running, P0: Link Up
Network: Ethernet 10G

MAC Address: 00:60:dd:46:ea:ec
Product code: 10G-PCIE-8AL-S
Part number: 09-03916
Serial number: 352143
Mapper: 00:60:dd:46:ea:ec, version = 0x00000000, configured
Mapped hosts: 6

                                                        ROUTE COUNT
INDEX    MAC ADDRESS     HOST NAME                        P0
-----    -----------     ---------                        ---
   0) 00:60:dd:46:ea:ec fb:0                              1,0
   1) 00:30:48:d6:11:17 c1iscey:0                         1,0
   2) 00:30:48:be:11:5d c1iscex:0                         1,0
   3) 00:30:48:bf:69:4f c1lsc:0                           1,0
   4) 00:25:90:06:59:f3 c1sus:0                           1,0
   5) 00:14:4f:40:64:25 c1ioo:0                           1,0
 
For that reason, when I restarted mx_stream on c1sus, the script tried to connect to the default 00:00:00:00:00:00 address, as the true address was not specified.
 
Alex restarted mx on FB. Note that you cannot do this while the DAQD process is running, and you can't just kill it, because it will be restarted automatically. Instead, open /etc/inittab and change respawn to stop in the line
 
daq:345:respawn:/opt/rtcds/caltech/c1/target/fb/start_daqd.inittab
 
then make init reread the inittab with init q, and restart mx on the FB
 
controls@fb ~ 0$ sudo /sbin/init q
controls@fb ~ 0$ sudo /etc/init.d/mx restart
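Concretely, the one-word /etc/inittab edit described above looks like this (the line is from this entry; it should presumably be changed back to respawn once mx is restarted):

```
# before: init respawns daqd whenever it dies
daq:345:respawn:/opt/rtcds/caltech/c1/target/fb/start_daqd.inittab
# after: daqd stays down so mx can be restarted
daq:345:stop:/opt/rtcds/caltech/c1/target/fb/start_daqd.inittab
```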
 

After that C1SUS started to communicate with FB. But Alex does not know why this happened, or how to prevent it in the future.

Restarting the DAQD process (or maybe C1SUS) also solved the problem with the Guralp channels; now they are fine. Again, why this happened is unknown.

 

  6624   Tue May 8 10:43:42 2012 DenUpdateCDSbiquad filter form

Quote:

Quote:

I wanted to switch the implementation of IIR_FILTER from DIRECT FORM II to BIQUAD form in the C1IOO and C1SUS models. I modified the RCG file /opt/rtcds/rtscore/release/src/fe/controller.c by adding a #define CORE_BIQUAD line:

#ifdef OVERSAMPLE
#define CORE_BIQUAD      
#if defined(CORE_BIQUAD)

 I am really not ok with anyone modifying controller.c.  If we're going to be messing around with that we need to change procedure significantly.  This is the code that runs all the models, and we don't currently have any way to track changes in the code.

Did you change it back?  If not, do so immediately and stop messing with it.  Please consult with us first before embarking on these kinds of severe changes to our code.  This is the kind of shit that other people have done that has bit us in the ass in the past.

Furthermore, there is already a way to enable biquad filters in the new version without modifying the RCG source.  All you need to do is set biquad=1 in the cdsParameters block for your model.

DO NOT MESS WITH CONTROLLER.C!

 ok

  6625   Tue May 8 16:43:15 2012 JenneUpdateCDSDegenerate channels, potentially a big mess

Rana theorized that we're having problems with the MC error signal in the OAF model (separate elog by Den to follow) because we've named a channel "C1:IOO-MC_F", and such a channel already used to exist.  So, Rana and I went out to do some brief cable tracing.

MC Servo Board has 3 outputs that are interesting:  "DAQ OUT" which is a 4-pin LEMO, "SERVO OUT" which is a 2-pin LEMO, and "OUT1", which is a BNC->2pin LEMO right now.

DAQ OUT should have the actual MC_F signal, which goes through to the laser's PZT.  This is the signal that we want to be using for the OAF model.

SERVO OUT should be a copy of this actual MC_F signal going to the laser's PZT.  This is also acceptable for use with the OAF model.

OUT1 is a monitor of the slow(er) MC_L signal, which used to be fed back to the MC2 suspension.  We want to keep this naming convention, in case we ever decide to go back and feed back to the suspensions for freq. stabilization.

Right now, OUT1 is going to the first channel of ADC0 on c1ioo.  SERVOout is going to the 7th channel on ADC0.  DAQout is going to the ~12th channel of ADC1 on c1ioo.  OUT1 and SERVOout both go to the 2-pin LEMO whitening board, which goes to some new aLIGO-style ADC breakout boards with ribbon cables, which then goes to ADC0.  DAQout goes to the 4pin LEMO ADC breakout, (J7 connector) which then directly goes to ADC1 on c1ioo.

So, to sum up, OUT1 should be "adc0_0" in the simulink model, SERVOout should be "adc0_6" on the simulink model, and DAQout should be "adc1_12" (or something....I always get mixed up with the channel counting on 4pin ADC breakout / AA boards). 

In the current simulink setup, OUT1 (adc0_0) is given the channel name C1:IOO-MC_F, and is fed to the OAF model.  We need to change it to C1:IOO-MC_L to be consistent with the old regime.

In the current simulink setup, SERVOout (adc0_6) is given the channel name C1:IOO-MC_SERVO.  It should be called C1:IOO-MC_F, and should go to the OAF model.

In the current simulink setup, DAQout (~adc1_12) doesn't go anywhere.  It's completely not in the system.  Since the cable in the back of this AA / ADC breakout board box goes directly to the c1ioo I/O chassis, I don't think we have a degenerate MC_F naming situation.  We've incorrectly labeled MC_L as MC_F, but we don't currently have 2 signals both called MC_F.

Okay, that doesn't explain precisely why we see funny business with the OAF model's version of MCL, but I think it goes in the direction of ruling out a degenerate MC_F name.

Problem:  If you look at the screen cap, both simulink models are running on the same computer (c1ioo), so when they both refer to ADC0, they're really referring to the same physical card.  Both of these models have adc0_6 defined, but they're defined as completely different things.  Since we can trace / see the cable going from the MC Servo Board to the whitening card, I think the MC_SERVO definition is correct.  Which means that this Green_PH_ADC is not really what it claims to be.  I'm not sure what this channel is used for, but I think we should be very cautious and look into this before doing any more green locking.  It would be dumb to fail because we're using the wrong signals.

 

  6626   Tue May 8 17:48:50 2012 JenneUpdateCDSOAF model not seeing MCL correctly

Den noticed this, and will write more later, I just wanted to sum up what Alex said / did while he was here a few minutes ago....

 

Errors are probably really happening.  c1oaf computer status 4-bit thing: GRGG.  The red bit indicates receiving errors.  The oaf model is probably doing a sample-and-hold thing: sampling every time (~1 or 2 times per sec) it gets a successful receive, and then holding that value until it gets another successful receive.
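The suspected behavior can be sketched as follows (a guess at the logic, with made-up data - not the actual RCG code):

```shell
# values and receive flags (1 = successful receive); made-up data
values="10 20 30 40"
flags="1 0 0 1"
held=0
out=""
set -- $flags
for v in $values; do
    ok=$1; shift
    [ "$ok" = 1 ] && held=$v      # successful receive: take the new value
    out="$out $held"              # failed receive: repeat the held value
done
echo $out    # -> 10 10 10 40 (only the 1st and 4th receives succeeded)
```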

Den is adding EPICS channels to record the ERR out of the PCIE dolphin memory CDS_PART, so that we can see what the error is, not just that one happened.

Alex restarted oaf model:  sudo rmmod c1oaf.ko, sudo insmod c1oaf.ko .  Clicked "diag reset" on oaf cds screen several times, nothing changed.  Restarted c1oaf again, same rmmod, insmod commands.

Den, Alex and I went into the IFO room, and looked at the LSC computer, SUS computer, SUS I/O chassis, LSC I/O chassis and the dolphin switch that is on the back of the rack, behind the SUS IO chassis.  All were blinking happily, none showed symptoms of errors.

Alex restarted the IOP process:  sudo rmmod c1x04, sudo insmod c1x04.  Chans on dataviewer still bad, so this didn't help, i.e. it wasn't just a synchronization problem.  oaf status: RRGG. lsc status: RGGG. ass status: RGGG.

sudo insmod c1lsc.ko, sudo insmod c1ass.ko, sudo insmod c1oaf.ko .  oaf status: GRGG. lsc status: GGGG. ass status: GGGG.  This probably means that lsc needs to send something to oaf, which works now that lsc is restarted, although oaf is still not receiving happily.

Alex left to go talk to Rolf again, because he's still confused.

Comment, while writing elog later:  c1rfm status is RRRG, c1sus status is RRGG, c1oaf status is GRGG, both c1scy and c1scx are RGRG.  All others are GGGG.

  6627   Wed May 9 00:45:13 2012 JenneUpdateCDSNo signals for DTT from SUS

Upgrades suck.  Or at least making everything work again after the upgrade does.

On the to-do list tonight:  look at OSEM sensor and OpLev spectra for PRM, when PRMI is locked and unlocked.  Goal is to see if the PRM is really moving wildly ("crazy" as Kiwamu always described it) when it's nicely aligned and PRMI is locked, or if it's an artifact of lever arm between PRM and the cameras (REFL and AS).

However, I can't get signals on DTT.  So far I've checked a bunch of signals for SUS-PRM, and they all either (a) are just digital 0 or (b) are ADC noise.  Lame.

Steve's elog 5630 shows what reasonable OpLev spectra should look like:  exactly what you'd expect.

Attached below is a small sampling of different SUS-PRM signals.  I'm going to check some other optics, other models on c1sus, etc, to see if I can narrow down where the problem is.  LSC signals are fine (I checked AS55Q, for example).

UPDATE:  SRM channels show the same ADC noise.  MC1 channels are totally fine.  And Den had been looking at channels on the RFM model earlier today, which were fine.

ETMY channels - C1:SUS-ETMY_LLCOIL_IN1 and C1:SUS-ETMY_SUSPOS_IN1 both returned "unable to obtain measurement data".  OSEM sensor channels and OpLev _PERROR channel were digital zeros.

ETMX channels were fine

UPDATE UPDATE:  Genius me just checked the FE status screen again.  It was fine ~an hour ago when I sat down to start interferometer-izing for the night, but now the SUS model and both of the ETMY computer models are having problems connecting to the fb.  *sigh* 

  6628   Wed May 9 01:14:44 2012 JenneUpdateCDSNo signals for DTT from SUS

Quote:

UPDATE UPDATE:  Genius me just checked the FE status screen again.  It was fine ~an hour ago when I sat down to start interferometer-izing for the night, but now the SUS model and both of the ETMY computer models are having problems connecting to the fb.  *sigh* 

 Restarted SUS model - it's now happy. 

c1iscey is much less happy - neither the IOP nor the scy model is willing to talk to fb.  I might give up on them after another few minutes and wait for some daytime support, since I wanted to do DRMI stuff tonight.

Yeah, giving up now on c1iscey (Jamie....ideas are welcome).  I can lock just fine, including the Yarm, I just can't save data or see data about ETMY specifically.  But I can see LSC data, so I can lock, and I can now take spectra of corner optics.

  6630   Wed May 9 08:21:42 2012 JamieUpdateCDSNo signals for DTT from SUS

Quote:

 c1iscey is much less happy - neither the IOP nor the scy model is willing to talk to fb.  I might give up on them after another few minutes and wait for some daytime support, since I wanted to do DRMI stuff tonight.

Yeah, giving up now on c1iscey (Jamie....ideas are welcome).  I can lock just fine, including the Yarm, I just can't save data or see data about ETMY specifically.  But I can see LSC data, so I can lock, and I can now take spectra of corner optics.

 This is the mx_stream issue reported previously.  The symptom is that all models on a single front end lose contact with the frame builder, as opposed to *all* models on all front ends losing contact with the frame builder.  That indicates that the problem is a common fb communication issue on that single front end, and that's all handled by mx_stream.

ssh'ing into c1iscey and running "sudo /etc/init.d/mx_stream restart" fixed the problem.

  6632   Wed May 9 10:46:54 2012 DenUpdateCDSOAF model not seeing MCL correctly

Quote:

Den noticed this, and will write more later, I just wanted to sum up what Alex said / did while he was here a few minutes ago....

 From my point of view, during the rfm -> oaf transmission through Dolphin we lose a significant part of the signal. To check that, I've created an MEDM screen to monitor the transmission errors in the OAF model. It shows how many errors occur per second. For the MCL channel this number turned out to be 2046 +/- 1. This makes sense to me: the sampling rate is 2048 Hz, so we actually receive only 1-3 data points per second. We can see this in the dataviewer.

C1:OAF-MCL_IN follows C1:IOO-MC_F in the sense that the scales of the two signals are the same in two states: MC locked and unlocked. It seems that we lose 2046 out of 2048 points per second.
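The arithmetic behind that estimate, with the numbers from this entry:

```shell
rate_hz=2048    # MCL sampling rate
errs=2046       # measured transmission errors per second, +/- 1
echo $(( rate_hz - errs ))   # 2 good points/s, i.e. 1-3 given the +/- 1 spread
```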

oaf_rec.png

  6633   Wed May 9 11:31:50 2012 DenUpdateCDSRFM

I added PCIE memory cache flushing to c1rfm model by changing 0 to 1 in /opt/rtcds/rtscore/release/src/fe/commData2.c on line 159, recompiled and restarted c1rfm.

Jamie, do not be mad at me, Alex told me to do that!

However, this did not help; C1RFM did not start. I decided to restart all models on the C1SUS machine in the hope that C1RFM depends on some other models and can't connect to them, but this suspended the C1SUS machine. After the reboot I encountered the same C1SUS -> FB communication error and fixed it in the same way as in the previous C1SUS reboot. This has now happened both times (out of 2) after a C1SUS machine reboot.

I changed /opt/rtcds/rtscore/release/src/fe/commData2.c back, recompiled and restarted c1rfm. Now everything is back. C1RFM -> C1OAF is still bad.

  6634   Wed May 9 14:32:31 2012 JenneUpdateCDSBurt restored

Den and Alex left things not burt-restored, and Den mentioned to me that it might need doing.

I burt restored all of our epics.snaps to the 1am today snapshot.  We lost a few hours of striptool trends on the projector, but now they're back (things like the BLRMS don't work if the filters aren't engaged on the PEM model, so it makes sense).

  6635   Wed May 9 15:02:50 2012 DenUpdateCDSRFM

Quote:

However, this did not help; C1RFM did not start. I decided to restart all models on the C1SUS machine in the hope that C1RFM depends on some other models and can't connect to them, but this suspended the C1SUS machine.

 This happened because of a code bug -

// If PCIE comms show errors, may want to add this cache flushing
#if 1
    if (ipcInfo[ii].netType == IPCIE)
        clflush_cache_range(&(ipcInfo[ii].pIpcData->dBlock[sendBlock][ipcIndex].data), 16); // & was missing - Alex fixed this
#endif
 

After this bug was fixed and the code recompiled, C1:OAF-MCL_IN is OK; no errors occur during the transmission (C1:OAF-MCL_ERR = 0).

So the problem was in the PCIE card, which could not send that amount of data, and the last channel (MCL is the last) was corrupted. Now that Alex has added cache flushing, the problem is fixed.

We should pay more attention to such problems. This time 2046 out of 2048 points per second were lost. But if only 10-20 points were lost, we would not notice it in the dataviewer, yet it would still cause problems.
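Putting that in numbers (the 10-20 point case is the hypothetical one from this entry; awk is used for the division):

```shell
# This incident: nearly everything lost - obvious in the dataviewer.
awk 'BEGIN { printf "%.1f%%\n", 2046 / 2048 * 100 }'   # 99.9%
# A mild case: well under 1% lost - invisible by eye, still corrupting.
awk 'BEGIN { printf "%.2f%%\n", 20 / 2048 * 100 }'     # 0.98%
```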

  6639   Thu May 10 22:05:21 2012 DenUpdateCDSFB

Already for the second time today, all the computers have lost their connection to the framebuilder. When I ssh'd to the framebuilder, the DAQD process was not running. I started it:

controls@fb ~ 130$ sudo /sbin/init q

But I do not know what causes this problem. Maybe this is a memory issue. On FB:

Mem:   7678472k total,  7598368k used,    80104k free

Practically all memory is used. If more is needed and swap is off, the DAQD process may die.
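The numbers above as a fraction (a quick check; see also the follow-up entry's point that Linux normally keeps nearly all memory "used" as cache):

```shell
total=7678472; used=7598368; free=80104          # kB, from the output above
[ $(( used + free )) -eq "$total" ] && echo "adds up"
awk -v f="$free" -v t="$total" 'BEGIN { printf "%.1f%% free\n", f / t * 100 }'  # 1.0% free
```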

  6640   Fri May 11 08:07:30 2012 JamieUpdateCDSFB

Quote:

Already for the second time today, all the computers have lost their connection to the framebuilder. When I ssh'd to the framebuilder, the DAQD process was not running. I started it:

controls@fb ~ 130$ sudo /sbin/init q

Just to be clear, "init q" does not start the framebuilder.  It just tells the init process to reparse /etc/inittab.  And since init is supposed to be configured to restart daqd when it dies, it restarted daqd after the inittab was reloaded.  You and Alex must have forgotten to change the inittab back (stop -> respawn) after you modified it while trying to fix daqd last week.

daqd is known to crash without reason.  It usually just goes unnoticed because init always restarts it automatically.  But we've known about this problem for a while.

Quote:

But I do not know what causes this problem. Maybe this is a memory issue. On FB:

Mem:   7678472k total,  7598368k used,    80104k free

Practically all memory is used. If more is needed and swap is off, DAQD process may die.

This doesn't really mean anything, since the computer always ends up using all available memory.  It doesn't indicate a lack of memory.  If the machine is really running out of memory you would see lots of ugly messages in dmesg.

  6654   Mon May 21 21:27:39 2012 yutaUpdateCDSMEDM suspension screens using macro

Background:
 We need more organized MEDM screens. Let's use macro.

What I did:
1. Edited /opt/rtcds/userapps/trunk/sus/c1/medm/templates/SUS_SINGLE.adl using replacements below;

sed -i s/#IFO#SUS_#PART_NAME#/'$(IFO)$(SYS)_$(OPTIC)'/g SUS_SINGLE.adl
sed -i s/#IFO#SUS#_#PART_NAME#/'$(IFO)$(SYS)_$(OPTIC)'/g SUS_SINGLE.adl
sed -i s/#IFO#:FEC-#DCU_ID#/'$(IFO):FEC-$(DCU_ID)'/g SUS_SINGLE.adl
sed -i s/#CHANNEL#/'$(IFO):$(SYS)-$(OPTIC)'/g SUS_SINGLE.adl
sed -i s/#PART_NAME#/'$(OPTIC)'/g SUS_SINGLE.adl

2. Edited sitemap.adl so that they open SUS_SINGLE.adl with arguments like
    IFO=C1,SYS=SUS,OPTIC=MC1,DCU_ID=36
instead of opening ./c1mcs/C1SUS_MC1.adl.

3. I also fixed white blocks in the LOCKIN part.
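The effect of the macro substitution can be sketched like this (this is not the real MEDM code, and the channel name below is illustrative):

```python
# Short sketch of the substitution MEDM performs on the $(...) macros
# that the sed commands above put into SUS_SINGLE.adl.
def expand_macros(text, macros):
    for name, value in macros.items():
        text = text.replace("$(%s)" % name, value)
    return text

line = '$(IFO):$(SYS)-$(OPTIC)_SUSPOS_INMON'   # illustrative channel name
print(expand_macros(line, {"IFO": "C1", "SYS": "SUS", "OPTIC": "MC1"}))
# -> C1:SUS-MC1_SUSPOS_INMON
```

With one template screen, passing OPTIC=MC2 instead of OPTIC=MC1 is all it takes to get the screen for another suspension.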

Result:
  Now you don't have to generate every suspension screen individually. Just edit SUS_SINGLE.adl.

Things to do:
 - fix all other MEDM screens which open suspension screens, so that they open SUS_SINGLE.adl
 - make SUS_SINGLE.adl more cool

  6655   Tue May 22 00:23:45 2012 DenUpdateCDStransmission error monitor

I've started to create channels and an MEDM screen to monitor the errors that occur during transmission through the RFM model. The screen will show the amount of lost data per second for each channel.

Not all channels are ready yet. For the channels already created, the number of errors is 0, which is good.

 Screenshot.png

  6657   Tue May 22 11:32:02 2012 JamieUpdateCDSMEDM suspension screens using macro

Very nice, Yuta!  Don't forget to commit your changes to the SVN.  I took the liberty of doing that for you.  I also tweaked the file a bit, so we don't have to specify IFO and SYS, since those aren't going to ever change.  So the arguments are now only: OPTIC=MC1,DCU_ID=36.  I updated the sitemap accordingly.

Yuta, if you could go ahead and modify the calls to these screens in other places that would be great.  The WATCHDOG, LSC_OVERVIEW, MC_ALIGN screens are ones that immediately come to mind.

And also feel free to make cool new ones.  We could try to make simplified version of the suspension screens now being used at the sites, which are quite nice.

  6658   Tue May 22 11:45:12 2012 JamieConfigurationCDSPlease remember to commit SVN changes

Hey, folks.  Please remember to commit all changes to the SVN in a timely manner.  If you don't, multiple commits will get lumped together and we won't have a good log of the changes we're making.  You might also end up losing all of your work.  SVN COMMIT when you're done!  But please don't commit broken or untested code.

pianosa:release 0> svn status | grep -v '^?'
M       cds/c1/models/c1rfm.mdl
M       sus/c1/models/c1mcs.mdl
M       sus/c1/models/c1scx.mdl
M       sus/c1/models/c1scy.mdl
M       isc/c1/models/c1lsc.mdl
M       isc/c1/models/c1pem.mdl
M       isc/c1/models/c1ioo.mdl
M       isc/c1/models/ADAPT_XFCODE_MCL.c
M       isc/c1/models/c1oaf.mdl
M       isc/c1/models/c1gcv.mdl
M       isc/common/medm/OAF_OVERVIEW.adl
M       isc/common/medm/OAF_DOF_BLRMS.adl
M       isc/common/medm/OAF_OVERVIEW_BAK.adl
M       isc/common/medm/OAF_ADAPTATION_MICH.adl
pianosa:release 0>

  6659   Tue May 22 11:47:43 2012 JamieUpdateCDSMEDM suspension screens using macro

Actually, it looks like we're not quite done here.  All the paths in the SUS_SINGLE screen need to be updated to reflect the move.  We should probably make a macro that points to /opt/rtcds/caltech/c1/screens, and update all the paths accordingly.

  6661   Tue May 22 20:01:26 2012 DenUpdateCDSerror monitor

I've created transmission error monitors in rfm, oaf, sus, lsc, scx, scy and ioo models. I tried to get data from every channel transmitted through PCIE and RFM. I also included some shared memory channels.

The MEDM screen is in the EF STATUS -> TEM. It shows 16384 for the channels that come from the simulation plant. The others are 0; that's fine.

  6663   Tue May 22 20:46:38 2012 yutaUpdateCDSMEDM suspension screens using macro

I fixed the problem Jamie pointed out in elog #6657 and #6659.

What I did:
1. Created the following template files in the /opt/rtcds/userapps/trunk/sus/c1/medm/templates/ directory.

SUS_SINGLE_LOCKIN1.adl
SUS_SINGLE_LOCKIN2.adl
SUS_SINGLE_LOCKIN_INMTRX.adl
SUS_SINGLE_OPTLEV_SERVO.adl
SUS_SINGLE_PITCH.adl
SUS_SINGLE_POSITION.adl
SUS_SINGLE_SUSSIDE.adl
SUS_SINGLE_TO_COIL_MASTER.adl
SUS_SINGLE_COIL.adl
SUS_SINGLE_YAW.adl
SUS_SINGLE_INMATRIX_MASTER.adl
SUS_SINGLE_INPUT.adl
SUS_SINGLE_TO_COIL_X_X.adl
SUS_SINGLE_OPTLEV_IN.adl
SUS_SINGLE_OLMATRIX_MASTER.adl

To open these files, you have to define $(OPTIC) and $(DCU_ID).
For SUS_SINGLE_TO_COIL_X_X.adl, you also have to define $(FILTER_NUMBER). See SUS_SINGLE_TO_COIL_MASTER.adl.

2. Fixed the following screens so that they open SUS_SINGLE.adl.

C1SUS_WATCHDOGS.adl
C1IOO_MC_ALIGN.adl
C1IOO_WFS_MASTER.adl
C1IFO_ALIGN.adl

  6670   Thu May 24 01:17:13 2012 DenUpdateCDSPMC autolocker

Quote:

 

  • SCRIPT
    • Auto-locker for PMC, PSL things - DEN

 I wrote an auto-locker for the PMC. It is called autolocker_pmc, located in the scripts directory and committed to the SVN. I connected it to the channel C1:PSL-PMC_LOCK.  It is currently running on rosalba. The MC autolocker runs on op340m, but I could not execute the script on that machine:

op340m:scripts>./autolock_pmc
./autolock_pmc: Stale NFS file handle.

I did several tests; usually the script locks the PMC in a few seconds. However, if the PMC DC offset has drifted significantly, it might take longer, as discussed below.

The algorithm:

       if the autolocker is enabled, monitor the PSL-PMC_PMCTRANSPD channel
       if TRANS is less than 0.4, start locking:
               disengage the PMC servo by enabling PMC TEST 1
               change PSL-PMC_RAMP until TRANS is higher than 0.4 (*)
               engage the PMC servo by disabling PMC TEST 1
       else sleep for 1 sec
 

(*) is tricky. If RAMP (DC offset) is specified, then TRANS will be oscillating in the range ( TRANS_MIN, TRANS_MAX ). We are interested only in TRANS_MAX. To make sure we estimate it right, the TRANS channel is read 10 times and the maximum value is chosen. This works well.

The next problem is to find the proper range and step to vary the DC offset RAMP. Of course, we could choose the maximum range (-7, 0) and the minimum step 0.0001, but then it would take too long to find the proper DC offset. For that reason the autolocker tries to find a resonance close to the previous DC offset in the range (RAMP_OLD - delta, RAMP_OLD + delta); the initial delta is 0.03 and the step is 0.003. If a resonance is not found in this region, delta is multiplied by a factor of 2, and so on. During this process the RAMP range is constrained to stay within (-7, 0).
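The expanding-window search described above can be sketched as follows. This is only a sketch, not the real script: read_trans() and set_ramp() stand in for the EPICS reads/writes of C1:PSL-PMC_PMCTRANSPD and C1:PSL-PMC_RAMP, and the fake plant at the bottom is made up for illustration; the numbers are the ones quoted in this entry.

```python
def scan_ramp(read_trans, set_ramp, ramp_old,
              threshold=0.4, delta=0.03, step=0.003, lo=-7.0, hi=0.0):
    while True:
        # search window around the previous DC offset, clipped to (lo, hi)
        a, b = max(lo, ramp_old - delta), min(hi, ramp_old + delta)
        r = a
        while r <= b:
            set_ramp(r)
            # TRANS oscillates between TRANS_MIN and TRANS_MAX, so read
            # it 10 times and keep the maximum
            if max(read_trans() for _ in range(10)) > threshold:
                return r
            r += step
        if (a, b) == (lo, hi):
            return None          # scanned the full range, no resonance
        delta *= 2               # widen the window and try again

# Fake plant with a resonance near RAMP = -3.5, just outside the
# first search window:
state = {"ramp": 0.0}
def set_ramp(r): state["ramp"] = r
def read_trans(): return 1.0 if abs(state["ramp"] + 3.5) < 0.005 else 0.1

print(scan_ramp(read_trans, set_ramp, ramp_old=-3.45))  # close to -3.5
```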

There might be a better way to do this. For example, use a gradient descent algorithm and control the step adaptively. I'll do that if this implementation turns out to be too slow.

I've disabled autolocker_pmc for the night.

  6699   Tue May 29 00:53:57 2012 DenUpdateCDSproblems

I've noticed several CDS problems:

  1. The communication sign on the C1SUS model turns red once in a while. I press diag reset and it is gone, but after some time it comes back.
  2. On the C1LSC machine the red "U" lamp blinks with a period of ~5 sec.
  3. I was not able to read data from the SR785 using netgpibdata.py. Either the connection is not established at all, or the data starts to download and then stops in the middle. I've checked the cables, power supplies and everything; still the same thing.
  6734   Thu May 31 22:13:08 2012 JamieUpdateCDSc1lsc: added remaining SHMEM senders for ERR and CTRL, c1oaf model updated appropriately

All the ERR and CTRL outputs in c1lsc now go to SHMEM senders.  I renamed the CTRL output SHMEM senders to be more generic, since they aren't specifically for OAF anymore.  See attached image from c1lsc.

c1oaf was updated so that SHMEM receivers pointed to the newly renamed senders.

c1lsc and c1oaf were rebuilt, installed, and restarted and are now running.

  6748   Sun Jun 3 23:50:00 2012 DenUpdateCDSbiquad=1

From now on, all models calculate IIR filters using the biquad form. I've added biquad=1 to cdsParameters in all models except c1cal, then built, installed and restarted them.
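For reference, this is a single extra line in each model's cdsParameters block; the other parameter lines shown here are illustrative, not the actual contents of any of our models:

```
# cdsParameters (other values illustrative)
site=caltech
rate=16K
dcuid=21
biquad=1
```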

  6755   Tue Jun 5 14:47:28 2012 JamieUpdateCDSnew c1tst model for testing RCG code

I made a new model, c1tst, that we can use for debugging the FREQUENT RCG bugs that we keep encountering.  It's a bare model that runs on c1iscey.  Don't do anything important in here, and don't leave it in some crappy state.  Clean it up when you're done.

  6760   Wed Jun 6 00:32:22 2012 JenneUpdateCDSRFM model is way overloading the cpu

We have too much crap in the rfm model.  CPU time for the rfm model is regularly above 60us, and sometimes in the mid-70s (but it sometimes jumps down briefly to ~47us, which is where I think it "used" to sit, though I don't remember when I last thought about that number).

This is potentially causing lots of asynchronous grief.

  6778   Thu Jun 7 03:37:26 2012 yutaUpdateCDSmx_stream restarted on c1lsc, c1ioo

c1lsc and c1ioo computers had FB net statuses all red. So, I restarted mx_stream on each computer.

ssh controls@c1lsc
sudo /etc/init.d/mx_stream restart

  6787   Thu Jun 7 17:49:09 2012 JamieUpdateCDSc1sus in weird state, running models but unresponsive otherwise

Somehow c1sus was in a very strange state.  It was running models, but EPICS was slow to respond.  We could not log into it via ssh, and we could not bring up test points.  Since we didn't know what else to do we just gave it a hard reset.

Once it came up, none of the models were running.  I think this is a separate problem with the model startup scripts that I need to debug.  I logged on to c1sus and ran:

rtcds restart all

(which handles proper order of restarts) and everything came up fine.

Have no idea what happened there to make c1sus freeze like that.  Will keep an eye out.

  6806   Tue Jun 12 17:29:28 2012 DenUpdateCDSdq channels

All PEM and IOO DQ channels disappeared. These channels were commented out in the C1???.ini files, even though I uncommented them a few weeks ago. It happened after these models were rebuilt; the C1???.ini files also changed. Why?

I added the channels back. mx_stream died on c1sus after I pressed DAQ Reload on the MEDM screen. For the IOO model it is even worse: after pressing DAQ Reload for the C1IOO model, the DAQD process dies on the FB and the IOO machine hangs.

I rebooted IOO, restarted the models and fb. The models work now, but there should be an easier way to add channels without rebooting machines and daemons.

  6911   Wed Jul 4 17:33:04 2012 JamieUpdateCDStiming, possibly leap second, brought down CDS

I got a call from Koji and Yuta that something was wrong with the CDS system.  I somehow had an immediate suspicion that it had something to do with the recent leap second.

It took a while for nodus to respond, and once he finally let me in I found a bunch of the following in his dmesg, repeated and filling the buffer:

Jul  3 22:41:34 nodus xntpd[306]: [ID 774427 daemon.notice] time reset (step) 0.998366 s
Jul  3 22:46:20 nodus xntpd[306]: [ID 774427 daemon.notice] time reset (step) -1.000847 s

Looking at date on all the front end systems, including fb, I could tell that they all looked a second fast, which is what you would expect if they had missed the leap second.  Everything syncs against nodus, so given nodus's problems above, that might explain everything.

I stopped daqd and nds on fb, and unloaded the mx drivers, which seemed to be showing problems.  I also stopped nodus's xntp:

  sudo /etc/init.d/xntpd stop

His ntp config file is in /etc/inet/ntp.conf, which is definitely the WRONG PLACE, given that the ntp server is not, as far as I can tell, being controlled by inetd.  (nodus is WAY out of date and desperately needs an overhaul.  it's nearly impossible to figure out what the hell is going on in there).  I found an old elog of Rana's that mentioned updating his config to point him to the caltech NTP server, which is now listed in the config, so I tried manually resyncing against that:

  sudo ntpdate -s -b -u 131.215.239.14

Unfortunately that didn't seem to have any effect.  This made me wonder whether the caltech server is off.  Anyway, I tried resyncing against the global NTP pool:

  sudo ntpdate -s -b -u pool.ntp.org

This seemed to work: the clock came back in sync with others that are known good.  Once nodus's time was good I reloaded the mx drivers on fb and restarted daqd and nds.  They seemed to come up fine.  At that point the front ends started coming back on their own.  I went and restarted all the models on the machines that didn't (c1iscey and c1ioo).  Currently everything is looking ok.

I'm worried that there is still a problem with one of the NTP servers that nodus is sync'ing against, and that the problem might come back.  I'll check in again later tonight.

  6915   Thu Jul 5 01:20:58 2012 yutaSummaryCDSslow computers, 0x4000 for all DAQ status

ALS looks OK. I tried to lock FPMI using ALS, but I feel like I need 6 hands to do it with the current ALS stability. And now all the computers are very slow.

It was fine for 7 hours after Jamie the Great fixed this, but fb went down a couple of times and the DAQ status for all models now shows 0x4000. I tried restarting mx_stream and restarting fb, but they didn't help.

  6917   Thu Jul 5 10:49:38 2012 JamieUpdateCDSfront-end/fb communication lost, likely again due to timing offsets

All the front-ends are showing 0x4000 status and have lost communication with the frame builder.  It looks like the timing skew is back again.  The fb is ahead of real time by one second, and strangely nodus is ahead of real time by something like 5 seconds!  I'm looking into it now.

  6918   Thu Jul 5 11:12:53 2012 JenneUpdateCDSfront-end/fb communication lost, likely again due to timing offsets

Quote:

All the front-ends are showing 0x4000 status and have lost communication with the frame builder.  It looks like the timing skew is back again.  The fb is ahead of real time by one second, and strangely nodus is ahead of real time by something like 5 seconds!  I'm looking into it now.

 I was bad and didn't read the elog before touching things, so I did a daqd restart and an mx_stream restart on all the front ends, but neither of those things helped.  Then I saw the elog that Jamie's working on figuring it out.

  6920   Thu Jul 5 12:27:05 2012 JamieUpdateCDSfront-end/fb communication lost, likely again due to timing offsets

Quote:

All the front-ends are showing 0x4000 status and have lost communication with the frame builder.  It looks like the timing skew is back again.  The fb is ahead of real time by one second, and strangely nodus is ahead of real time by something like 5 seconds!  I'm looking into it now.

This was indeed another leap second timing issue.  I'm guessing nodus resync'd from whatever server is posting the wrong time, and it brought everything out of sync again.  It really looks like the caltech server is off.  When I manually sync from there the time is off by a second, whereas when I manually sync from the global pool it is correct.

I went ahead and updated nodus's config (/etc/inet/ntp.conf) to point to the global pool (pool.ntp.org).  I then restarted the ntp daemon:

  nodus$ sudo /etc/init.d/xntpd stop
  nodus$ sudo /etc/init.d/xntpd start

That brought nodus's time in sync.

At that point all I had to do was resync the time on fb:

  fb$ sudo /etc/init.d/ntp-client restart

When I did that daqd died, but it immediately restarted and everything was in sync.

  6997   Fri Jul 20 17:11:50 2012 JamieUpdateCDSAll custom MEDM screens moved to cds_user_apps svn repo

Since there are various ongoing requests for this from the sites, I have moved all of our custom MEDM screens into the cds_user_apps SVN repository.  This is what I did:

For each system in /opt/rtcds/caltech/c1/medm, I copied its "master" directory into the repo, and then linked it back into the usual place, e.g.:

a=/opt/rtcds/caltech/c1/medm/${model}/master
b=/opt/rtcds/userapps/trunk/${system}/c1/medm/${model}
mv $a $b
ln -s $b $a

Before committing to the repo, I did a little bit of cleanup, to remove some binary files and other known superfluous stuff.  But I left most things there, since I don't know what is relevant or not.

Then committed everything to the repo.

 

  6999   Sat Jul 21 14:48:33 2012 DenUpdateCDSRCG

Since I spent many hours trying to track down an error in my C code for an online filter, I've decided to write about it to prevent people from repeating this.

I have a C function that was tested offline. I compiled and installed it on the front-end machine without any errors. But when I restarted the model, it did not run.

I modified the function the following way

void myFunction()
{
if(STATEMENT) return;
some code
}

I adjusted the input parameters so that STATEMENT was always true. However, whether the model started or not depended on the code after the if statement. It turned out that the model could not start because of the following lines:


cosine[1] = 1.0 - 0.5*a*a + a*a*a*a/24 - a*a*a*a*a*a/720 + a*a*a*a*a*a*a*a/40320;
sine[1] = a - a*a*a/6 + a*a*a*a*a/120 - a*a*a*a*a*a*a/5040;

When I split the sum into steps, the model began to run. I guess the conclusion is that we cannot put too many arithmetic operations into one "=" statement. The most interesting thing is that these lines stood after an always-true if statement and should never even have been executed. A possible explanation is that some compilers start to process the code after the if statement while the comparison is still pending; in our case the model could start and then break down on these long expressions.
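For the record, the same polynomials can be evaluated in short steps using Horner's form, which both keeps every statement small and reduces the operation count. A sketch in Python, just to check the arithmetic (the actual fix in the front-end code would be the same refactoring in C):

```python
import math

# The original one-statement Taylor expansion of cos (terms through a^8):
def cos_series(a):
    return (1.0 - 0.5*a*a + a*a*a*a/24 - a*a*a*a*a*a/720
            + a*a*a*a*a*a*a*a/40320)

# The same polynomial split into short steps (Horner form in x = a*a):
def cos_series_split(a):
    x = a * a
    c = 1.0 / 40320
    c = c * x - 1.0 / 720
    c = c * x + 1.0 / 24
    c = c * x - 0.5
    return c * x + 1.0

a = 0.3
assert abs(cos_series(a) - cos_series_split(a)) < 1e-12
assert abs(cos_series_split(a) - math.cos(a)) < 1e-6
```

The sine expansion splits the same way in x = a*a with a single final multiplication by a.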
