  40m Log, Page 91 of 355
ID   Date   Author   Type   Category   Subject
  2067   Thu Oct 8 11:10:50 2009   josephb, jenne   Update   Computers   EPICs Computer troubles

At around 9:45 the RFM/FB network alarm went off, and I found c1asc, c1lsc, and c1iovme not responding. 

I went out to hard restart them, and also c1susvme1 and c1susvme2 after Jenne suggested that.

c1lsc seemed to make a promising comeback initially, but not really.  I was able to ssh in and run the start command.  The green light under c1asc on the RFMNETWORK status page lit, but the reset and CPU usage information is still white, as if it's not connected.  If I try to load an LSC channel (say the PD5_DC monitor) as a testpoint in DTT it works fine, but the 16 Hz EPICS monitor version is dead.  The fact that we were able to ssh in means the network is at least partly working.

I had to reboot c1asc multiple times (3 in total), waiting a full minute on the last power cycle, before being able to telnet in.  Once in, I ran startup.cmd, which did set DAQ-STATUS to green for c1asc, but it has the same lack of EPICS communication as c1lsc.

c1iovme was rebooted; I was able to telnet in and run startup.cmd.  The status light went green, but still no EPICS updates.

The crate containing c1susvme1 and c1susvme2 was power cycled.  We were able to ssh into c1susvme1 and restart it, and it came back fully: status light, CPU load, and channels all working.  However, c1susvme2 was still having problems, so I power cycled the crate again.  This time c1susvme2 came back, its status light lit green, and its channels started updating.

At this point, lacking any better ideas, I'm going to do a full reboot, cycling c1dcuepics and proceeding through the restart procedures.

  2068   Thu Oct 8 11:37:59 2009   josephb   Update   Computers   Reboot of dcuepics helped, c1susvme1 having problems

Power cycling c1dcuepics seems to have fixed the EPICS channel problems, and c1lsc, c1asc, and c1iovme are talking again.

I burt restored c1iscepics and c1Iosepics from the snapshot at 6 am this morning.

However, c1susvme1 never came back after the last power cycle of the crate it shares with c1susvme2.  I connected a monitor and keyboard per the reboot instructions and hit ctrl-x; it proceeded to boot, but it displays a media error (PXE-E61, the network-boot ROM's "media test failure, check cable" complaint), suggests testing the cable, and only offers the option to reboot.  From a cursory inspection of the front, the cables look okay.  Also, this machine had eventually come back after the first power cycle, and I'm pretty sure no cables were moved in between.

 

  2069   Thu Oct 8 14:41:46 2009   jenne   Update   Computers   c1susvme1 is back online

Quote:

Power cycling c1dcuepics seems to have fixed the EPICS channel problems, and c1lsc, c1asc, and c1iovme are talking again.

I burt restored c1iscepics and c1Iosepics from the snapshot at 6 am this morning.

However, c1susvme1 never came back after the last power cycle of the crate it shares with c1susvme2.  I connected a monitor and keyboard per the reboot instructions and hit ctrl-x; it proceeded to boot, but it displays a media error (PXE-E61), suggests testing the cable, and only offers the option to reboot.  From a cursory inspection of the front, the cables look okay.  Also, this machine had eventually come back after the first power cycle, and I'm pretty sure no cables were moved in between.

 

 I had a go at bringing c1susvme1 back online.  The first few times I hit the physical reset button, I saw the same error that Joe mentioned, about needing to check some cables.  I tried one round of rebooting c1sosvme, c1susvme2, and c1susvme1, with no success.  After a few iterations of jiggle-cables/reset-button/ctrl-x on c1susvme1, it came back.  I ran the startup.cmd script and re-enabled the suspensions, and the Mode Cleaner is now locked.  So all systems are back online, and I'm crossing my fingers and toes that they stay that way, at least for a little while.

  2080   Mon Oct 12 14:51:41 2009   rob   Update   Computers   c1susvme2 timing problems update update update

Quote:

It got worse again, starting with locking last night, but it has not recovered.  Attached is a 3-day trend of SRM cpu load showing the good spell.

 Last week, Alex recompiled the c1susvme2 code without the decimation filters for the OUT16 channels, so these channels are now as aliased as the rest of them.  This appears to have helped with the timing issues: although it's not completely cured, it is much better.  Attached is a five-day trend.

Attachment 1: srmcpu.png
  2106   Fri Oct 16 16:44:39 2009   Alberto, Sanjit   Update   Computers   elog restarted

This afternoon the elog crashed. We just restarted it.

  2179   Thu Nov 5 12:34:26 2009   kiwamu   Update   Computers   elog rebooted

I found that the elog had crashed.  I rebooted the elog daemon about 10 minutes ago.

  2182   Thu Nov 5 16:30:56 2009   pete   Update   Computers   moving megatron

 Joe and I moved megatron and its associated IO chassis from 1Y3 to 1Y9, in preparation for RCG tests at ETMY.

  2183   Thu Nov 5 16:41:14 2009   josephb   Configuration   Computers   Megatron's personal network

In investigating why megatron wouldn't talk to the network, I re-discovered the fact that it had been placed on its own private network to avoid conflicts with the 40m's test point manager.  So I moved the Linksys router (model WRT310N V2) down to 1Y9, plugged megatron into a normal network port, and connected the router's internet port to the rest of the gigabit network.

Unfortunately, megatron still didn't see the rest of the network, and vice versa.  I brought out my laptop and started looking at the settings.  The router had been configured with the DMZ zone on for 192.168.1.2, which was megatron's IP, so communications should flow through it.  It turns out the DHCP server on the gateway router (131.215.113.2) needs to be on for everyone to talk to each other.  However, this may not be the best practice; it'd probably be better to fix the router's IP and turn off the DHCP server on the gateway.  I'll look into doing this tomorrow.

Also during this I found that the DNS server running on linux1 had its IP-to-name and name-to-IP files in disagreement about megatron's IP: the IP-to-name file claimed 131.215.113.95, while the name-to-IP file claimed 131.215.113.178.  I set both to 131.215.113.178.  (These are in the /var/named/chroot/var/ directory on linux1; the files are 113.215.131.in-addr.arpa.zone and martian.zone, and I modified the 113.215.131.in-addr.arpa.zone file.)  This is the DHCP-served IP address from the gateway, and in principle it could change or be given to another machine while the DHCP server is on.
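For reference, the forward and reverse lookups live in separate zone files and have to be kept in sync by hand, which is how they drifted apart here.  A sketch of the two records once consistent (the record layout is standard BIND; the exact names and formatting in the real linux1 files may differ):

```
; martian.zone (name -> IP)
megatron    IN  A    131.215.113.178

; 113.215.131.in-addr.arpa.zone (IP -> name)
178         IN  PTR  megatron.martian.
```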

  2187   Fri Nov 6 00:23:34 2009   Alberto   Configuration   Computers   Elog just rebooted

The elog just crashed and I rebooted it.

  2190   Fri Nov 6 07:55:59 2009   steve   Update   Computers   RFMnetwork is down

The RFM network is down.  MC2 sus damping restored.

  2192   Fri Nov 6 10:35:56 2009   josephb   Update   Computers   RFM reboot fest and re-enabled ITMY coil drivers

As noted by Steve, the RFM network was down this morning.  I noticed that the c1susvme1 sync counter was pegged at 16384, so I decided to start with reboots in that vicinity.

After power cycling the crates containing c1sosvme, c1susvme1, and c1susvme2 (since the reset buttons didn't work), only c1sosvme and c1susvme2 came back normally.  I hooked up a monitor and keyboard to c1susvme1 but saw nothing.  I power cycled the c1susvme crate again, and this time I watched it boot properly.  I'm not sure why it failed the first time.

The RFM network is now operating normally.  I have re-enabled the watchdogs after turning them off for the reboots.  Steve and I also re-enabled the ITMY coil drivers when I noticed them not damping once the watchdogs were re-enabled: the manual switches had been set to disabled, so we re-enabled them.

  2195   Fri Nov 6 17:04:01 2009   josephb   Configuration   Computers   RFM and Megatron

I took the RFM 5565 card dropped off by Jay and installed it into megatron.  It is not very secure, as it was too tall for the slot and could not be locked down.  I did not connect the RFM fibers at this point, so just the card is plugged in.

Unfortunately, on power up, and immediately after the splash screen I get "NMI EVENT!" and "System halted due to fatal NMI". 

The status light on the RFM card remains a steady red as well.  There is a distinct possibility the card is broken in some way.

The card is a VMIPMC-5565 (which is the same as the card used by the ETMY front end machine).  We should get Alex to come in and look at it on Monday, but we may need to get a replacement.

  2196   Fri Nov 6 18:02:22 2009   josephb   Update   Computers   Elog restarted

While I was writing an elog entry, the elog died again, and I restarted it.  I'm not sure what caused it to die, since no one was uploading to it at the time.

  2197   Fri Nov 6 18:13:34 2009   josephb   Update   Computers   Megatron woes

I have removed the RFM card from Megatron and left it (along with all the other cables and electronics) on the trolly in front of the 1Y9 rack.

Megatron proceeded to boot normally up until it started loading Centos 5.  During the linux boot process it checks the file systems.  At this point we have an error:

 

/dev/VolGroup00/LogVol00 contains a file system with errors, check forced

Error reading block 28901403 (Attempt to read block from filesystem resulted short read) while doing inode scan.

/dev/VolGroup00/LogVol00 Unexpected Inconsistency; RUN fsck MANUALLY

 

So I ran fsck manually to see if I could get more information.  fsck reported that it couldn't read block 28901403 (due to a short read) and asked "ignore(y)?".  I ignored (by hitting space), and unfortunately hit it an additional time; the next question was "force rewrite(y)?", so I apparently forced a rewrite of that block.  On further ignores (but no forced rewrites) I continued seeing short-read errors at 28901404, *40, *41, *71, *512, *513, etc., so the bad blocks are not totally contiguous.  Each iteration takes about 5-10 seconds.  At that point I rebooted, but the same problem happened again, although starting at 28901404 instead of 28901403.  So apparently the forced rewrite fixed something, but I don't know if this is the best way of going about it.  I'm just wondering if there are any other tricks to try before I start rewriting random blocks on the hard drive.  I also don't know how widespread the problem is or how long a fix might take (if it's a large swath of the hard drive at 5-10 seconds per bad block, it could take a while).
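For anyone repeating this later: fsck's behavior can be rehearsed harmlessly on a throwaway loopback image instead of the failing drive.  A sketch, with hypothetical /tmp paths (the real fix still has to run against the actual disk):

```shell
# Build a small scratch ext2 filesystem image so fsck can be exercised
# without touching any real drive
dd if=/dev/zero of=/tmp/scratch.img bs=1M count=8 2>/dev/null
mkfs.ext2 -F -q /tmp/scratch.img
# -f forces a check even if the fs is marked clean; -n answers "no" to
# every prompt, so the pass is read-only and nothing gets rewritten
fsck.ext2 -f -n /tmp/scratch.img
```

On a suspect drive, a read-only pass like this shows the extent of the damage before any "force rewrite" answer changes blocks for good.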

So for the moment, megatron is not functional.  Hopefully I can get some advice from Alex on Monday (or from anyone else who wants to chime in).  It may wind up being easiest to just wipe the drive and re-install real-time Linux, but I'm no expert at that.

 

  2198   Fri Nov 6 18:52:09 2009   pete   Update   Computers   RCG ETMY plan

  Koji, Joe, and I are planning to try controlling the ETMY on Monday or Tuesday.  Our plan is to do this with megatron out of the RFM loop.  The RCG system includes pos, pit, yaw, side, and oplevs.  I will use matrix elements as currently assigned (i.e. not the ideal case of +1 and -1 everywhere).  I will also match channel names to the old channels.  We could put buttons on the medm screen to control the analog DW by hand.

 
This assumes we can get megatron happy again, after the unhappy RFM card test today.  See Joe's elog immediately before this one.
 
We first plan to test that the ETMY watchdog can disable the RCG frontend, by using a pos step (before connecting to the suspension).   Hopefully we can make it work without the RFM.  Otherwise I think we'll have to wait for a working RFM card.
 
We plan to disable the other optics.  We will disable ETMY, take down the ETMY frontend, switch the cables, and start up the new RCG system.  If output looks reasonable we will enable the ETMY via the watchdog.   Then I suppose we can put in some small steps via the RCG controller and see if it damps.  
 
Afterwards, we plan to switch everything back.
  2212   Mon Nov 9 13:22:08 2009   josephb, alex   Update   Computers   Megatron update

Alex and I took a look at megatron this morning, and it was in the same state I left it on Friday, with file system errors.  We were able to copy the advLIGO directory Peter had been working in to Linux1, so it should be simple to restore the code.  We then tried just running fsck and overwriting bad sectors, but after about 5 minutes it was clear it could take a very long time (5-10 seconds per unreadable block, with an unknown number of bad blocks, possibly tens of millions).  The decision was made to simply replace the hard drive.

Alex is of the opinion that the hard drive failure was a coincidence.  Or rather, he can't see how the RFM card could have caused this kind of failure.

Alex went to Bridge to grab a usb to sata adapter for a new hard drive, and was going to copy a duplicate install of the OS onto it, and we'll try replacing the current hard drive with it.

  2215   Mon Nov 9 14:59:34 2009   josephb, alex   Update   Computers   The saga of Megatron continues

Apparently the random file system failure on megatron was unrelated to the RFM card (or at least unrelated to the physical card itself; it's possible I did something while installing it, however unlikely).

We installed a new hard drive with a duplicate copy of RTL and assorted code stolen from another computer.  We still need to get the hostname and a variety of little details straightened out, but it boots and can talk to the internet.  For the moment, though, megatron thinks its name is scipe11.

You still use ssh megatron.martian to log in though.

We installed the RFM card again, and saw the exact same error as before.  "NMI EVENT!" and "System halted due to fatal NMI".

Alex has hypothesized that the interface card the actual RFM card plugs into (which provides the PCI-X connection) might be the wrong type, so he has gone back to Wilson House to look for a new interface card.  If that doesn't work out, we'll need to acquire a new RFM card at some point.

After removing the RFM card, megatron booted up fine and had no file system errors.  So the previous failure was in fact a coincidence.

 

  2220   Mon Nov 9 18:27:30 2009   Alberto   Frogs   Computers   OMC DCPD Interface Box Disconnected from the power Supply

This afternoon I inadvertently disconnected one of the power cables coming from the power supply on the floor next to the OMC cabinet and going to the DCPD Interface Box.

Rob reconnected the cable as it was before.

  2221   Mon Nov 9 18:32:38 2009   rob   Update   Computers   OMC FE hosed

It won't start--it just sits at Waiting for EPICS BURT, even though the EPICS is running and BURTed.

 

[controls@c1omc c1omc]$ sudo ./omcfe.rtl
cpu clock 2388127
Initializing PCI Modules
3 PCI cards found
***************************************************************************
1 ADC cards found
        ADC 0 is a GSC_16AI64SSA module
                Channels = 64
                Firmware Rev = 3

***************************************************************************
1 DAC cards found
        DAC 0 is a GSC_16AO16 module
                Channels = 16
                Filters = None
                Output Type = Differential
                Firmware Rev = 1

***************************************************************************
0 DIO cards found
***************************************************************************
1 RFM cards found
        RFM 160 is a VMIC_5565 module with Node ID 130
***************************************************************************
Initializing space for daqLib buffers
Initializing Network
Waiting for EPICS BURT


  2222   Mon Nov 9 19:04:23 2009   rob   Update   Computers   OMC FE hosed

Quote:

It won't start--it just sits at Waiting for EPICS BURT, even though the EPICS is running and BURTed.

 

[controls@c1omc c1omc]$ sudo ./omcfe.rtl
cpu clock 2388127
Initializing PCI Modules
3 PCI cards found
***************************************************************************
1 ADC cards found
        ADC 0 is a GSC_16AI64SSA module
                Channels = 64
                Firmware Rev = 3

***************************************************************************
1 DAC cards found
        DAC 0 is a GSC_16AO16 module
                Channels = 16
                Filters = None
                Output Type = Differential
                Firmware Rev = 1

***************************************************************************
0 DIO cards found
***************************************************************************
1 RFM cards found
        RFM 160 is a VMIC_5565 module with Node ID 130
***************************************************************************
Initializing space for daqLib buffers
Initializing Network
Waiting for EPICS BURT


 

From looking at the recorded data, it looks like the c1omc started going funny on the afternoon of Nov 5th, perhaps as a side-effect of the Megatron hijinks last week.

 

It works when megatron is shut down.

  2224   Mon Nov 9 19:44:38 2009   rob, rana   Update   Computers   OMC FE hosed

 

We found that someone had set the name of megatron to scipe11. This is the same name as the existing c1aux in the op440m /etc/hosts file.

We did a /sbin/shutdown on megatron and the OMC now boots.

Please check that things are working right after playing with megatron, or else this will sabotage the DR locking and diagnostics.

  2225   Tue Nov 10 10:51:00 2009   josephb, alex   Update   Computers   Megatron on, powercycled c1omc, and burt restored from 3am snapshot

Last night around 5pm or so, Alex had remotely logged in and made some fixes to megatron.

First, he changed the local name from scipe11 to megatron.  There were no changes to the network; this was a purely local change.  The name server running on Linux1 is what provides the name-to-IP conversions, and scipe11 and megatron both resolve to distinct IPs.  Given that c1auxex wasn't reported to have any problems (and I didn't see any problems with it yesterday), this was not a source of conflict.  It's possible that megatron could get confused while in that state, but it would not have affected anything outside its box.

Just to be extra secure, I've switched megatron's personal router over from a DMZ setup to only forwarding port 22.  I have also disabled the dhcp server on the gateway router (131.215.113.2).

Second, he turned the mdp and mdc codes on.  This should not have conflicted with c1omc.

This morning I came in, turned megatron back on around 9:30, and began trying to replicate the problems from last night between c1omc and megatron.  I called Alex, and we rebooted c1omc while megatron was on but not running any code, without any changes to the setup (routers, etc.).  We were able to burt restore.  Then we turned on the mdp, mdc, and framebuilder codes and again rebooted c1omc, which appeared to burt restore as well (I restored from the 3 am snapshot this morning, which looks reasonable to me).

Finally, I made the changes mentioned above to the router setups in the hope that this will prevent future problems, but without being able to replicate the issue I'm not sure.

  2228   Tue Nov 10 17:49:20 2009   Alberto   Metaphysics   Computers   Test Point Number Mapping

I found this interesting entry by Rana in the old (deprecated) elog : here

I wonder if Rolf has ever written the mentioned GUI that explained the rationale behind the test point number mapping.

I'm just trying to add the StochMon calibrated channels to the frames.  Now I remember why I kept forgetting to do it...

  2231   Tue Nov 10 21:46:31 2009   rana   Summary   Computers   Test Point Number Mapping

Quote:

I found this interesting entry by Rana in the old (deprecated) elog : here

I wonder if Rolf has ever written the mentioned GUI that explained the rationale behind the test point number mapping.

I'm just trying to add the StochMon calibrated channels to the frames.  Now I remember why I kept forgetting to do it...

 As far as I know, the EPICS channels have nothing to do with test points.

  2253   Thu Nov 12 12:50:35 2009   Alberto   Update   Computers   StochMon calibrated channels added to the data trend

I added the StochMon calibrated channels to the data trend by including the following channel names in the C0EDCU.ini file:

[C1:IOO-RFAMPD_33MHZ_CAL]
[C1:IOO-RFAMPD_133MHZ_CAL]
[C1:IOO-RFAMPD_166MHZ_CAL]
[C1:IOO-RFAMPD_199MHZ_CAL]

Before saving the changes I committed C0EDCU.ini to the svn.

Then I restarted the frame builder so now the new channels can be monitored and trended.

  2255   Thu Nov 12 15:40:27 2009   josephb, koji, peter   Update   Computers   ETMY and Megatron test take 1

We connected megatron to the IO chassis, which in turn was plugged into the rest of the ETMY setup.  We had manually turned the watchdogs off before we touched anything, to ensure we didn't accidentally drive the optic.  The connections went smoothly.

However, on rebooting megatron with the IO chassis powered up, we were unable to actually start the code.  (The subsystem has been renamed from SAS to TST, short for "test".)  While starttst claimed to start the IOC server, we couldn't find the process running, nor did the medm screens associated with it work.

As a sanity test, we tried running mdp, Peter's plant model, but even that didn't actually run, and it gave an odd error we hadn't seen before:

"epicsThreadOnceOsd epicsMutexLock failed."

Running startmdp a second time didn't give the error message, but still no running code.  The mdp medm screens remained white.

We turned the IO chassis off and rebooted megatron, but we're still having the same problem.

 

Things to try tomorrow:

1) Try disconnecting megatron completely from the IO chassis and getting it to a state identical to that of last night, when mdp and mdc did run.

2) Confirm the .mdl files are still valid, and try rebuilding them.

  2264   Fri Nov 13 09:47:18 2009   josephb   Update   Computers   Megatron status lights lit

Megatron's top fan, rear PS, and temperature front-panel lights were all lit amber this morning.  I checked the service manual, found at:

http://docs.sun.com/app/docs/prod/sf.x4600m2?l=en&a=view

According to the manual, this means a front fan failed, a voltage event occurred, and we hit a high-temperature threshold.  However, there were no failure lights on any of the individual front fans (which should have been the case, given the front-panel fan light).  The lights remained on after I shut down megatron.  After unplugging, waiting 30 seconds, and plugging the power cords back in, the lights went off and stayed off.  Megatron seems to come up fine.

I unplugged the IO chassis from megatron, rebooted, and tried to start Peter's plant model.  It still prints that it's starting, but really doesn't.  One thing I forgot to mention in the previous elog on the matter: the local monitor prints "shm_open(): No such file or directory" every time we try to start one of these programs.

  2265   Fri Nov 13 09:54:14 2009   josephb   Configuration   Computers   Megatron switched to tcsh

I've changed megatron's controls account default shell to tcsh (like it was before).  It now sources cshrc.40m in /cvs/cds/caltech/ correctly at login, so all the usual aliases and programs work without doing any extra work.

  2266   Fri Nov 13 10:28:03 2009   josephb, alex   Update   Computers   Megatron is back to its old self

I called Alex this morning and explained the problems with megatron.

Turns out that when he was setting up megatron, he thought the startup script file rc.local was missing from the /etc directory, so he created it.  However, /etc/rc.local is normally just a link to /etc/rc.d/rc.local, so on startup (basically when we rebooted the machine yesterday) it was running an incorrect startup script.  The real rc.local includes the line:

/usr/bin/setup_shmem.rtl mdp mdc&

Hence the shm_open() errors we were getting.  We changed the file into a soft link and re-sourced the rc.local script, and mdp started right up.  So we're back to where we were two nights ago (although we do now have an RFM card in hand).
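In miniature, the fix just restores the usual symlink relationship.  A scratch-directory demonstration (the real paths are /etc/rc.local -> /etc/rc.d/rc.local; the /tmp paths below are only for illustration):

```shell
# Recreate the layout in /tmp: the "real" script lives in rc.d, and
# rc.local at the top is just a soft link to it
mkdir -p /tmp/rcdemo/rc.d
echo '/usr/bin/setup_shmem.rtl mdp mdc&' > /tmp/rcdemo/rc.d/rc.local
ln -sf /tmp/rcdemo/rc.d/rc.local /tmp/rcdemo/rc.local
# Reading through the link now reaches the real startup script
cat /tmp/rcdemo/rc.local
```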

Update:  The tst module wouldn't start, but after talking to Alex again, it seems I need to add tst to the /usr/bin/setup_shmem.rtl mdp mdc& line in order for it to have a shared memory location set up.  I have edited the file (/etc/rc.d/rc.local), adding tst at the end of that line.  On reboot, starttst actually loads the code, although for the moment I'm still getting blank white blocks on the medm screens.

  2267   Fri Nov 13 14:04:27 2009   josephb, koji   Update   Computers   Updated wiki with RCG instructions/tips

I've placed some notes pertaining to what Koji and I have learned today about getting the RCG code working on the 40m wiki at:

http://lhocds.ligo-wa.caltech.edu:8000/40m/Notes_on_getting_the_CDS_Realtime_Code_Generator_working

We're still trying to fix the tst system; at the moment it's reporting an invalid number of daq channels and failing during daq initialization.  (This is from the /cvs/cds/caltech/target/c1tst/log.txt file.)  Note: this problem is only on megatron and is separate from the conventional DAQ system of the 40m.

cpu clock 2800014
Warning, could open `/rtl_mem_tst' read/write (errno=0)
configured to use 2 cards
Initializing PCI Modules
2 PCI cards found
***************************************************************************
1 ADC cards found
        ADC 0 is a GSC_16AI64SSA module
                Channels = 64
                Firmware Rev = 512

***************************************************************************
1 DAC cards found
        DAC 0 is a GSC_16AO16 module
                Channels = 16
                Filters = None
                Output Type = Differential
                Firmware Rev = 3

***************************************************************************
0 DIO cards found
***************************************************************************
0 IIRO-8 Isolated DIO cards found
***************************************************************************
0 IIRO-16 Isolated DIO cards found
***************************************************************************
0 Contec 32ch PCIe DO cards found
0 DO cards found
***************************************************************************
0 RFM cards found
***************************************************************************
Initializing space for daqLib buffers
Initializing Network
Found 1 frameBuilders on network
Waiting for EPICS BURT at 0.000000 and 0 ns 0x3c40c004
BURT Restore = 1
Waiting for Network connect to FB - 10
Reconn status = 0 1
Reconn Check = 0 1
Initialized servo control parameters.
DAQ Ex Min/Max = 1 32
DAQ Tp Min/Max = 10001 10094
DAQ XTp Min/Max = 10094 10144
DAQ buffer 0 is at 0x8819a020
DAQ buffer 1 is at 0x8839a020
daqLib DCU_ID = 10
DAQ DATA INFO is at 0x3e40f0a0
Invalid num daq chans = 0
DAQ init failed -- exiting

  2268   Fri Nov 13 15:01:07 2009   Jenne   Update   Computers   Updated wiki with RCG instructions/tips

Quote:

I've placed some notes pertaining to what Koji and I have learned today about getting the RCG code working on the 40m wiki at:

http://lhocds.ligo-wa.caltech.edu:8000/40m/Notes_on_getting_the_CDS_Realtime_Code_Generator_working

We're still trying to fix the tst system; at the moment it's reporting an invalid number of daq channels and failing during daq initialization.  (This is from the /cvs/cds/caltech/target/c1tst/log.txt file.)

 Dmass tells me that you have to record at least one channel, i.e. at least one channel in your .ini file must be set to acquire, otherwise the DAQ will flip out.  It seems to be unhappy when you're not asking it to do things.
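For concreteness, "set to acquire" means the channel's stanza in the daq .ini file has acquisition turned on.  A hypothetical fragment (the channel name is made up, and the field set follows the usual CDS daq .ini convention, so compare against a working entry before trusting it):

```
[C1:TST-ETMY_SUSPOS_IN1]
acquire=1
datarate=2048
```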

  2269   Fri Nov 13 22:01:54 2009   Koji   Update   Computers   Updated wiki with RCG instructions/tips

I continued the STAND ALONE debugging of the megatron codes.

- I succeeded in running c1aaa with the ADC/DAC. (c1aaa is a playground for debugging.)

  The trick was to "copy the DAC block from sam.mdl to aaa.mdl".
  I don't understand why this works, but it worked.
  I still have a problem with the matrices: their medm screens are always blank. Needs more work.

- Also, I don't understand why I cannot run the build of c1tst when I copy the working aaa.mdl to tst.mdl.

- The problem Joe reported ("# of channels to be daqed") was solved by:

make uninstall-daq-aaa
make install-daq-aaa

  This command is also useful:

daqconfig

- Now I am in a stable development loop with these commands:

killaaa
make uninstall-daq-aaa
make aaa
make install-aaa
make install-daq-aaa
make install-screens-aaa
startaaa

  I have made a "go_build" script under /home/controls/cds/advLigo

usage:
./go_build aaa
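The go_build script itself isn't reproduced in the entry; a plausible sketch of what such a wrapper does, given the loop above (the kill*/start* commands and make targets only exist on megatron, so this is illustrative, not the actual script):

```shell
# go_build <system>: run the whole rebuild/reinstall cycle for one
# RCG system, e.g. "go_build aaa"
go_build() {
  sys=$1
  [ -n "$sys" ] || { echo "usage: go_build <system>"; return 1; }
  "kill$sys"
  make "uninstall-daq-$sys"
  make "$sys"
  make "install-$sys"
  make "install-daq-$sys"
  make "install-screens-$sys"
  "start$sys"
}
```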

- Note to self: frequently visited directories:

/home/controls/cds/advLigo/src/epics/simLink  (for model)
/home/controls/cds/advLigo                    (to build)
/cvs/cds/caltech/target/c1aaa                 (realtime code log)
/cvs/cds/caltech/target/c1aaaepics            (ioc log)
/cvs/cds/caltech/medm/c1/aaa                  (medm screens)
/cvs/cds/caltech/chans                        (filter coeffs)
/cvs/cds/caltech/chans/daq                    (daq settings)

  2270   Sat Nov 14 06:46:48 2009   Koji   Update   Computers   Updated wiki with RCG instructions/tips

I am still working on the c1aaa code. Now it seems that C1AAA is working reasonably (...so far).

1) At a certain point I wanted to clean up the system status.  I edited /etc/rc.local to add c1aaa to the realtime-to-non-realtime shared memory task:

before:
/usr/bin/setup_shmem.rtl mdp mdc tst&
after:
/usr/bin/setup_shmem.rtl mdp mdc tst aaa&

   I rebooted the system several times.

sudo /sbin/reboot

2) I found that garbage medm screens had accumulated in ~/cds/advLigo/build/aaaepics/medm after many trials with several simulink models.
This directory is copied wholesale to /cvs/cds/caltech/medm/c1/aaa at every make install-screens-aaa.
This caused very confusing MEDM screens in the medm dir, like C1AAA_ETMX_IN_MATRX.adl (NOT ETMY!).

I did

cd ~/cds/advLigo
make clean-aaa

to refresh the aaaepics dir. The current development procedure is:

killaaa
make clean-aaa
make uninstall-daq-aaa
make aaa
make install-aaa
make install-daq-aaa
make install-screens-aaa
startaaa

3) Sometimes startaaa does not start the task properly. If the task does not work, don't give up; try restarting it. This may help.

killaaa
(deep breathing several times)
startaaa

What to do next:

- MEDM work:

* make more convenient custom MEDM screens so that we can easily access the filters and switches
* retrofit the conventional SUS MEDM to the new system

- once again put/confirm the filter coeffs and the matrix elements

- configure DAQ setting so that we can observe suspension motion by dataviewer / dtt

- connect the suspension to megatron again

- test the control loop

  2273   Mon Nov 16 15:13:25 2009   josephb   Update   Computers   ezcaread updated to Yoichi style ezcawrite

In order to get the GigE camera code running robustly here at the 40m, I created a "Yoichi style" ezcaread, which is now the default; the original ezcaread is now located at ezcaread.bin.  The new version tries 5 times before failing out of a read attempt.
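The "Yoichi style" behavior is just a bounded retry loop around the EPICS read.  A generic shell sketch of the same idea (retry5 is a made-up name for illustration, not the actual ezcaread implementation):

```shell
# Run a command up to 5 times, returning success on the first clean
# exit and failure only if all 5 attempts fail
retry5() {
  for attempt in 1 2 3 4 5; do
    "$@" && return 0
  done
  return 1
}

# e.g.: retry5 ezcaread SOME:CHANNEL:NAME
```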

  2276   Mon Nov 16 17:24:28 2009   josephb   Configuration   Computers   Camera medm functionality improved

Currently the Camera medm screen (now available from the sitemap) includes server and client script buttons.  The server button has two options: one starts a server; the second (for the moment) kills all copies of the server running on Ottavia.  The client button simply starts a video screen with the camera image.  The slider on this screen changes the exposure level.  The snapshot button saves a jpeg image in the /cvs/cds/caltech/cam/c1asport directory with a date and time stamp (down to the second).  For the moment, these buttons only work on Linux machines.

All channels were added to C0EDCU.ini, and should be being recorded for long term viewing.

Feel free to play around with it, break it, and let me know how it works (or doesn't).

  2278   Tue Nov 17 00:42:12 2009 | Koji | Update | Computers | Updated wiki with RCG instructions/tips

Dmass, Joe, Koji


A puzzle has been solved: Dmass gave us a great tip

"The RCG code does not work unless the name of the mdl file (Simulink model) matches the model name"

The model name is written in the second line of the file. It is automatically updated when the mdl file is saved from Simulink.
But we copied the model with the "cp" command, so the name was never updated. This prevented the TST model from working!

megatron:simLink>head tst.mdl
Model {
  Name                    "tst"
  Version                 7.3
  MdlSubVersion           0

...
...
...

This explains why the AAA model started working once the DAC block was copied in from another model.
It was not the ADC block that mattered; re-saving the model fixed the model name mismatch!
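A check for this kind of mismatch can be automated. The following is a sketch: the regex is an assumption based on the `head tst.mdl` output shown above, and the demo file name aaa.mdl is hypothetical.

```python
# Sketch: verify that an .mdl file's internal Name matches its filename,
# since the RCG silently fails on a mismatch (e.g. after copying with cp).
import re
from pathlib import Path

def mdl_name_matches(path):
    """Return (internal_name, matches) for a Simulink .mdl file."""
    text = Path(path).read_text()
    m = re.search(r'Name\s+"([^"]+)"', text)  # first Name field = model name
    internal = m.group(1) if m else None
    return internal, internal == Path(path).stem

# Demo with a throwaway copy reproducing the cp-induced mismatch:
demo = Path("aaa.mdl")  # hypothetical copy of tst.mdl made with cp
demo.write_text('Model {\n  Name                    "tst"\n  Version 7.3\n}\n')
name, ok = mdl_name_matches(demo)
print(name, ok)  # the copied file still says "tst", so ok is False
demo.unlink()
```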


Now our current working model is "C1TST". Most of the functionalities have been implemented now:

  • The Simulink model has been modified to accommodate functionalities such as LSC, ASC PIT, and ASC YAW.
  • Some filter names are fixed so as to inherit the previous naming conventions.
  • The SUS-ETMY epics screen was modified to fit the new channel names, the filter topologies, and the matrices.
  • The chans file was constructed so that the conventional filter coefficients are inherited.
  • All of the gains, filter SWs, and matrix elements have been set according to the current ETMY settings.
  • burt snapshot has been taken: /cvs/cds/caltech/target/c1tstepics/controls_1091117_024223_0.snap
    burtrb -f /cvs/cds/caltech/target/c1tstepics/autoBurt.req -o controls_1091117_024223_0.snap -l /tmp/controls_1091117_024215_0.read.log -v

What to do next:

  • Revisit the Oplev model so that it accommodates a power normalization functionality.
  • ETMY QPD model is also missing!
  • Clean up mdl file using subsystem grouping
  • Check consistency of the whitening/dewhitening switches.
  • Connect ADC/DAC to megatron
  • Test of the controllability
  • BTW, what happened to the BIO?
  • Implementation of the RFM card

Directories and the files:

  • The .mdl file is backed up as
    /home/controls/cds/advLigo/src/epics/simLink/tst.mdl.20091116_2100

  • The default screens built by "make" are installed in
    /cvs/cds/caltech/medm/c1/tst/
    They are overwritten by every subsequent build of the model.

  • The custom-built medm screens are stored in
    /cvs/cds/caltech/medm/c1/tst/CustomAdls/

    The backup is
    /cvs/cds/caltech/medm/c1/tst/CustomAdls/CustomAdls.111609_2300/

  • The custom-built chans file is
    /cvs/cds/caltech/chans/C1TST.txt

    The backup is
    /cvs/cds/caltech/chans/C1TST.111609

  • burt snap shot file
    /cvs/cds/caltech/target/c1tstepics/controls_1091117_024223_0.snap
  2299   Thu Nov 19 09:55:41 2009 | josephb | Update | Computers | Trying to get testpoints on megatron

This is a continuation from last night, where Peter, Koji, and I were trying to get test point channels working on megatron and with the TST module.

Things we noticed last night:

We could run starttst, and ./daqd -c daqdrc, which allowed us to get some channels in dataviewer.  The default 1k channel selection works, but none of the testpoints do. 

However, awgtpman -s tst does appear in the processes running list.

The error we get from dataviewer is:

Server error 861: unable to create thread
Server error 23328: unknown error
datasrv: DataWriteRealtime failed: daq_send: Illegal seek

Going to DTT, it starts with no errors in this configuration.  Initially it listed both MDC and TST channels.  However, as a test, I moved the tpchn_C4.par, tpchn_M4.par and tpchn_M5.par files to the backup directory in /cvs/cds/caltech/target/gds/param.  This caused only the TST channels to show up (which is what we want when not running the mdc module).

We had changed the daqdrc file in /cvs/cds/caltech/target/fb, several times to get to this state.  According to the directions in the RCG manual written by Rolf, we're supposed to "set cit_40m=1" in the daqdrc file, but it was commented out.  However, when we uncommented it, it started causing errors on dtt startup, so we put it back.  We also tried adding lines:

set dcu_rate 13 = 16384;
set dcu_rate 14 = 16384;

But this didn't seem to help.  The reason we tried it is that we noticed dcuid = 13 and dcuid = 14 in the /cvs/cds/caltech/target/gds/param/tpchn_C1.par file.  We also edited the testpoint.par file so that it correctly corresponded to the tst module, rather than to the mdc and mdp modules.  We basically set:

[C-node1]
hostname=192.168.1.2
system=tst

in that file, and commented everything else out.

At this point, given all the things we've changed, I'm going to try a rebuild of the tst and daq and see if that solves things.

 

  2300   Thu Nov 19 10:19:04 2009 | josephb | Update | Computers | Megatron tst status

I did a full make clean and make uninstall-daq-tst, then rebuilt it.  I copied a good version of filters to C1TST.txt in /cvs/cds/caltech/chans/ as well as a good copy of screens to /cvs/cds/caltech/medm/c1/tst/.

Test points still appear to be broken.  For a single measurement in dtt I was somehow able to start, but the plots on the results page didn't seem to contain any actual data, so I'm not sure what happened there - after that it just said it was unable to select test points, and it now says that on startup as well.  The tst channels are the only ones showing up.  However, the 1k channels seem to have disappeared from Data Viewer; only the 16k channels are selectable now, and they don't actually work.  Now that I think about it, I'm not sure where the 1k channels were coming from earlier.  They were listed like C1:TST-ETMY-SENSOR_UL and so forth.

RA: Koji and I added the SENSOR channels by hand to the .ini file last night so that we could have data stored in the frames ala c1susvme1, etc.

  2301   Thu Nov 19 11:33:15 2009 | josephb | Configuration | Computers | Megatron

I tried rebooting megatron, to see if that might help, but everything still acts the same. 

I tried using daqconfig and changed channels from deactivated to activated.  By activating them all, I learned that the daq can't handle that: it eventually aborts from an assert that checks for a too-small buffer.  I also tried activating just 2 and looking at those channels, and it looks like the _DAQ versions of those channels work, or at least I get 0's out of C1:TST-ETMY_ASCPIT_OUT_DAQ (which is set in the C1TST.ini file).

I've added the SENSOR channels back to the /cvs/cds/caltech/chans/daq/C1TST.ini file, and those are again working in data viewer.

At this point, I'm leaving megatron roughly in the same state as last night, and am going to wait on a response from Alex.

  2305   Fri Nov 20 11:01:58 2009 | josephb, alex | Configuration | Computers | Where to find RFM offsets

Alex checked out the old rts (which he is no longer sure how to compile) from CVS to megatron, to the directory:

/home/controls/cds/rts/

In /home/controls/cds/rts/src/include you can find the various .h files used.  Similarly, src/fe has the .c files.

In the h files, you can work out the memory offset by noting the primary offset in iscNetDsc40m.h

A line like suscomms.pCoilDriver.extData[0] determines an offset to look for.

0x108000 (from suscomms)

Then pCoilDriver.extData[#] determines a further offset.

sizeof(extData[0]) = 8240  (for the 40m - you need to watch the ifdefs; we were looking at the wrong structure for a while, which was much smaller).

DSC_CD_PPY is the structure you need to look in to find the final offset to add to get any particular channel you want to look at.

The number for ETMX is 8, ETMY 9 (this is in extData), so the extData offset from 0x108000 for ETMY should be 9 * 8240.  These numbers (i.e. 8 = ETMX, 9 = ETMY) can be found in losLinux.c in /home/controls/cds/rts/src/fe/40m/.  There's a bunch of #ifdef and #endif which define ETMX, ETMY, RMBS, ITM, etc.  You're looking for the offset in those.

So for the ETMY LSC channel (which is a double) you add 0x108000 (a hex number) + (9 * 8240 + 24) (a decimal number, which needs converting to hex) to get the final value of 0x11a160 (in hex).
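The bookkeeping above is just base + index * element size + field offset. A sketch of the arithmetic, using the 8240-byte extData element size quoted earlier; treat the constants as illustrative rather than authoritative, since they come from hand inspection of the headers.

```python
# Sketch of the RFM address computation described above.
def rfm_offset(base, index, element_size, field_offset):
    """Byte address of extData[index].field on the RFM network."""
    return base + index * element_size + field_offset

# ETMY (index 9), LSC input at byte 24 into the structure (assumed):
addr = rfm_offset(0x108000, 9, 8240, 24)
print(hex(addr))
```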

-----------

A useful program to interact with the RFM network can be found on fb40m.  If you log in and go to:

/usr/install/rfm2g_solaris/vmipci/sw-rfm2g-abc-005/util/diag

you can then run rfm2g_util, give it a 3, then type help.

You can use this to read data.  Just type help read.  We played around with some offsets and various channels until we were sure we had the offsets right.  For example, we put a fixed offset into the ETMY LSC input and saw the corresponding memory location change to that value.  This utility may also be useful when we do the RFM test to check the integrity of the ring, as there are some diagnostic options available inside it.

  2306   Fri Nov 20 11:14:22 2009 | josephb, alex | Configuration | Computers | test points working on megatron and we may have filters with switch outputs built in

Alex looked at the channel definitions (visible in tpchn_C1.par) and noticed the rmid was 0.

However, in testpoint.par we had set the tst system to C-node1 instead of C-node0.  The final number in that name and the rmid need to be equal.  We have changed this, and the test points appear to be working now.

However, the confusing part is in the tst model, the gds_node_id is set to 1.  Apparently, the model starts counting at 1, while the code starts counting at 0, so when you edit the testpoint.par file by hand, you have to subtract one from whatever you set in the model.
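The off-by-one rule above can be written down as a quick consistency check. A sketch, assuming the testpoint.par format is the stanza shown in entry 2299 ([C-nodeN] sections with key=value lines):

```python
# The [C-nodeN] index in testpoint.par must be gds_node_id - 1:
# the model counts from 1, the code counts from 0.
import re

def expected_node_section(gds_node_id):
    return "C-node%d" % (gds_node_id - 1)

def parse_testpoint_par(text):
    """Map section name -> dict of key=value settings (format assumed)."""
    sections = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = re.match(r'\[([^\]]+)\]', line)
        if m:
            current = m.group(1)
            sections[current] = {}
        elif "=" in line and current:
            k, v = line.split("=", 1)
            sections[current][k.strip()] = v.strip()
    return sections

par = """[C-node0]
hostname=192.168.1.2
system=tst
"""
conf = parse_testpoint_par(par)
sec = expected_node_section(1)  # the tst model has gds_node_id = 1
print(sec in conf and conf[sec]["system"] == "tst")  # True
```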

In other news, Alex pointed me at a part in CDS_PARTS.mdl, under filters, called "IIR FM with controls".  It's a light green module with 2 inputs and 2 outputs.  While the 2nd input and output look like they connect to ground, they should be interpreted by the RCG to do the right thing.  Alex wasn't positive it works, but it's worth trying it to see whether the 2nd output corresponds to a usable filter on/off switch that could be connected to the binary I/O to control the analog dewhitening.  However, I'm not sure it has the sophistication to wait for a zero crossing or anything like that - at the moment it just looks like a simple on/off switch based on which filters are on/off.

  2315   Mon Nov 23 17:53:08 2009 | Jenne | Update | Computers | 40m frame builder backup acting funny

As part of the fb40m restart procedure (Sanjit and I were restarting it to add some new channels so they can be read by the OAF model), I checked up on how the backup has been going.  Unfortunately the answer is: not well.

Alan imparted to me all the wisdom of frame builder backups on September 28th of this year.  Except for the first 2 days of something having gone wrong (which was fixed at that time), the backup script hasn't thrown any errors, and thus hasn't sent any whiny emails to me.  This is seen by opening up /caltech/scripts/backup/rsync.backup.cumlog , and noticing that  after October 1, 2009, all of the 'errorcodes' have been zero, i.e. no error (as opposed to 'errorcode 2' when the backup fails).  

However, when you ssh to the backup server to see what .gwf files exist, the last one is at gps time 941803200, which is Nov 9 2009, 11:59:45 UTC.  So, I'm not sure why no errors have been thrown, but also no backups have happened. Looking at the rsync.backup.log file, it says 'Host Key Verification Failed'.  This seems like something which isn't changing the errcode, but should be, so that it can send me an email when things aren't up to snuff.  On Nov 10th (the first day the backup didn't do any backing-up), there was a lot of Megatron action, and some adding of StochMon channels.  If the fb was restarted for either of these things, and the backup script wasn't started, then it should have had an error, and sent me an email.  Since any time the frame builder's backup script hasn't been started properly it should send an email, I'm going to go ahead and blame whoever wrote the scripts, rather than the Joe/Pete/Alberto team.
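A check like the one described above - flagging any logged run with a nonzero errorcode - could be sketched as follows. The one-run-per-line "date errorcode" format is an assumption for illustration, not the actual rsync.backup.cumlog layout.

```python
# Sketch: scan a cumulative backup log for runs with nonzero errorcodes.
def failed_runs(lines):
    """Return (date, code) for every logged run with a nonzero errorcode."""
    bad = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 2 and parts[-1].lstrip("-").isdigit():
            code = int(parts[-1])
            if code != 0:
                bad.append((parts[0], code))
    return bad

log = [
    "2009-10-01 0",
    "2009-10-02 2",   # a failed rsync would log errorcode 2
    "2009-10-03 0",
]
print(failed_runs(log))  # [('2009-10-02', 2)]
```

Note that this only catches reported failures; the silent case described above (errorcode 0 but no frames actually copied) still needs a separate check against the backup server itself.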

Since our new raid disk holds ~28 days of local storage, we won't have lost anything on the backup server as long as the backup works tonight (or sometime in the next few days), because the backup is an rsync, and it copies anything it hasn't already copied.  Since the fb got restarted just now, hopefully whatever funny business (maybe with the .agent files???) will be gone, and the backup will work properly.

I'll check in with the frame builder again tomorrow, to make sure that it's all good.

  2322   Tue Nov 24 16:06:45 2009 | Jenne | Update | Computers | 40m frame builder backup acting funny

Quote:

As part of the fb40m restart procedure (Sanjit and I were restarting it to add some new channels so they can be read by the OAF model), I checked up on how the backup has been going.  Unfortunately the answer is: not well.

I'll check in with the frame builder again tomorrow, to make sure that it's all good.

 All is well again in the world of backups.  We are now up to date as of ~midnight last night. 

  2330   Wed Nov 25 11:10:05 2009 | Jenne | Update | Computers | 40m frame builder backup acting funny

Quote:

Quote:

As part of the fb40m restart procedure (Sanjit and I were restarting it to add some new channels so they can be read by the OAF model), I checked up on how the backup has been going.  Unfortunately the answer is: not well.

I'll check in with the frame builder again tomorrow, to make sure that it's all good.

 All is well again in the world of backups.  We are now up to date as of ~midnight last night. 

 Backup Fail.  At least this time, however, it threw the appropriate error code and sent me an email saying that it was unhappy.  Alan said he was going to check in with Stuart regarding the confusion with the ssh-agent.  (The other day, when I did a ps -ef | grep agent, there were ~5 ssh-agents running, which could have been the cause of the backups failing without telling me they had failed.  The main symptom is that right after I restart all of the ssh-agent stuff, according to the directions in the Restart fb40m Procedures, I can do a test ssh over to ldas-cit to see what frames are there.  If I log out of the frame builder and log back in, then I can no longer ssh to ldas-cit without a password.  This shouldn't happen....the ssh-agent is supposed to authenticate the connection so no passwords are necessary.)

I'm going to restart the backup script again, and we'll see how it goes over the long weekend. 

  2347   Mon Nov 30 11:45:54 2009 | Jenne | Update | Computers | Wireless is back

When Alberto was parting the Red Sea this morning, and turning it green, he noticed that the wireless had gone sketchy.

When I checked it out, the ethernet light was definitely blinking, indicating that it was getting signal.  So this was not the usual case of a bad cable/connector, which is a known problem for our wireless (one of these days we should probably re-lay that ethernet cable....but not today).  After power cycling and replugging the ethernet cable, the light for the 2.4GHz wireless was blinking, but the 5GHz wasn't.  Since the wireless still wasn't working, I checked the advanced configuration settings, as described by Yoichi's wiki page:  40m Network Page

The settings had the 5GHz disabled, while Yoichi's screenshots of his settings showed it enabled.  Immediately after enabling the 5GHz, I was able to use the laptop at Alberto's length measurement setup to get online.  I don't know how the 5GHz got disabled, unless that happened during the power cycle (which I doubt, since no other settings were lost), but it's all better now.

 

  2348   Mon Nov 30 16:23:51 2009 | Jenne | Update | Computers | c1omc restarted

I found the FEsync light on the OMC GDS screen red.  I power cycled C1OMC, and restarted the front end code and the tpman.  I assume this is a remnant of the bootfest of the morning/weekend, and the omc just got forgotten earlier today.

  2364   Tue Dec 8 09:18:07 2009 | Jenne | Update | Computers | A *great* way to start the day....

Opening of ETMY has been put on hold to deal with the computer situation.  Currently all front end computers are down.  The DAQ AWGs are flashing green, but everything else is red (fb40m is also green).  Anyhow, we'll deal with this, and open ETMY as soon as we can.

The computers take priority because we need them to tell us how the optics are doing while we're in the chambers, futzing around.  We need to be sure we're not overly kicking up the suspensions.

  2365   Tue Dec 8 10:20:33 2009 | Alberto | DAQ | Computers | Bootfest successfully completed

Alberto, Kiwamu, Koji,

this morning we found the RFM network and all the front-ends down.

To fix the problem, we first tried a soft strategy, that is, we tried to restart C0DAQCTRL and C1DCUEPICS alone, but it didn't work.

We then went for a big bootfest. We first powered off fb40m, C1DCUEPICS, and C0DAQCTRL, and reset the RFM Network switch. Then we rebooted them in the same order in which we had turned them off.

Then we power cycled and restarted all the front-ends.

Finally we restored all the burt snapshots to Monday Dec 7th at 20:00.

  2376   Thu Dec 10 08:40:12 2009 | Alberto | Update | Computers | Front-ends down

I found all the front-ends except C1SUSVME1 and C0DCU1 down this morning.  DAQAWG shows up green on the C0DAQ_DETAIL screen, but it is in a "bad" status.

I'll go for a big boot fest.

  2378   Thu Dec 10 08:50:33 2009 | Alberto | Update | Computers | Front-ends down

Quote:

I found all the front-ends except C1SUSVME1 and C0DCU1 down this morning.  DAQAWG shows up green on the C0DAQ_DETAIL screen, but it is in a "bad" status.

I'll go for a big boot fest.

Since I wanted to single out the faulty system when these situations occur, I tried to reboot the computers one by one.

1) I reset the RFM Network by pushing the reset button on the bypass switch on the 1Y7 rack.  Then I tried to bring C1SOSVME up by power-cycling and restarting it as in the procedure in the wiki.  I repeated the procedure a second time, but it didn't work.  At some point in the restart process I got the error message "No response from EPICS".
2) I also tried rebooting only C1DCUEPICS, but it didn't work: I kept getting the same response when restarting C1SOSVME.
3) I tried to reboot C0DAQCTRL and C1DCU1 by power cycling their crate; power-cycled and restarted C1SOSVME. Nada. Same response from C1SOSVME.
4) I restarted the framebuilder;  power-cycled and restarted C1SOSVME. Nothing. Same response from C1SOSVME.
5) I restarted the framebuilder, then rebooted C0DAQCTRL and C1DCU, then power-cycled and restarted C1SOSVME. Niente. Same response from C1SOSVME.
 
Then I did the so-called "Nuclear Option", the only solution that has so far proven to work in these circumstances. I executed the steps in the order they are listed, waiting for each step to complete before moving to the next one.
0) Switch off: the frame builder, the C0DAQCTRL and C1DCU crate, C1DCUEPICS
1) turn on the frame builder
2) reset the RFM Network switch on 1Y7 (although it's not clear whether this step is really necessary, it costs nothing)
3) turn on C1DCUEPICS
4) turn on the C0DAQCTRL and C1DCU crate
5) power-cycle and restart the single front-ends
6) burt-restore all the snapshots
 
When I tried to restart C1SOSVME by power-cycling it, I still got the same response: "No response from EPICS". But after I reset C1SUSVME1 and C1SUSVME2, I was able to restart C1SOSVME.
 
It turned out that while I was checking the efficacy of the steps of the Grand Reboot to single out the crucial one, I was getting fooled by C1SOSVME's status. C1SOSVME was stuck, hanging on C1SUSVME1 and C1SUSVME2.
 
So the Nuclear Option is still unproven as the only working procedure. It might not be necessary.
 
Maybe restarting BOTH RFM switches, the one in 1Y7 and the one in 1Y6, would be sufficient. Or maybe just power-cycling C0DAQCTRL and C1DCU1 is enough. This has to be confirmed the next time we run into the same problem.