40m Log
ID   Date   Author   Type   Category   Subject
  9822   Thu Apr 17 11:00:54 2014   jamie   Update   CDS   failed attempt to get Dolphin working on c1ioo

I've been trying to get c1ioo on the Dolphin network, but have not yet been successful.

Background: if we can put the c1ioo machine on the fast Dolphin IPC network, we can essentially eliminate latencies between the c1als model and the c1lsc model, which are currently connected via a Rube Goldberg-esque c1lsc->dolphin->c1sus->rfm->c1ioo configuration.

Rolf gave us a Dolphin host adapter card, and we purchased a Dolphin fiber cable to run from the 1X2 rack to the 1X4 rack where the Dolphin switch is.

Yesterday I installed the dolphin card into c1ioo.  Unfortunately, c1ioo, which is a Sun Fire X4600 and therefore different from the rest of the front end machines, doesn't seem to be recognizing the card.  The /etc/dolphin_present.sh script, which is supposed to detect the presence of the card by grep'ing for the string 'Stargen' in the lspci output, returns nothing.
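
For reference, the detection idea can be sketched in a few lines of Python. This is only an illustration of "look for 'Stargen' in the lspci output"; it is not the contents of /etc/dolphin_present.sh, which isn't reproduced in this log. The function names and the sample lspci lines are made up for the example.

```python
import subprocess

def dolphin_present(lspci_output, marker="Stargen"):
    """Return True if any line of the lspci output mentions the marker string."""
    return any(marker in line for line in lspci_output.splitlines())

def read_lspci():
    """Run lspci and return its text output (needs pciutils on the host)."""
    return subprocess.run(
        ["lspci"], capture_output=True, text=True, check=True
    ).stdout

# Hypothetical sample lines, invented for illustration (not copied from c1ioo).
sample_without = "00:00.0 Host bridge: Example Corp Host Bridge\n"
sample_with = sample_without + "0b:00.0 Bridge: Stargen Inc. Example Device\n"

print(dolphin_present(sample_without))
print(dolphin_present(sample_with))
```

On a real host one would call dolphin_present(read_lspci()); an empty match, as seen here on c1ioo, means the card is not visible on the PCI bus at all.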

I've tried moving the card to different PCIe slots, as well as swapping it out with another Dolphin host adapter that we have.  Neither worked.

I looked at the Dolphin host adapter installed in c1lsc and it's quite different, presumably a newer or older model.  Not sure if that has anything to do with anything.

I'm contacting Rolf to see if he has any other ideas.

  9824   Thu Apr 17 16:59:45 2014   jamie   Update   CDS   slightly more successful attempt to get Dolphin working on c1ioo

So it turns out that the card that Rolf had given me was not a Dolphin host adapter after all.  He did have an actual host adapter board on hand, though, and kindly let us take it.  And this one works!

I installed the new board in c1ioo, and it recognized it.  Upon boot, the dolphin configuration scripts managed to automatically recognize the card, load the necessary kernel modules, and configure it.  I'll describe below how I got everything working.

However, at some point mx_stream stopped working on c1ioo.  I have no idea why, and it shouldn't be related to any of this dolphin stuff at all.  But given that mx_stream stopped working at the same time the dolphin stuff started working, I didn't take any chances and completely backed out all the dolphin stuff on c1ioo, including removing the dolphin host adapter from the chassis altogether.  Unfortunately that didn't fix any of the mx_stream issues, so mx_stream continues to not work on c1ioo.  I'll follow up in a separate post about that.  In the meantime, here's what I did to get dolphin working on c1ioo:

c1ioo Dolphin configuration

To get the new host recognized on the Dolphin network, I had to make a couple of changes to the dolphin manager setup on fb.  I referenced the following page:

https://cdswiki.ligo-la.caltech.edu/foswiki/bin/view/CDS/DolphinHowTo

Below are the two patches I made to the dolphin ("dis") config files on fb:

--- /etc/dis/dishosts.conf.bak    2014-04-17 09:31:08.000000000 -0700
+++ /etc/dis/dishosts.conf    2014-04-17 09:28:27.000000000 -0700
@@ -26,6 +26,8 @@
 ADAPTER:  c1sus_a0 8 0 4
 HOSTNAME: c1lsc
 ADAPTER:  c1lsc_a0 12 0 4
+HOSTNAME: c1ioo
+ADAPTER:  c1ioo_a0 16 0 4
 
 # Here we define a socket adapter in single mode.
 #SOCKETADAPTER: sockad_0 SINGLE 0

--- /etc/dis/networkmanager.conf.bak    2014-04-17 09:30:40.000000000 -0700
+++ /etc/dis/networkmanager.conf    2014-04-17 09:30:48.000000000 -0700
@@ -39,7 +39,7 @@
 # Number of nodes in X Dimension. If you are using a single ring, please
 # specify number of nodes in ring.
 
--dimensionX 2;
+-dimensionX 3;
 
 # Number of nodes in Y Dimension.

I then had to restart the DIS network manager for these changes to take effect:

$ sudo /etc/init.d/dis_networkmgr restart

I then rebooted c1ioo one more time, after which c1ioo showed up in the dxadmin GUI.

At this point I tried adding a dolphin IPC connection between c1als and c1lsc to see if it worked.  Unfortunately everything crashed every time I tried to run the models (including models on other machines!).  The problem was that I had forgotten to tell the c1ioo IOP (c1x03) to use PCIe RFM (i.e. Dolphin).  This is done by adding the following flag to the cdsParameters block in the IOP:

pciRfm=1

Once this was added and the IOP was rebuilt/installed/restarted, it came back up fine.  The c1als model with the dolphin output also came up fine.

However, at this point I ran into the c1ioo mx_stream problem and started backing everything out.

 

  9825   Thu Apr 17 17:15:54 2014   jamie   Update   CDS   mx_stream not starting on c1ioo

While trying to get dolphin working on c1ioo, the c1ioo mx_stream processes mysteriously stopped working.  The mx_stream process itself just won't start now.  I have no idea why, or what could have happened to cause this change.  I was working on PCIe dolphin stuff, but have since backed out everything that I had done, and still the c1ioo mx_stream process will not start.

mx_stream relies on the open-mx kernel module, but that appears to be fine:

controls@c1ioo ~ 0$ /opt/open-mx/bin/omx_info  
Open-MX version 1.3.901
 build: root@fb:/root/open-mx-1.3.901 Wed Feb 23 11:13:17 PST 2011

Found 1 boards (32 max) supporting 32 endpoints each:
 c1ioo:0 (board #0 name eth1 addr 00:14:4f:40:64:25)
   managed by driver 'e1000'
   attached to numa node 0

Peer table is ready, mapper is 00:30:48:d6:11:17
================================================
  0) 00:14:4f:40:64:25 c1ioo:0
  1) 00:30:48:d6:11:17 c1iscey:0
  2) 00:25:90:0d:75:bb c1sus:0
  3) 00:30:48:be:11:5d c1iscex:0
  4) 00:30:48:bf:69:4f c1lsc:0
controls@c1ioo ~ 0$ 

However, trying to start mx_stream now fails:

controls@c1ioo ~ 0$ /opt/rtcds/caltech/c1/target/fb/mx_stream -s c1x03 c1ioo c1als -d fb:0
c1x03
mmapped address is 0x7f885f576000
mapped at 0x7f885f576000
send len = 263596
OMX: Failed to find peer index of board 00:00:00:00:00:00 (Peer Not Found in the Table)
mx_connect failed
controls@c1ioo ~ 1$ 

I'm not quite sure how to interpret this error message.  The "00:00:00:00:00:00" has the form of a 48-bit MAC address that would be used for a hardware identifier, à la the second column of the OMX "peer table" above, although of course all zeros is not an actual address.  So there's some disconnect between mx_stream and the actual omx configuration stuff that's running underneath.
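
One thing the omx_info output above does show is that fb:0, the destination given to mx_stream with -d, is not listed in the peer table. A minimal Python sketch for checking that from the printed table follows; the parsing pattern is my own guess at the line format based only on the output shown here, not anything taken from Open-MX itself.

```python
import re

# Matches peer table rows such as "  0) 00:14:4f:40:64:25 c1ioo:0".
PEER_LINE = re.compile(
    r"^\s*(\d+)\)\s+((?:[0-9a-f]{2}:){5}[0-9a-f]{2})\s+(\S+)\s*$",
    re.IGNORECASE,
)

def parse_peer_table(omx_info_text):
    """Map peer name (e.g. 'c1ioo:0') to MAC address from omx_info output."""
    peers = {}
    for line in omx_info_text.splitlines():
        match = PEER_LINE.match(line)
        if match:
            peers[match.group(3)] = match.group(2).lower()
    return peers

# Peer table as printed by omx_info in this entry.
omx_info_text = """\
Peer table is ready, mapper is 00:30:48:d6:11:17
================================================
  0) 00:14:4f:40:64:25 c1ioo:0
  1) 00:30:48:d6:11:17 c1iscey:0
  2) 00:25:90:0d:75:bb c1sus:0
  3) 00:30:48:be:11:5d c1iscex:0
  4) 00:30:48:bf:69:4f c1lsc:0
"""

peers = parse_peer_table(omx_info_text)
target = "fb:0"  # destination passed to mx_stream with -d
print(target in peers)
print(sorted(peers))
```

If the destination has no entry in the table, a lookup of its address has nothing to return, which would be one plausible source of the all-zero address in the error message.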

Again, I have no idea what happened.  I spoke to Rolf and he's going to try to help sort this out tomorrow.

  9826   Thu Apr 17 17:22:32 2014   Jenne   Update   CDS   mx_stream not starting on c1ioo, locking okay

Jamie tells me that the 2 big consequences of this are (a) we are not archiving any data that is collected on the ioo machine, and (b) that we will not have access to test points on the IOO or ALS models.

To make sure that this is not a show-stopper for locking, I have locked the arms using ALS.  The signals seem to still be getting from the ALS model to the LSC model, and I'm able to acquire ALS lock, so we should be able to work tonight.  All of the data that I have been looking at lately has been coming off of the LSC machine, so we should even be okay in terms of look-back for lockloss studies, etc.

 

  9830   Fri Apr 18 14:00:48 2014   rolf   Update   CDS   mx_stream not starting on c1ioo

 

 To fix open-mx connection to c1ioo, had to restart the mx mapper on fb machine. Command is /opt/mx/sbin/mx_start_mapper, to be run as root. Once this was done, omx_info on c1ioo computer showed fb:0 in the table and mx_stream started back up on its own. 

  9831   Fri Apr 18 19:05:17 2014   jamie   Update   CDS   mx_stream not starting on c1ioo

Quote:

To fix open-mx connection to c1ioo, had to restart the mx mapper on fb machine. Command is /opt/mx/sbin/mx_start_mapper, to be run as root. Once this was done, omx_info on c1ioo computer showed fb:0 in the table and mx_stream started back up on its own. 

Thanks so much Rolf (and Keith)!

  9839   Tue Apr 22 01:39:57 2014   Jenne   Update   CDS   FB unhappy again

[Jenne, Q]

The frame builder (or something) is unhappy again.  I know that we've seen this before, but I can't find the elog entry that relates to this particular problem.

Every few minutes, the fb status lights on the CDS_STATUS screen go white, and then come back green.  It's annoying when it happens every hour or so (which is unfortunately typical), but it's pretty debilitating when it stops dataviewer and dtt every few minutes.  Just from the way the lights change, it looks like perhaps the daqd process is restarting itself periodically? 

  9879   Wed Apr 30 14:21:50 2014   manasa   Update   CDS   fb restarted

c1sus and c1iscey were not talking to fb. The usual mxstream restart did not help.

Restarted fb

>>ssh fb

>>telnet fb 8087
shutdown

All lights on the FE status screen are green now.

Note that Steve did an mxstream restart earlier today because the same machines, c1sus and c1iscey, were not talking to fb.

  9881   Wed Apr 30 17:07:19 2014   jamie   Update   CDS   c1ioo now on Dolphin network

The c1ioo host is now fully on the dolphin network!

After the mx stream issue from two weeks ago was resolved and determined to not be due to the introduction of dolphin on c1ioo, I went ahead and re-installed the dolphin host adapter card on c1ioo.  The Dolphin network configuration changes I made during the first attempt (see previous log in thread) were still in place.  Once I rebooted the c1ioo machine, everything came up fine:

dolphin.png

We then tested the interface by making a cdsIPCx-PCIE connection between the c1ioo/c1als model and the c1lsc/c1lsc model for the ALS-X beat note fine phase signal.  We then locked both ALS X and Y, and compared the signals against the existing ALS-Y beat note phase connection that passes through c1sus/c1rfm via an RFM IPC:

ALSXonDolphin.pdf

The signal is perfectly coherent and we've gained ~25 degrees of phase at 1 kHz.  EricQ calculates that the delay for this signal has changed from:

122 us -> 61 us 
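
As a rough cross-check of the two numbers, a pure time delay tau gives a phase lag of 360 * f * tau degrees at frequency f. The sketch below assumes the whole change is a pure delay, which is my simplification and not part of EricQ's calculation:

```python
def phase_lag_deg(delay_s, freq_hz):
    """Phase lag in degrees of a pure time delay at a given frequency."""
    return 360.0 * freq_hz * delay_s

old_delay = 122e-6  # seconds, path through c1sus/c1rfm
new_delay = 61e-6   # seconds, direct Dolphin path
freq = 1e3          # Hz

gain = phase_lag_deg(old_delay, freq) - phase_lag_deg(new_delay, freq)
print(round(gain, 1))
```

Removing 61 us of delay corresponds to about 22 degrees at 1 kHz, in reasonable agreement with the ~25 degrees read off the plot.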

I then went ahead and made the needed modifications for ALS-Y as well, and removed ALS->LSC stuff in the c1rfm model.

Next up: move the RFM card from the c1sus machine to the c1lsc machine, and eliminate c1sus/c1rfm model entirely.

  9882   Wed Apr 30 17:45:34 2014   jamie   Update   CDS   c1ioo now on Dolphin network

For reference, here are the new IPC entries that were made for the ALS X/Y phase between c1als and c1lsc:

controls@fb ~ 0$ egrep -A5 'C1:ALS-(X|Y)_PHASE' /opt/rtcds/caltech/c1/chans/ipc/C1.ipc
[C1:ALS-Y_PHASE]
ipcType=PCIE
ipcRate=16384
ipcHost=c1ioo
ipcNum=114
desc=Automatically generated by feCodeGen.pl on 2014_Apr_17_14:27:41
--
[C1:ALS-X_PHASE]
ipcType=PCIE
ipcRate=16384
ipcHost=c1ioo
ipcNum=115
desc=Automatically generated by feCodeGen.pl on 2014_Apr_17_14:28:53
controls@fb ~ 0$ 

After all this IPC cleanup is done we should go through and clean out all the defunct entries from the C1.ipc file.
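
A minimal sketch of how such a cleanup pass could start, assuming the C1.ipc file parses as plain INI text the way the excerpt above suggests (I have not checked the whole file). The third entry and the in_use set are invented for the example; deciding which channels are really still in use would require going through the models.

```python
import configparser

def load_ipc(text):
    """Parse C1.ipc-style text, keeping the original key capitalization."""
    # strict=False tolerates repeated sections or keys instead of raising.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep 'ipcType' etc. as written
    parser.read_string(text)
    return parser

def unused_entries(parser, in_use):
    """Return the section names that are not in the given in-use set."""
    return [name for name in parser.sections() if name not in in_use]

ipc_text = """\
[C1:ALS-Y_PHASE]
ipcType=PCIE
ipcRate=16384
ipcHost=c1ioo
ipcNum=114
desc=Automatically generated by feCodeGen.pl on 2014_Apr_17_14:27:41

[C1:ALS-X_PHASE]
ipcType=PCIE
ipcRate=16384
ipcHost=c1ioo
ipcNum=115
desc=Automatically generated by feCodeGen.pl on 2014_Apr_17_14:28:53

# Hypothetical stale entry, invented for this example (not in the real file)
[C1:EXAMPLE-OLD_SIGNAL]
ipcType=RFM
ipcRate=16384
ipcHost=c1example
ipcNum=999
"""

parser = load_ipc(ipc_text)
in_use = {"C1:ALS-X_PHASE", "C1:ALS-Y_PHASE"}  # hypothetical list from the models
unused = unused_entries(parser, in_use)
print(unused)
```

The output is only a list of candidates to review by hand; nothing here edits the file.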

  9883   Wed Apr 30 18:06:06 2014   jamie   Update   CDS   POP QPD signals now on dolphin

The POP QPD X/Y/SUM signals, which are acquired in c1ioo, are now being broadcast over dolphin.  c1ass was modified to pick them up there as well:

c1ioo-POPQPD.png   c1ass-POPQPD.png

Here are the new IPC entries:

controls@fb ~ 0$ egrep -A5 'C1:IOO-POP' /opt/rtcds/caltech/c1/chans/ipc/C1.ipc
[C1:IOO-POP_QPD_SUM]
ipcType=PCIE
ipcRate=16384
ipcHost=c1ioo
ipcNum=116
desc=Automatically generated by feCodeGen.pl on 2014_Apr_30_17:33:22
--
[C1:IOO-POP_QPD_X]
ipcType=PCIE
ipcRate=16384
ipcHost=c1ioo
ipcNum=117
desc=Automatically generated by feCodeGen.pl on 2014_Apr_30_17:33:22
--
[C1:IOO-POP_QPD_Y]
ipcType=PCIE
ipcRate=16384
ipcHost=c1ioo
ipcNum=118
desc=Automatically generated by feCodeGen.pl on 2014_Apr_30_17:33:22
controls@fb ~ 0$ 

Both c1ioo and c1ass were rebuilt/installed/restarted, and everything came up fine.

The corresponding cruft was removed from c1rfm, which was also rebuilt/installed/restarted.

  9890   Thu May 1 10:23:42 2014   jamie   Update   CDS   c1ioo dolphin fiber nicely routed

Steve and I nicely routed the dolphin fiber from c1ioo in the 1X2 rack to the dolphin switch in the 1X4 rack.  I shut down c1ioo before removing the fiber, but all the dolphin-connected models still crashed.  After the fiber was run, I brought back c1ioo and restarted all wedged models.  Everything is green again:

green.png

  9896   Fri May 2 01:01:28 2014   rana   Update   CDS   c1ioo dolphin fiber nicely routed

This C1IOO business seems to be wiping out the MC2_TRANS QPD servo settings each day.   What kind of BURT is being done to recover our settings after each of these activities?

(also we had to do mxstream restart on c1sus twice so far tonight -- not unusual, just keeping track)

  9903   Fri May 2 11:14:47 2014   jamie   Update   CDS   c1ioo dolphin fiber nicely routed

Quote:

This C1IOO business seems to be wiping out the MC2_TRANS QPD servo settings each day.   What kind of BURT is being done to recover our settings after each of these activities?

(also we had to do mxstream restart on c1sus twice so far tonight -- not unusual, just keeping track)

I don't see how the work I did would affect this stuff, but I'll look into it.  I didn't touch the MC2 trans QPD signals.  Also nothing I did has anything to do with BURT.  I didn't change any channels, I only swapped out the IPCs.

  9910   Mon May 5 19:34:54 2014   jamie   Update   CDS   c1ioo/c1ioo control output IPCs changed to PCIE Dolphin

Now that c1ioo is on the Dolphin network, I changed the c1ioo MC{1,2,3}_{PIT,YAW} and MC{L,F} outputs to go out over the Dolphin network rather than the old RFM network.

Two models, c1mcs and c1oaf, are ultimately the consumers of these outputs.  Now they are picking up the new PCIE IPC channels directly, rather than from any sort of RFM/PCIE proxy hops.  This should improve the phase for these channels a bit, as well as reduce complexity and clutter.  More stuff was removed from c1rfm as well, moving us to the goal of getting rid of that model entirely.

c1ioo, c1mcs, and c1rfm were all rebuilt/installed/restarted, and all came back fine.  The mode cleaner relocked once we reenabled the autolocker.

c1oaf, on the other hand, is not building.  It's not building even before the changes I attempted, though.  I tried reverting c1oaf back to what is in the SVN (which also corresponds to what is currently running) and it doesn't compile either:

controls@c1lsc ~ 2$ rtcds build c1oaf
buildd: /opt/rtcds/caltech/c1/rtbuild
### building c1oaf...
Cleaning c1oaf...
Done
Parsing the model c1oaf...
YARM_BLRMS_SEIS_CLASS TP
YARM_BLRMS_SEIS_CLASS_EQ TP
YARM_BLRMS_SEIS_CLASS_QUIET TP
YARM_BLRMS_SEIS_CLASS_TRUCK TP
YARM_BLRMS_S_CLASS EpicsOut
YARM_BLRMS_S_CLASS_EQ EpicsOut
YARM_BLRMS_S_CLASS_QUIET EpicsOut
YARM_BLRMS_S_CLASS_TRUCK EpicsOut
YARM_BLRMS_classify_seismic FunctionCall
Please check the model for missing links around these parts.
make[1]: *** [c1oaf] Error 1
make: *** [c1oaf] Error 1
controls@c1lsc ~ 2$ 

I've been trying to debug it but have had no success.  For the time being I'm shutting off the c1oaf model, since it's now looking for bogus signals on RFM, until we can figure out what's wrong with it. 

  9911   Mon May 5 19:51:56 2014   jamie   Update   CDS   c1oaf model broken because of broken BLRMS block

I finally tracked down the problem with the c1oaf model to the BLRMS part:

/opt/rtcds/userapps/release/cds/common/models/BLRMS.mdl

blrms-hot-mess.png   sddefault.jpg

Note that this is pulling from a cds/common location, so presumably this is a part that's also being used at the sites.

Either there was an svn up that pulled in something new and broken, or the local version is broken, or who knows what.

We'll have to figure out what's going on here, but in the meantime, as I already mentioned, I'm leaving the c1oaf model off for now.

 RXA: also...we updated Ottavia to Ubuntu 12 LTS...but now it has no working network connection. Needs help.  (which of course has nothing whatsoever to do with this point )

  9915   Tue May 6 10:22:28 2014   steve   Update   CDS   c1ioo dolphin fiber

Quote:

Steve and I nicely routed the dolphin fiber from c1ioo in the 1X2 rack to the dolphin switch in the 1X4 rack.  I shut down c1ioo before removing the fiber, but all the dolphin-connected models still crashed.  After the fiber was run, I brought back c1ioo and restarted all wedged models.  Everything is green again:

green.png

 I put a label on the dolphin fiber end at 1X2 today.   After this I had to reset it, but it failed.

  9916   Tue May 6 10:31:58 2014   jamie   Update   CDS   c1ioo dolphin fiber

Quote:

I put a label on the dolphin fiber end at 1X2 today.   After this I had to reset it, but it failed.

 If by "fail" you're talking about the c1oaf model being off-line, I did that yesterday (see log 9910).  That probably has nothing to do with whatever you did today, Steve.

  9922   Wed May 7 16:31:12 2014   jamie   Update   CDS   cdsutils updated to version 226
controls@pianosa:~ 0$ cd /opt/rtcds/cdsutils/trunk/
controls@pianosa:/opt/rtcds/cdsutils/trunk 0$ svn update
...
At revision 226.
controls@pianosa:/opt/rtcds/cdsutils/trunk 0$ make
echo "__version__ = '226'" >lib/cdsutils/_version.py
echo "__version__ = '226'" >lib/ezca/_version.py
...
controls@pianosa:/opt/rtcds/cdsutils/trunk 0$ make ligo-install
python ./setup.py install --prefix=/ligo/apps/linux-x86_64/cdsutils-226
...
controls@pianosa:/opt/rtcds/cdsutils/trunk 0$ ln -sfn cdsutils-226 /ligo/apps/linux-x86_64/cdsutils
controls@pianosa:/opt/rtcds/cdsutils/trunk 0$ exit
...
controls@pianosa:~ 0$ cdsutils --version
cdsutils 226
controls@pianosa:~ 0$ 

  9923   Wed May 7 17:10:59 2014   rana   Update   CDS   cdsutils updated to version 226

 This upgrade from Jamie has given us the new apps (avg, servo, and trigservo). We should figure out if there's a way to integrate Masayuki's work, so that we can have a 'cdsutils demod' function too.

  9924   Wed May 7 22:47:33 2014   rana   Update   CDS   cdsutils updated to version 226: not working on pianosa or rossa

 controls@rossa:~ 0$ cdsutils read C1:LSC-DARM_GAIN
Traceback (most recent call last):
  File "/usr/lib/python2.6/runpy.py", line 122, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.6/runpy.py", line 34, in _run_code
    exec code in run_globals
  File "/ligo/apps/cdsutils-226/lib/python2.6/site-packages/cdsutils/__main__.py", line 57, in <module>
    imp.load_module('__main__', f, pathname, description)
  File "/ligo/apps/cdsutils-226/lib/python2.6/site-packages/cdsutils/read.py", line 32, in <module>
    print ezca.Ezca(prefix).read(rest, as_string=args.as_string)
  File "/ligo/apps/linux-x86_64/cdsutils-226/lib/python2.6/site-packages/ezca/cached.py", line 17, in __call__
    key = (args, tuple(kwargs.viewitems()))
AttributeError: 'dict' object has no attribute 'viewitems'

  9926   Wed May 7 23:30:21 2014   jamie   Update   CDS   cdsutils should be working now

Should be fixed now.  There were python2.6 compatibility issues, which only show up on these old distros (e.g. ubuntu 10.04).
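
For context, dict.viewitems() was added in Python 2.7, which is why the cached.py line in the traceback fails under 2.6. Below is a minimal sketch of a cache key that avoids it. This is not the actual change made in cdsutils; cache_key, cached, and open_channel are hypothetical names for the example.

```python
import functools

def cache_key(args, kwargs):
    """Build a hashable key from call arguments without dict.viewitems()."""
    # dict.items() exists in Python 2.6, 2.7 and 3; sorting makes the key
    # independent of keyword argument order.
    return (args, tuple(sorted(kwargs.items())))

def cached(func):
    """Memoize func on its positional and keyword arguments."""
    store = {}

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        key = cache_key(args, kwargs)
        if key not in store:
            store[key] = func(*args, **kwargs)
        return store[key]

    return wrapper

calls = []

@cached
def open_channel(name, as_string=False):  # hypothetical stand-in function
    calls.append(name)
    return (name, as_string)

open_channel("C1:LSC-DARM_GAIN", as_string=True)
open_channel("C1:LSC-DARM_GAIN", as_string=True)
print(len(calls))
```

The second call is served from the cache, so the wrapped function only runs once.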

controls@pianosa:~ 0$ cdsutils read C1:LSC-DARM_GAIN
0.0
controls@pianosa:~ 0$ cdsutils --version
cdsutils 230
controls@pianosa:~ 0$ 
  9928   Thu May 8 01:33:21 2014   ericq   Update   CDS   python issues

On pianosa: The ezca.Ezca class somehow initializes with its prefix set to "C1:", even though the docstring says the default is None. This makes existing scripts act wonky, because they're looking for channels like "C1:C1:FO-BLAH".

In ligo/apps/linux-x86_64, I ran ln -sfn cdsutils-old cdsutils to get the old version back for now, so I don't have to edit all of our up/down scripts.

Also, Chiara can't find the epics package when I try to load Ezca. It exists in '/usr/lib/pymodules/python2.6/epics/__init__.pyc' on pianosa, but there is no corresponding 2.7 folder on chiara.

 

  9931   Thu May 8 15:55:43 2014   jamie   Update   CDS   python issues

Quote:

On pianosa: The ezca.Ezca class somehow initializes with its prefix set to "C1:", even though the docstring says the default is None. This makes existing scripts act wonky, because they're looking for channels like "C1:C1:FO-BLAH".

In ligo/apps/linux-x86_64, I ran ln -sfn cdsutils-old cdsutils to get the old version back for now, so I don't have to edit all of our up/down scripts.

Also, Chiara can't find the epics package when I try to load Ezca. It exists in '/usr/lib/pymodules/python2.6/epics/__init__.pyc' on pianosa, but there is no corresponding 2.7 folder on chiara.

I just pushed a fix to ezca to allow for having a truly empty prefix even if the IFO env var is set:

controls@pianosa:~ 0$ ipython
Python 2.6.5 (r265:79063, Feb 27 2014, 19:43:51) 
Type "copyright", "credits" or "license" for more information.

IPython 0.10 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object'. ?object also works, ?? prints more.

In [1]: import ezca

In [2]: ezca.Ezca()
Out[2]: Ezca(prefix='C1:')

In [3]: ezca.Ezca(ifo=None)
Out[3]: Ezca(prefix='')

In [4]: ezca.Ezca(ifo=None).read('C1:LSC-DARM_GAIN')
Out[4]: 0.0

This is in cdsutils r232, which I just installed at the 40m.  I linked it in as well, so it's now the default version.  You will have to make a modification to any python scripts utilizing the Ezca object, but now it's a much smaller change (just in the invocation line):

-ca = ezca.Ezca()
+ca = ezca.Ezca(ifo=None)
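
To see why scripts that already use full channel names need the empty prefix, here is a toy model of prefixing. It is only an illustration of the doubled "C1:C1:" names reported above; full_channel is a hypothetical helper and not how ezca is implemented.

```python
def full_channel(prefix, name):
    """Join a prefix and a channel name the way a simple prefixing wrapper might."""
    return (prefix or "") + name

# Default prefix plus a script that already spells out the full name.
print(full_channel("C1:", "C1:LSC-DARM_GAIN"))
# Empty prefix, as with ifo=None, plus a full name.
print(full_channel("", "C1:LSC-DARM_GAIN"))
# Default prefix plus a short name.
print(full_channel("C1:", "LSC-DARM_GAIN"))
```

Only the first combination produces a channel name that doesn't exist, so existing scripts with full names work once the prefix is empty.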

 

  9949   Tue May 13 17:45:21 2014   rana   Update   CDS   /frames space cleared up, daqd stabilized

 

 Late last night we were getting some problems with DAQD again. Turned out to be /frames getting full again.

I deleted a bunch of old frame files by hand around 3AM to be able to keep locking quickly and then also ran the wiper script (target/fb/wiper.pl).

controls@pianosa|fb> df -h; date
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             440G  9.7G  408G   3% /
none                  7.9G  288K  7.9G   1% /dev
none                  7.9G  464K  7.9G   1% /dev/shm
none                  7.9G  144K  7.9G   1% /var/run
none                  7.9G     0  7.9G   0% /var/lock
none                  7.9G     0  7.9G   0% /lib/init/rw
none                  440G  9.7G  408G   3% /var/lib/ureadahead/debugfs
linux1:/home/cds      1.8T  1.4T  325G  82% /cvs/cds
linux1:/ligo           71G   18G   50G  27% /ligo
linux1:/home/cds/rtcds
                      1.8T  1.4T  325G  82% /opt/rtcds
fb:/frames             13T   12T  559G  96% /frames
linux1:/home/cds/caltech/users
                      1.8T  1.4T  325G  82% /users
Tue May 13 17:35:00 PDT 2014

Looking through the directories by hand it seems that the issue may be due to our FB MXstream instabilities. The wiper looks at the disk usage and tries to delete just enough files to keep us below 95% full for the next 24 hours. If, however, some of the channels are not being written because some front ends are not writing their DAQ channels to frames, then it will underestimate how much space the next day of frames needs. In particular, if it's currently writing small frames and then we restart the mxstream and the per-frame file size goes back up to 80 MB, it can make the disk full.

For now, I have modified the wiper.pl script to try to stay below 93%. As you can see by the above output of 'df', it is already above 96% and it still has files to write until the next run of wiper.pl 7 hours from now, at 6 AM.

If we assume that it's writing a 75 MB file every 16 seconds, then it would write 405 GB of frames every day. There is 559 GB free right now so we are OK for now. With 405 GB of usage per day, we have a lookback of ~12TB/405GB ~ 29 days (ignoring the trend files).
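
The arithmetic behind these numbers, written out as a short Python check. It uses the same assumptions as the paragraph above (75 MB per frame, one frame every 16 seconds, decimal units, trend files ignored); the variable names are just for the example.

```python
frame_mb = 75.0          # assumed size of one frame file, MB
frame_period_s = 16.0    # one frame file every 16 seconds
seconds_per_day = 86400

frames_per_day = seconds_per_day / frame_period_s
gb_per_day = frame_mb * frames_per_day / 1000.0

used_gb = 12000.0        # ~12 TB of frames on disk
free_gb = 559.0          # free space from the df output above

lookback_days = used_gb / gb_per_day
days_until_full = free_gb / gb_per_day

print(gb_per_day)
print(round(lookback_days, 1))
print(round(days_until_full, 2))
```

At that rate the current free space covers a bit more than one day of writing, so the wiper has to run daily to keep up.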

  9954   Wed May 14 17:36:32 2014   ericq   Update   CDS   New netgpib scripts for SR785

 I have redone the SPSR785 (spectrum measurement) and TFSR785 (TF measurement) commands in scripts/general/netgpibdata. This was mostly motivated by my frustration with typing out either a ton of command line arguments, or rooting around in the script itself; I'd rather just have a static file where I define the measurement, which is easier to keep track of. 

They currently take one argument: a parameter file where all the measurement details are specified. (i.e. IP address, frequencies, etc.) There are a few template files in the same directory that they use as default. (Such as TFSR785template.yml)

If you call the functions with the option '--template', it will copy a template file into your working directory for you to modify as you wish. "SPSR785 -h" gives you some information as well (currently minimal, but I'll be adding more)

In the parameter file, you can also ask for the data to be plotted (and saved as pdf) when the measurement is finished. In SPSR785, and soon TFSR785, you can specify a directory where the script will look for reference traces to plot along with the results, presuming they were taken with the same measurement parameters and have the same filename stem. 

I've tested both on Pianosa, and they seem to work as expected. 

Todo:

  • Add support for modifying some parameters at the command line 
  • Extend to the Agilent analyzer
  • Maybe the analyzer settings written to the output file should be verified by GPIB query, instead of writing out the intended settings. (I've never seen them go wrong, though)
  • Make sure that the analyzer has PSD units off when taking a TF. (Thought I could use resetSR785 for this, but there's some funkiness happening with that script currently.)
  • Possibly unify into one script that sees what kind of analyzer you're requesting, and then passes off to the device/measurement type specific script, so we don't have to remember many commands. 

Comments, criticism, and requests are very welcome. 

(P.S. all the random measurement files and plots that were in netgpibdata are now in netgpibdata/junk. I feel like this isn't really a good place to be keeping data. Old versions of the scripts I changed are in netgpibdata/oldScripts)

  9955   Thu May 15 01:42:07 2014   rana   Update   CDS   /frames space cleared up, daqd stabilized

 Script seems to be working now:

nodus:~>df -h | grep frames

fb:/frames              13T    12T   931G    93%    /frames

  9982   Wed May 21 13:18:47 2014   ericq   Update   CDS   Suspension MEDM Bug

I fixed a bug in the SUS_SINGLE screen, where the total YAW output was incorrectly displayed (TO_COIL_3_1 instead of TO_COIL_1_3). I noticed this by seeing that the yaw bias slider had no effect on the number that claimed to be the yaw sum. The first time I did this, I accidentally changed the screen size a bit which smushed things together, but that's fixed now.

I committed it to the svn, along with some uncommitted changes to the oplev servo screen.

  10010   Mon Jun 9 11:42:00 2014   Jenne   Update   CDS   Computer status

Current computer status:

All fast machines except c1iscey are up and running. I can't ssh to c1iscey, so I'll need to go down to the end station and have a look-see. On the c1lsc machine, neither the c1oaf nor the c1cal models are running (but for the oaf model, we know that this is because we need to revert the blrms block changes to some earlier version, see Jamie's elog 9911).

Daqd process is running on framebuilder.  However, when I try to open dataviewer, I get the popup error saying "Can't connect to rb", as well as an error in the terminal window that said something like "Error getting chan info".

Slow machines c1psl, c1auxex and c1auxey are not running (can't telnet to them, and white boxes on related medm screens for slow channels).  All other slow machines seem to be running, however nothing has been done to them to point them at the new location of the shared hard drive, so their status isn't ready to green-light yet.


Things that we did on Friday for the fast machines:

The shared hard drive is "physically" on Chiara, at /home/cds/.  Links are in place so that it looks like it's at the same place that it used to be:  /opt/rtcds/...... 

The first nameserver on all of the workstation machines inside of the file /etc/resolv.conf has been changed to be 192.168.113.104, which is Chiara's IP address (it used to be 192.168.113.20, which was linux1).  This change has also been made on the framebuilder, and in the framebuilder's /diskless/root/etc/resolv.conf file, which is what all of the fast front ends look to. 

On the framebuilder, and in the /diskless place for the fast front ends, presumably we must have changed something to point at the new location for the shared drive, but I don't remember how we did that [ERIC, what did we do???]


The slow front ends that we have tried changing have not worked out. 

First, we tried plugging a keyboard and monitor into c1auxey.  When we key the crate to reboot the machine, we get some error message about a "disk A drive error", but then it goes on to prompt pushing F1 for something, and F2 for entering setup.  No matter what we press, nothing happens.  c1auxey is still not running.

We were able to telnet into c1auxex, c1psl, and c1iool0.  On each of those machines, at the prompt, we used the command "bootChange".  This initially gives us a series of:

$ telnet c1susaux
Trying 192.168.113.55...
Connected to c1susaux.
Escape character is '^]'.

c1susaux > bootChange

'.' = clear field;  '-' = go to previous field;  ^D = quit

boot device          : ei
processor number     : 0
host name            : linux1
file name            : /cvs/cds/vw/mv162-262-16M/vxWorks
inet on ethernet (e) : 192.168.113.55:ffffff00
inet on backplane (b):
host inet (h)        : 192.168.113.20
gateway inet (g)     :
user (u)             : controls
ftp password (pw) (blank = use rsh):
flags (f)            : 0x0
target name (tn)     : c1susaux
startup script (s)   : /cvs/cds/caltech/target/c1susaux/startup.cmd
other (o)            :

value = 0 = 0x0
c1susaux >

If we go through that again (it comes up line-by-line, and you must press Enter to go to the next line) and put a period at the end of the Host Name line and the Host Inet (h) line, they will come up blank the next time around.  So, the next time you run bootChange, you can type "chiara" for the host name, and "192.168.113.104" for the "host inet (h)".  If you run bootChange one more time, you'll see that the new things are in there, so that's good.

However, when we then try to reboot the computer, I think the machines weren't coming back after this point.  (Unfortunately, this is one of those things that I should have elogged back on Friday, since I don't remember precisely).  Certainly whatever the effect was, it wasn't what I wanted, and I left with the machines that I had tried rebooting, not running.

  10011   Mon Jun 9 12:19:17 2014   ericq   Update   CDS   Computer status

Quote:

The first nameserver on all of the workstation machines inside of the file /etc/resolv.conf has been changed to be 192.168.113.104, which is Chiara's IP address (it used to be 192.168.113.20, which was linux1).  This change has also been made on the framebuilder, and in the framebuilder's /diskless/root/etc/resolv.conf file, which is what all of the fast front ends look to. 

On the framebuilder, and in the /diskless place for the fast front ends, presumably we must have changed something to point at the new location for the shared drive, but I don't remember how we did that [ERIC, what did we do???]

In all of the fstabs, we're using chiara's IP instead of name, so that if the nameserver part isn't working, we can still get the NFS mounts.

On control room computers, we mount the NFS through /etc/fstab having lines like:

192.168.113.104:/home/cds /cvs/cds nfs rw,bg 0 0
fb:/frames /frames nfs ro,bg 0 0

Then, things like /cvs/cds/foo are locally symlinked to /opt/foo
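
A quick way to see which NFS mounts in an fstab would still depend on name resolution is to check whether each server field is a literal IP address. The sketch below does that for the two control room lines above; nfs_servers is a hypothetical helper, and it assumes the usual whitespace-separated fstab fields with the filesystem type in the third column.

```python
import ipaddress

def nfs_servers(fstab_text):
    """Return (server, mount_point, is_ip) for each NFS line of an fstab."""
    result = []
    for line in fstab_text.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0].startswith("#") or fields[2] != "nfs":
            continue
        server = fields[0].split(":", 1)[0]
        try:
            ipaddress.ip_address(server)
            is_ip = True
        except ValueError:
            is_ip = False  # a hostname, which needs DNS or /etc/hosts
        result.append((server, fields[1], is_ip))
    return result

# The two control room fstab lines quoted in this entry.
fstab_text = """\
192.168.113.104:/home/cds /cvs/cds nfs rw,bg 0 0
fb:/frames /frames nfs ro,bg 0 0
"""

for server, mount_point, is_ip in nfs_servers(fstab_text):
    print(server, mount_point, "IP" if is_ip else "hostname")
```

Here the /cvs/cds mount is addressed by IP, while /frames still goes through the name fb.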

For the diskless machines, we edited the files in /diskless/root. On FB, /diskless/root/etc/fstab becomes

master:/diskless/root                   /         nfs     sync,hard,intr,rw,nolock,rsize=8192,wsize=8192    0 0
master:/usr                             /usr      nfs     sync,hard,intr,ro,nolock,rsize=8192,wsize=8192    0 0
master:/home                            /home     nfs     sync,hard,intr,rw,nolock,rsize=8192,wsize=8192    0 0
none                                    /proc     proc    defaults          0 0
none                                    /var/log        tmpfs   size=100m,rw    0 0
none                                    /var/lib/init.d tmpfs   size=100m,rw    0 0
none                                    /dev/pts        devpts  rw,nosuid,noexec,relatime,gid=5,mode=620        0 0
none                                    /sys            sysfs   defaults        0 0
master:/opt                             /opt      nfs    async,hard,intr,rw,nolock  0 0
192.168.113.104:/home/cds/rtcds         /opt/rtcds      nfs     nolock  0 0
192.168.113.104:/home/cds/rtapps        /opt/rtapps     nfs     nolock  0 0

("master" is defined in /diskless/root/etc/hosts to be 192.168.113.202, which is fb's IP)

and /diskless/root/etc/resolv.conf becomes:

search martian

nameserver 192.168.113.104 #Chiara
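
As a quick check that a machine really points at chiara first, a sketch like the following could list the nameservers in a resolv.conf in order. The function name is hypothetical, and the parsing is deliberately simple: it only looks at "nameserver" lines and ignores anything after a "#".

```python
def nameservers(resolv_conf_text):
    """Return the nameserver addresses in the order they appear."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split("#", 1)[0].split()   # ignore trailing comments
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers

example = """\
search martian

nameserver 192.168.113.104 #Chiara
"""

servers = nameservers(example)
print("first nameserver:", servers[0] if servers else None)
```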

 

 

  10015   Mon Jun 9 22:26:44 2014   rana, zach   Update   CDS   SLOW controls recovery

 All of the SLOW computers had been in limbo since the fileserver/nameserver change, but Zach and I brought them back.

One of the troubles was that we were unable to telnet into these computers once they failed to boot (because they had no connection to their boot server).

  1. Needed a special DB9-RJ45 cable to connect from (old) laptop serial ports to the Motorola VME162 machines (e.g. c1psl, c1iool0, c1aux, etc.); thanks to Dave Barker for sending me the details on how to make these. Tara found 2 of these that Frank or PeterK had left there and saved us a huge hassle. Most new laptops don't have a serial port, but in principle there's a way to do this by using one of our USB-Serial adapters. We didn't try this, but just used an old laptop. The RJ45 connector must go into the top connector of the bottom 4; it's labeled as 'console' on some of the VME computers. Thanks to K. Thorne for this very helpful hint and to Rolf for pointing me to KT.
  2. Installed 'minicom' on these machines to allow communication via the serial port.
  3. Had to install RSH on chiara to allow the VME computers to connect to it. Also added the names of all the slow machines in /etc/hosts.equiv to allow for password-less login. Without this they were not able to load the vxWorks binary. It was tricky to get RSH to work, since it's an insecure and deprecated service. 'rsh-server' doesn't work, but installing 'rsh-redone-server' did finally work for passwordless access. It must be that linux1 had RSH enabled, but of course, this was undocumented.
  4. Some of the SLOW machines didn't have their own target names or startup.cmd in their startup boot parameters (???). I fixed these.
  5. For C1VAC1, I have updated the boot parameters via bootChange, but I have not rebooted it. Waiting to do so when Koji and Steve are both around. We should make sure to not forget doing this on C1VAC2. Steve always tells us that it never works, but actually it does. It just crashes every so often.
  6. Leaving C1AUXEX and C1AUXEY for Q and Jacy to do, to see if this ELOG is good enough.
  7. The PSL crate still starts up with a SysFail light turned on red, but that doesn't seem to bother the c1psl operation. We (Steve) should go around and put a label on all the crates where SysFail is lit during 'normal' operation. Misleading warning lights are a bad thing.

We still don't have complete control of the MC Servo board, so we need the morning crew to start checking that out.

An example session (using telnet, not the laptop/serial way) where we use bootChange to examine the correct c1aux config:

controls@pianosa|target> telnet c1aux
Trying 192.168.113.61...
Connected to c1aux.martian.
Escape character is '^]'.

c1aux > bootChange

'.' = clear field;  '-' = go to previous field;  ^D = quit

boot device          : ei
processor number     : 0
host name            : chiara
file name            : /cvs/cds/vw/mv162-262-16M/vxWorks
inet on ethernet (e) : 192.168.113.61:ffffff00
inet on backplane (b):
host inet (h)        : 192.168.113.104
gateway inet (g)     :
user (u)             : controls
ftp password (pw) (blank = use rsh):
flags (f)            : 0x0
target name (tn)     : c1aux
startup script (s)   : /cvs/cds/caltech/target/c1aux/startup.cmd
other (o)            :

value = 0 = 0x0
c1aux >
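
For checking many slow machines, a sketch along these lines could turn the bootChange printout into a dictionary and flag any machine that still points at the old server. The function names are hypothetical, and the expected values are just the two fields this entry says needed changing.

```python
def parse_boot_params(text):
    """Map each 'name : value' line of a bootChange printout to a dict entry."""
    params = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, value = line.split(":", 1)   # split on the first ':' only
        params[key.strip()] = value.strip()
    return params

# The two fields that had to change when the boot server moved to chiara.
EXPECTED = {"host name": "chiara", "host inet (h)": "192.168.113.104"}

def mismatches(params, expected=EXPECTED):
    """Return the fields whose value differs from the expected one."""
    return {k: params.get(k) for k, v in expected.items() if params.get(k) != v}

example = """\
boot device          : ei
host name            : chiara
inet on ethernet (e) : 192.168.113.61:ffffff00
host inet (h)        : 192.168.113.104
target name (tn)     : c1aux
"""

params = parse_boot_params(example)
print(mismatches(params) or "boot parameters point at chiara")
```

Splitting on the first colon only keeps values such as 192.168.113.61:ffffff00 intact.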

  10016   Mon Jun 9 22:40:36 2014   Jenne   Update   CDS   Fast front end computers up

Rana and I now seem to have the fast front end computers (c1lsc, c1sus, c1ioo, c1iscex and c1iscey) up and running!  Hooray!

It seemed that we needed to replace the soft links for rtcds and rtapps on the front end machines with real directories that serve as NFS mount points.  On c1ioo, we did:

cd /opt
sudo rm -rf rtcds
sudo rm -rf rtapps
sudo mkdir rtcds
sudo mkdir rtapps
sudo chown controls:1001 rtcds
sudo chown controls:1001 rtapps
mount /opt/rtcds
mount /opt/rtapps
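
To see at a glance which state a front end is in, a sketch like this could classify /opt/rtcds and /opt/rtapps as a soft link, a mounted directory, or an empty mount point. The function name describe is hypothetical, and the demonstration runs in a temporary directory rather than touching /opt.

```python
import os
import tempfile

def describe(path):
    """Classify a path as symlink, mount point, plain directory, or missing."""
    if os.path.islink(path):               # check first: a link may point at a mount
        return "symlink -> " + os.readlink(path)
    if os.path.ismount(path):
        return "mount point"
    if os.path.isdir(path):
        return "plain directory"
    return "missing"

with tempfile.TemporaryDirectory() as tmp:
    real = os.path.join(tmp, "rtcds")      # stands in for a fresh mkdir'ed mount point
    link = os.path.join(tmp, "rtapps")     # stands in for the old soft link
    os.mkdir(real)
    os.symlink(real, link)
    for path in (real, link, os.path.join(tmp, "nothing")):
        print(os.path.basename(path), "->", describe(path))
```

On a front end one would call describe("/opt/rtcds") and describe("/opt/rtapps") directly; after a successful mount both should report "mount point".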

At this time, the front end fstab had several other options in addition to "nolock" for both rtcds and rtapps: rw,bg,user,nolock.  This state still had some permissions problems.  (Later, we decided that our next step was perhaps unnecessary, since it still left us with (fewer) permissions problems.  Taking the rw,bg,user options out of the front end fstab seems to have fixed all permissions issues, so maybe this chmod step didn't need to be done.  But it was done, so I record it for completeness.)

On chiara, we did:

cd /home/cds/rtcds
sudo chmod -R 777 *

Then on c1iscex, I didn't have to deal with the soft links, but I did need to mount the rtcds and rtapps directories so that I could see files in them.  I just did the last 2 operations from the c1ioo list above (mount /opt/rtcds and mount /opt/rtapps). 

Since we were still seeing some (fewer) permissions problems, we took out the extra options in the front ends' fstab that Rana had added.  Rebooting c1iscex after this, everything came back as expected.  Nice!

I think that, at this point, remotely rebooting (sudo shutdown -r now) the other front ends made everything come back nicely. Since we had gotten the fstab situation correct, we didn't have to by-hand mount any directories, and all of the models restarted on their own.  Finally!

 

 


For posterity, here are things that we'll want to remember:

Frame builder's fstab, in /etc/fstab (only the uncommented lines, since there are lots of comments):

/dev/sdb1               /               ext3            noatime         0 1
/swapfile               none            swap            sw              0 0
shm                     /dev/shm        tmpfs           nodev,nosuid,noexec     0 0
/dev/sda1               /frames         ext3    noatime         0 0
192.168.113.104:/home/cds/                      /cvs/cds        nfs     _netdev,auto,rw,bg,soft      0 0
192.168.113.104:/home/cds/rtcds                  /opt/rtcds     nfs     _netdev,auto,rw,bg,soft 0 0
192.168.113.104:/home/cds/rtapps                 /opt/rtapps    nfs     _netdev,auto,rw,bg,soft 0 0

Fast front end fstabs, which are on the framebuilder in /diskless/root/etc/fstab:

master:/diskless/root                   /               nfs     sync,hard,intr,rw,nolock,rsize=8192,wsize=8192    0 0
master:/usr                             /usr            nfs     sync,hard,intr,ro,nolock,rsize=8192,wsize=8192    0 0
master:/home                            /home           nfs     sync,hard,intr,rw,nolock,rsize=8192,wsize=8192    0 0
none                                    /proc           proc    defaults          0 0
none                                    /var/log        tmpfs   size=100m,rw    0 0
none                                    /var/lib/init.d tmpfs   size=100m,rw    0 0
none                                    /dev/pts        devpts  rw,nosuid,noexec,relatime,gid=5,mode=620        0 0
none                                    /sys            sysfs   defaults        0 0
master:/opt                             /opt            nfs     async,hard,intr,rw,nolock  0 0
192.168.113.104:/home/cds/rtcds         /opt/rtcds      nfs     nolock                     0 0
192.168.113.104:/home/cds/rtapps        /opt/rtapps     nfs     nolock                     0 0
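
Since the extra mount options were what seemed to cause the permissions trouble, a small sketch like this could pull the option list for one mount point out of each fstab and show the difference. The function name mount_options is hypothetical; the two example lines are the /opt/rtcds entries recorded above.

```python
def mount_options(fstab_text, mount_point):
    """Return the option list for mount_point, or None if it is not listed."""
    for line in fstab_text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 4 and fields[1] == mount_point:
            return fields[3].split(",")
    return None

frame_builder = "192.168.113.104:/home/cds/rtcds  /opt/rtcds  nfs  _netdev,auto,rw,bg,soft 0 0"
front_end = "192.168.113.104:/home/cds/rtcds  /opt/rtcds  nfs  nolock  0 0"

fb_opts = mount_options(frame_builder, "/opt/rtcds")
fe_opts = mount_options(front_end, "/opt/rtcds")
print("only on fb:", sorted(set(fb_opts) - set(fe_opts)))
print("only on front ends:", sorted(set(fe_opts) - set(fb_opts)))
```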

  10018   Tue Jun 10 09:25:29 2014   Jamie   Update   CDS   Computer status: should not be changing names

I really think it's a bad idea to be making all these name changes.  You're making things much much harder for yourselves.

Instead of repointing everything to a new host, you should have just changed the DNS to point the name "linux1" to the IP address of the new server.  That way you wouldn't need to reconfigure all of the clients.  That's the whole point of name service: use a name so that you don't need to point to a number.

Also, pointing to an IP address for this stuff is not a good idea.  If the IP address of the server changes, everything will break again.

Just point everything to linux1, and make the DNS entries for linux1 point to the IP address of chiara.  You're doing all this work for nothing!

RXA: Of course, I understand what DNS means. I wanted to make the changes to the startup to remove any misconfigurations or spaghetti mount situations (of which we found many). The way the VME162 are designed, changing the name doesn't make the fix - it uses the number instead. And, of course, the main issue was not the DNS, but just that we had to set up RSH on the new machine. This is all detailed in the ELOG entries we've made, but it might be difficult to understand remotely if you are not familiar with the 40m CDS system.

  10025   Wed Jun 11 14:36:57 2014   Jenne   Update   CDS   SLOW controls recovery

 

 I have brought back c1auxex and c1auxey.  Hopefully this elog will have some more details to add to Rana's elog 10015, so that in the end, we have the whole process documented.

The old Dell computer was already in a Minicom session, so I didn't have to start that up - hopefully it's just as easy as opening the program.  (Edit, JCD, 9July2014:  Start up a terminal session, and then type "minicom" and press enter to get a Minicom session).

I plugged the DB9-RJ45 cable into the top of the RJ45 jacks on the computers.  Since the aux end station computers hadn't had their bootChanges done yet, the prompt was "VxWorks Boot".  For a computer that was already configured, for example the psl machine, the prompt was "c1psl", the name of the machine.  So, the indication that work needs to be done is either you get the Boot prompt, or the computer starts to hang while it's trying to load the operating system (since it's not where the computer expects it to be).  If the computer is hanging, key the crate again to power cycle it.  When it gets to the countdown that says "press any key to enter manual boot" or something like that, push some key.  This will get you to the "VxWorks Boot" prompt. 

Once you have this prompt, press "?" to get the boot help menu.  Press "p" to print the current boot parameters (the same list of things that you see with the bootChange command when you telnet in).  Press "c" to go line-by-line through the parameters with the option to change parameters.  I discovered that you can just type what you want the parameter to be next to the old value, and that will change the value.  (ex.  "host name   : linux1   chiara"   will change the host name from the old value of linux1 to the new value that you just typed of chiara). 

After changing the appropriate parameters (as with all the other slow computers, just the [host name] and the [host inet] parameters needed changing), key the crate one more time and let it boot.  It should boot successfully, and when it has finished and given you the name for the prompt (ex. c1auxex), you can just pull out the RJ45 end of the cable from the computer, and move on to the next one.

 

  10027   Wed Jun 11 15:57:18 2014   Jenne   Update   CDS   Note on cables for talking to slow computers

We have (now) in the lab 2 cables that are RJ45-DB9.  The gray one is LIGO-made, while the blue one is store-bought.  

The gray LIGO-made one works, but the blue store-bought one does not.  I checked their pinouts, and they are completely different.  In the sketch below, the connectors are drawn as I see them face-on, with the cables going out the back of the page.  The DB9 is female. 

06111401.PDF

  10028   Wed Jun 11 16:01:31 2014   Steve   Update   CDS   c1Vac1 and c1vac2 rebooted

Quote:

 

 I have brought back c1auxex and c1auxey.  Hopefully this elog will have some more details to add to Rana's elog 10015, so that in the end, we have the whole process documented.

The old Dell computer was already in a Minicom session, so I didn't have to start that up - hopefully it's just as easy as opening the program.

I plugged the DB9-RJ45 cable into the top of the RJ45 jacks on the computers.  Since the aux end station computers hadn't had their bootChanges done yet, the prompt was "VxWorks Boot" (or something like that).  For a computer that was already configured, for example the psl machine, the prompt was "c1psl", the name of the machine.  So, the indication that work needs to be done is either you get the Boot prompt, or the computer starts to hang while it's trying to load the operating system (since it's not where the computer expects it to be).  If the computer is hanging, key the crate again to power cycle it.  When it gets to the countdown that says "press any key to enter manual boot" or something like that, push some key.  This will get you to the "VxWorks Boot" prompt. 

Once you have this prompt, press "?" to get the boot help menu.  Press "p" to print the current boot parameters (the same list of things that you see with the bootChange command when you telnet in).  Press "c" to go line-by-line through the parameters with the option to change parameters.  I discovered that you can just type what you want the parameter to be next to the old value, and that will change the value.  (ex.  "host name   : linux1   chiara"   will change the host name from the old value of linux1 to the new value that you just typed of chiara). 

After changing the appropriate parameters (as with all the other slow computers, just the [host name] and the [host inet] parameters needed changing), key the crate one more time and let it boot.  It should boot successfully, and when it has finished and given you the name for the prompt (ex. c1auxex), you can just pull out the RJ45 end of the cable from the computer, and move on to the next one.

 

 Koji, Jenne and Steve

 

Preparation to reboot:

1, closed VA6 and V5, disconnected the cables to these valves ( closed all annuloses )

2, closed V1, disconnected it and stopped Maglev rotation

3, closed V4, disconnected its cable

   See Atm1.  This setup ensured that there could not be any accidental valve switching to vent the vacuum envelope if reboot chaos struck. [moving=disconnected]

4, RESET c1Vac1 and c1Vac2 one by one and together. They both went at once. We did NOT power cycle.

    Jenne entered the new "carma" words on the old Dell laptop and checked that the answers were good. The reboot was done.

    Note: c1Vac1 green-RUN indicator LED is yellow. It is fine as yellow.

5, Checked and TOGGLED valve positions to be correct value ( We did not correct the small turbo pumps monitor positions, but they were alive )

6,  V4 was reconnected and opened. Maglev was started.

7,  V1 cable reconnected and opened at full rotation speed of 560 Hz

8,  V5 cable reconnected,  valve opened..............VA6 cable connected and opened........

9,   Vacuum Normal valve configuration was reached.

 

  10033   Thu Jun 12 15:31:47 2014   Jamie   Update   CDS   Note on cables for talking to slow computers

Quote:

We have (now) in the lab 2 cables that are RJ45-DB9.  The gray one is LIGO-made, while the blue one is store-bought.  

The gray LIGO-made one works, but the blue store-bought one does not.  I checked their pinouts, and they are completely different.  In the sketch below, the connectors are drawn as I see them face-on, with the cables going out the back of the page.  The DB9 is female. 

 There are RJ45-DB9 adapters in the big spinny rack next to the linux1 rack that are for this exact purpose.  Just use a standard ethernet cable with them.

  10040   Sun Jun 15 14:26:30 2014   Jamie   Omnistructure   CDS   cdsutils re-installed

Quote:

 CDSUTILS is also gone from the path on all the workstations, so we need Jamie to tell us by ELOG how to set it up, or else we have to use ezcaread / ezcawrite forever.

It's in the elog already: http://nodus.ligo.caltech.edu:8080/40m/9922

But it seems like things still haven't fully recovered, or have recovered to an old state?  Why is the cdsutils install I previously did in /ligo/apps now missing?  It seems like other directories are missing as well.

There's also a user:group issue with the /home/cds mounts.  Everything in those mount points is owned nobody:nogroup.

I also can't log into pianosa and rosalba.

  10041   Sun Jun 15 14:41:08 2014   Jamie   Omnistructure   CDS   cdsutils re-installed

Quote:

Quote:

 CDSUTILS is also gone from the path on all the workstations, so we need Jamie to tell us by ELOG how to set it up, or else we have to use ezcaread / ezcawrite forever.

It's in the elog already: http://nodus.ligo.caltech.edu:8080/40m/9922

But it seems like things still haven't fully recovered, or have recovered to an old state?  Why is the cdsutils install I previously did in /ligo/apps now missing?  It seems like other directories are missing as well.

There's also a user:group issue with the /home/cds mounts.  Everything in those mount points is owned nobody:nogroup.

I also can't log into pianosa and rosalba.

 I also still think it's a bad idea for everything to be mounting /home/cds from an IP address.  Just make a new DNS entry for linux1 and leave everything as it was.

  10057   Wed Jun 18 13:39:01 2014   rana   Update   CDS   cdsutils reverted to version 238

 

 After some email consult with Jamie, I got cdsutils working again by reverting to v238. The newest versions are not compatible with our python 2.6 on Ubuntu 10. And our Ubuntu 12 machines do not have NDS2 clients that work yet.

The read/write commands work at the moment, but the NDS based ones don't yet work on pianosa, possibly because of some NDSSERVER flag/setup issue. Jamie ??

controls@pianosa|~ > z
usage: cdsutils <cmd> <args>

Advanced LIGO Control Room Utilites

Available commands:

  read         read EPICS channel value
  write        write EPICS channel value
  switch       switch buttons in standard LIGO filter module
  avg          average one or more NDS channels for a specified amount of seconds
  servo        servos channel with a simple integrator (pole at zero)
  trigservo    servos channel with a simple integrator (pole at zero)

  version      print version info and exit
  help         this help

Add '-h' after individual commands for command help.

controls@pianosa|~ > z read C1:LSC-DARM_GAIN
0.0
controls@pianosa|~ 2> z avg 3 C1:IOO-MC_F
Error in write(): Connection refused
Error in write(): Connection refused
NDSSERVER variable incorrectly defined, or no NDS servers available.
controls@pianosa|~ 1> echo $NDSSERVER
fb
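
One way to narrow down the "Connection refused" error is to see what the NDSSERVER value actually names and whether anything is listening there. The sketch below assumes NDSSERVER is a comma-separated list of host or host:port items; the function names and the second host in the example are hypothetical, and the ports mentioned in the comments (8088 for NDS1, 31200 for NDS2) are my understanding of the usual defaults, not something this log confirms.

```python
import socket

def parse_ndsserver(value):
    """Split an NDSSERVER-style string into (host, port) pairs; port is None if absent."""
    servers = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        host, _, port = item.partition(":")
        servers.append((host, int(port) if port else None))
    return servers

def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(parse_ndsserver("fb"))
# Hypothetical value with explicit ports (assumed defaults: 8088 NDS1, 31200 NDS2).
print(parse_ndsserver("fb:8088, nds.example:31200"))
```

On the martian network one could then call can_connect("fb", 8088) to see whether the refusal comes from the server side or from how the variable is set.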

  10067   Wed Jun 18 22:47:48 2014   ericq   Update   CDS   Raspberry pi added to martian network

I set up a raspberry pi on the martian network, to be hooked up to a frequency counter for tracking ALS beatnotes. 

The instructions at https://wiki-40m.ligo.caltech.edu/Martian_Host_Table are outdated: the name server configuration is now at /etc/bind/zones/martian.db. I need to remember to update the wiki soon. 

In any case, the raspberry pi is called "domenica," is found at 192.168.113.107, and has the standard controls user, with /cvs/cds mounted in the same way as the control room machines. 

Once I'm comfortable with the configuration of the pi, I'm going to take an image of the SD card that serves as its hard drive, so that we can just image new cards for future raspberry pis on the martian network if we ever want them. 

  10088   Mon Jun 23 20:58:38 2014   ericq   Update   CDS   Bootfest 2014!

This afternoon, I wanted to start the nominal alignment/adjustment steps for evening time locking, but got sucked into CDS frustrations. 

Primary symptom: TRX and TRY signals were not making it from C1:SUS-ETMX_TR[X,Y]_OUT to C1:LSC-TR[X,Y]_IN1. Various RFM bits were red on the CDS status page. 

 Secondary symptom: ITMX was randomly getting a good sized kick for no apparent reason. I still don't know what was behind this. 

First fix attempt: run sudo ntpdate -b -s -u pool.ntp.org on c1sus and c1lsc front ends, to see if NTP issues were responsible. No result.

Second fix attempt: Restart c1lsc, c1sus and c1rfm models. No change.

Next fix attempt: Restart c1lsc and c1sus frontend machines. c1lsc models come back; c1sus models fail to sync / time out, and dmesg has some weird message about ADC channel hopping. At this point, c1ioo, c1iscey and c1iscex all have their models stop working due to sync problems. 

I then ran the above ntp command on all front ends and the FB, and restarted everyone's models (except c1lsc, who stayed working from here on out) which didn't change anything. I command-line rebooted all front ends (except c1lsc) and the FB (which had some dmesg messages about daqd segfaulting, but daqd issues weren't the problem). Still nothing. 

Finally, Koji came along and relieved me from my agony by hard rebooting all of the front ends; pulling out their power cables and seeing the life in their lights fade away... He did this first with the end station machines (c1iscey and c1iscex), and we saw them come back up perfectly happy, and then c1ioo and c1sus followed. At this point, all models came back; green RFM bits abounding, and TR[X,Y] signals propagating through as desired. 

Then, we tried turning the damping/watchdogs back on, which for some strange reason started shaking the hell out of everyone except the ETMs and ITMX. We restarted c1sus and c1mcs, and then damping worked again. Maybe a bad BURT restore was to blame?

At this point, all models were happy, all optics were damped, mode cleaner + WFS locked happily, but no beams were to be seen in the IFO.

The Yarm green would lock fine though, so tip-tilt alignment is probably to blame. I then left the interferometer to Jenne and Koji. 

  10109   Fri Jun 27 20:52:30 2014   Koji   Update   CDS   OTTAVIA was not on network

I came into the lab and found a bunch of white EPICS boxes on ottavia.
It turned out that only ottavia had been kicked off the network.

After some struggle, I figured out that ottavia needs its ethernet cable unplugged and plugged back in
to connect (or reconnect) to the network.

For some unknown reason, ottavia was isolated from the martian network and couldn't come back.
This caused the MC autolocker to freeze.

I logged in to megatron from ottavia, and ran the following in .../scripts/MC:

nohup ./AutoLockMC.csh &

Now the MC is happy.

  10135   Mon Jul 7 13:44:21 2014   Jenne   Update   CDS   c1sus - bad fb connection

Quote:

 

I managed to recover c1sus.  It required stopping all the models, and the restarting them one-by-one:

$ rtcds stop all     # <-- this does the right thing and stops all the models with the IOP stopped last, so they will all unload properly.

$ rtcds start iop

$ rtcds start c1sus c1mcs c1rfm

I have no idea why the c1sus models got wedged, or why restarting them in this way fixed the issue.

 In addition to needing obnoxiously regular mxstream restarts, this afternoon the sus machine was doing something slightly different.  Only 1 fb block per core was red (the mxstream symptom is that 3 fb-related blocks are red per core), and restarting the mxstream didn't help.  Anyhow, I was searching through the elog, and the entry to which I'm replying had similar symptoms.  However, by the time I went back to the CDS FE screen, c1sus had the regular mxstream symptoms, and an mxstream restart fixed things right up. 

So, I don't know what the issue is or was, nor do I know why it is fixed.  It's fine for now, but I wanted to make a note for the future.

  10165   Wed Jul 9 17:08:29 2014   Jenne   Update   CDS   c1auxex "Unknown Host"

c1auxex has forgotten who it is.  Slow sliders for the QPD head were not responding, so I did a soft reboot from telnet. The machine didn't come back, so I plugged the RJ45-DB9 cable into the machine and looked at it through a minicom session.  When I key the crate, it gives me an error that it can't load a file, with the error code 0x320001.  Looking that up on a List of VxWorks error codes, I see that it is:    S_hostLib_UNKNOWN_HOST (3276801 or 0x320001)

I'm not sure how this happened.  I unplugged and replugged in the ethernet cable on the computer, but that didn't help.  Rana is going in to wiggle the other end of the ethernet cable, in case that's the problem.  EDIT:  Replacing the ethernet cable did not help.

Former elogs that are useful:  10025, 10015

EDIT:  The actual error message is:

boot device          : ei                                                      
processor number     : 0                                                       
host name            : chiara                                                  
file name            : /cvs/cds/vw/mv162-262-16M/vxWorks                       
inet on ethernet (e) : 192.168.113.59:ffffff00                                 
host inet (h)        : 192.168.113.104                                         
user (u)             : controls                                                
flags (f)            : 0x0                                                     
target name (tn)     : c1auxex                                                 
startup script (s)   : /cvs/cds/caltech/target/c1auxex/startup.cmd             
                                                                               
Attaching network interface ei0... done.                                       
Attaching network interface lo0... done.                                       
Loading...                                                                     
Error loading file: errno = 0x320001.                                          
Can't load boot file!! 
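
For reading codes like this without the lookup table, here is a small sketch of the decoding. It assumes the usual VxWorks convention that the upper 16 bits of an errno carry a module number and the lower 16 bits a module-specific error number; that convention is my understanding, and the only thing this log confirms is that 0x320001 (3276801) is S_hostLib_UNKNOWN_HOST. The function name is hypothetical.

```python
def decode_vxworks_errno(errno):
    """Split a VxWorks-style errno into (module number, error number)."""
    module = errno >> 16       # assumed: upper 16 bits identify the library
    code = errno & 0xFFFF      # assumed: lower 16 bits are the error within it
    return module, code

module, code = decode_vxworks_errno(0x320001)
print(f"errno 0x320001 = {0x320001}: module {module} (0x{module:x}), error {code}")
```

Under that assumption, module 0x32 would be hostLib and error 1 its "unknown host" code, which matches the table entry quoted above.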

  10181   Fri Jul 11 08:11:56 2014   Steve   Update   CDS   ETMX needs help
  10188   Fri Jul 11 22:02:52 2014   Jamie, Chris   Omnistructure   CDS   cdsutils: multifarious upgrades

To make the latest cdsutils available in the control room, we've done the following:

Upgrade pianosa to Ubuntu 12 (cdsutils depends on python2.7, not found in the previous release)

  • Enable distribution upgrades in the Ubuntu Software Center prefs
  • Check for updates in the Update Manager and click the big "Upgrade" button

Note that rossa remains on Ubuntu 10 for now.

Upgrade cdsutils to r260

  • Instructions here
  • cdsutils-238 was left as the default pointed to by the cdsutils symlink, for rossa's sake

Built and installed the nds2-client (a cdsutils dependency)

  • Checked out the source tree from svn into /ligo/svncommon/nds2
  • Built tags/nds2_client_0_10_5 (install instructions are here; build dependencies were installed by apt-get on chiara)
  • ./configure --prefix=/ligo/apps/ubuntu12/nds2-client-0.10.5; make; make install
  • In /ligo/apps/ubuntu12: ln -s nds2-client-0.10.5 nds2-client
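
The versioned-directory-plus-symlink layout used for nds2-client here can be sketched as follows. The function name point_symlink is hypothetical, and unlike a plain ln -s it also replaces an existing link, which is a design choice of this sketch and not something we did above. The demonstration runs in a temporary directory, not in /ligo/apps.

```python
import os
import tempfile

def point_symlink(apps_dir, link_name, version_dir):
    """Make apps_dir/link_name a relative symlink to version_dir, replacing any old link."""
    link = os.path.join(apps_dir, link_name)
    tmp_link = link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(version_dir, tmp_link)   # relative target, like `ln -s nds2-client-0.10.5 nds2-client`
    os.replace(tmp_link, link)          # rename over the old link in one step
    return os.readlink(link)

with tempfile.TemporaryDirectory() as apps:
    os.mkdir(os.path.join(apps, "nds2-client-0.10.5"))
    print("nds2-client ->", point_symlink(apps, "nds2-client", "nds2-client-0.10.5"))
```

Keeping each version in its own directory means an upgrade is only a change of link target, and reverting is the same operation pointed at the old directory.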

nds2-client was apparently installed locally as a deb in the past, but the version in lscsoft seems broken currently (unknown symbols?). We should revisit this.

Built and installed pyepics (a cdsutils dependency)

  • Download link to ~/src on chiara
  • python setup.py build; python setup.py install --prefix=/ligo/apps/ubuntu12/pyepics-3.2.3
  • In /ligo/apps/ubuntu12: ln -s pyepics-3.2.3 pyepics

pyepics was also installed as deb before; should revisit when Jamie gets back.

Added the gqrx ppa and installed gnuradio (dependency of the waterfall plotter)

Added a test in /ligo/apps/ligoapps-user-env.sh to load the new cdsutils only on Ubuntu 12.
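
The real test lives in the shell environment script; the following is only a Python sketch of the same decision, assuming the release is read from the DISTRIB_RELEASE line of /etc/lsb-release. The function names, the directory name cdsutils-260, and the "12 or newer" threshold are assumptions for illustration (the log only names cdsutils-238 and r260).

```python
def distrib_release(lsb_release_text):
    """Return the DISTRIB_RELEASE value from lsb-release style text, or None."""
    for line in lsb_release_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "DISTRIB_RELEASE":
            return value.strip().strip('"')
    return None

def pick_cdsutils(release, new="cdsutils-260", old="cdsutils-238"):
    """Choose the new cdsutils on Ubuntu 12 or later, the old default otherwise."""
    if release is None:
        return old                      # unknown system: stay on the safe default
    major = int(release.split(".")[0])
    return new if major >= 12 else old

print(pick_cdsutils(distrib_release("DISTRIB_ID=Ubuntu\nDISTRIB_RELEASE=12.04\n")))
print(pick_cdsutils(distrib_release("DISTRIB_ID=Ubuntu\nDISTRIB_RELEASE=10.04\n")))
```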

The end result:

controls@chiara|~ > z
usage: cdsutils <cmd> <args>

Advanced LIGO Control Room Utilites

Available commands:

  read         read EPICS channel value
  write        write EPICS channel value
  switch       switch buttons in standard LIGO filter module
  avg          average one or more NDS channels for a specified amount of seconds
  servo        servos channel with a simple integrator (pole at zero)
  trigservo    servos channel with a simple integrator (pole at zero)
  audio        Play channel as audio stream.
  dv           Plot time series of channels from NDS.
  water        Live waterfall plotter for LIGO data

  version      print version info and exit
  help         this help

Add '-h' after individual commands for command help.
 
 
  10189   Fri Jul 11 22:28:34 2014   Chris   Update   CDS   c1auxex "Unknown Host"

Quote:

c1auxex has forgotten who it is.  Slow sliders for the QPD head were not responding, so I did a soft reboot from telnet. The machine didn't come back, so I plugged the RJ45-DB9 cable into the machine and looked at it through a minicom session.  When I key the crate, it gives me an error that it can't load a file, with the error code 0x320001.  Looking that up on a List of VxWorks error codes, I see that it is:    S_hostLib_UNKNOWN_HOST (3276801 or 0x320001)

I'm not sure how this happened.  I unplugged and replugged in the ethernet cable on the computer, but that didn't help.  Rana is going in to wiggle the other end of the ethernet cable, in case that's the problem.  EDIT:  Replacing the ethernet cable did not help.

Former elogs that are useful:  10025, 10015

EDIT:  The actual error message is:

boot device          : ei                                                      
processor number     : 0                                                       
host name            : chiara                                                  
file name            : /cvs/cds/vw/mv162-262-16M/vxWorks                       
inet on ethernet (e) : 192.168.113.59:ffffff00                                 
host inet (h)        : 192.168.113.104                                         
user (u)             : controls                                                
flags (f)            : 0x0                                                     
target name (tn)     : c1auxex                                                 
startup script (s)   : /cvs/cds/caltech/target/c1auxex/startup.cmd             
                                                                               
Attaching network interface ei0... done.                                       
Attaching network interface lo0... done.                                       
Loading...                                                                     
Error loading file: errno = 0x320001.                                          
Can't load boot file!! 

 We fixed this problem (at least for now) by adding c1auxex to the /etc/hosts file on chiara (following a hint from this page). The DNS setup might be the culprit here.
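
To find other slow machines that might hit the same error, a sketch like this could check whether a name has an entry in hosts-file text. The function name is hypothetical and the example lines are illustrative, not a copy of chiara's actual /etc/hosts; the c1auxex address is the one from its boot parameters above.

```python
def hosts_lookup(hosts_text, name):
    """Return the address that hosts-file text assigns to name, or None."""
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()   # address followed by one or more names
        if len(fields) >= 2 and name in fields[1:]:
            return fields[0]
    return None

example = """\
127.0.0.1       localhost
192.168.113.59  c1auxex c1auxex.martian
"""

for host in ("c1auxex", "c1auxey"):
    print(host, "->", hosts_lookup(example, host))
```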

  10339   Wed Aug 6 13:17:21 2014   ericq   Omnistructure   CDS   cdsutils: multifarious upgrades

I've checked out cdsutils-274 to /opt/rtcds/cdsutils, and updated the /ligo/apps/ligoapps-user-env.sh to have the newer machines use it by default. This was to gain access to the cdsutils.Step methods for use in the smooth ASS handoffs script. 

  10426   Fri Aug 22 18:00:08 2014   jamie   Omnistructure   CDS   ubuntu12 awgstream installed

I installed awgstream-2.16.14 in /ligo/apps/ubuntu12.  As with all the ubuntu12 "packages", you need to source the ubuntu12 ligoapps environment script:

controls@pianosa|~ > . /ligo/apps/ubuntu12/ligoapps-user-env.sh
controls@pianosa|~ > which awgstream
/ligo/apps/ubuntu12/awgstream-2.16.14/bin/awgstream
controls@pianosa|~ > 

I tested it on the SRM LSC filter bank.  In one terminal I ran camonitor on C1:SUS-SRM_LSC_OUTMON.  In another terminal I ran the following:

controls@pianosa|~ > seq 0 .1 16384  | awgstream C1:SUS-SRM_LSC_EXC 16384 -
Channel = C1:SUS-SRM_LSC_EXC
File    = -
Scale   =          1.000000
Start   = 1092790384.000000
controls@pianosa|~ > 

The camonitor output was:

controls@pianosa|~ > camonitor C1:SUS-SRM_LSC_OUTMON
C1:SUS-SRM_LSC_OUTMON          2014-08-22 17:44:50.997418 0  
C1:SUS-SRM_LSC_OUTMON          2014-08-22 17:52:49.155525 218.8  
C1:SUS-SRM_LSC_OUTMON          2014-08-22 17:52:49.393404 628.4  
C1:SUS-SRM_LSC_OUTMON          2014-08-22 17:52:49.629822 935.6  
...
C1:SUS-SRM_LSC_OUTMON          2014-08-22 17:52:58.210810 15066.8  
C1:SUS-SRM_LSC_OUTMON          2014-08-22 17:52:58.489501 15476.4  
C1:SUS-SRM_LSC_OUTMON          2014-08-22 17:52:58.747095 15886  
C1:SUS-SRM_LSC_OUTMON          2014-08-22 17:52:59.011415 0 

In other words, it seems to work.
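
As a sanity check on the timing, the arithmetic behind the ramp can be sketched as below. It assumes the second awgstream argument (16384) is the sample rate in Hz and that seq emits both end points; the function name is hypothetical.

```python
def ramp_summary(start, step, stop, rate_hz):
    """Return (number of samples, duration in s, slope in counts/s) for a seq-style ramp."""
    n_samples = round((stop - start) / step) + 1   # both end points included
    duration_s = n_samples / rate_hz
    slope_per_s = step * rate_hz
    return n_samples, duration_s, slope_per_s

n, duration, slope = ramp_summary(0, 0.1, 16384, 16384)
print(f"{n} samples, {duration:.3f} s, {slope:.1f} counts/s")
```

A ramp of about 10 s ending near 16384 counts is consistent with the camonitor output, where the values climb from 17:52:49 to 17:52:58 and drop back to 0 at 17:52:59.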
