ID   Date   Author   Type   Category   Subject
  9982   Wed May 21 13:18:47 2014   ericq   Update   CDS   Suspension MEDM Bug

I fixed a bug in the SUS_SINGLE screen, where the total YAW output was incorrectly displayed (TO_COIL_3_1 instead of TO_COIL_1_3). I noticed this by seeing that the yaw bias slider had no effect on the number that claimed to be the yaw sum. The first time I did this, I accidentally changed the screen size a bit, which smushed things together, but that's fixed now.

I committed it to the svn, along with some uncommitted changes to the oplev servo screen.

  10010   Mon Jun 9 11:42:00 2014   Jenne   Update   CDS   Computer status

Current computer status:

All fast machines except c1iscey are up and running. I can't ssh to c1iscey, so I'll need to go down to the end station and have a look-see. On the c1lsc machine, neither the c1oaf nor the c1cal models are running (but for the oaf model, we know that this is because we need to revert the blrms block changes to some earlier version, see Jamie's elog 9911).

The daqd process is running on the framebuilder.  However, when I try to open dataviewer, I get the popup error saying "Can't connect to rb", as well as an error in the terminal window that said something like "Error getting chan info".

Slow machines c1psl, c1auxex and c1auxey are not running (can't telnet to them, and white boxes on related medm screens for slow channels).  All other slow machines seem to be running, however nothing has been done to them to point them at the new location of the shared hard drive, so their status isn't ready to green-light yet.


Things that we did on Friday for the fast machines:

The shared hard drive is "physically" on Chiara, at /home/cds/.  Links are in place so that it looks like it's at the same place that it used to be:  /opt/rtcds/...... 

The first nameserver on all of the workstation machines inside of the file /etc/resolv.conf has been changed to be 192.168.113.104, which is Chiara's IP address (it used to be 192.168.113.20, which was linux1).  This change has also been made on the framebuilder, and in the framebuilder's /diskless/root/etc/resolv.conf file, which is what all of the fast front ends look to. 
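
For example, the edit on each machine amounted to something like this (a sketch; editing the file by hand works just as well):

sudo sed -i 's/^nameserver 192.168.113.20/nameserver 192.168.113.104/' /etc/resolv.conf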

On the framebuilder, and in the /diskless place for the fast front ends, presumably we must have changed something to point at the new location for the shared drive, but I don't remember how we did that [ERIC, what did we do???]


The slow front ends that we have tried changing have not worked out. 

First, we tried plugging a keyboard and monitor into c1auxey.  When we key the crate to reboot the machine, we get an error message about a "disk A drive error", and then it prompts us to press F1 for something, or F2 to enter setup.  No matter what we press, nothing happens.  c1auxey is still not running.

We were able to telnet into c1auxex, c1psl, and c1iool0.  On each of those machines, at the prompt, we used the command "bootChange".  This initially gives us a series of prompts, as in the following session:

$ telnet c1susaux
Trying 192.168.113.55...
Connected to c1susaux.
Escape character is '^]'.

c1susaux > bootChange

'.' = clear field;  '-' = go to previous field;  ^D = quit

boot device          : ei
processor number     : 0
host name            : linux1
file name            : /cvs/cds/vw/mv162-262-16M/vxWorks
inet on ethernet (e) : 192.168.113.55:ffffff00
inet on backplane (b):
host inet (h)        : 192.168.113.20
gateway inet (g)     :
user (u)             : controls
ftp password (pw) (blank = use rsh):
flags (f)            : 0x0
target name (tn)     : c1susaux
startup script (s)   : /cvs/cds/caltech/target/c1susaux/startup.cmd
other (o)            :

value = 0 = 0x0
c1susaux >

If we go through that again (it comes up line-by-line, and you must press Enter to go to the next line) and put a period at the end of the Host Name line and the Host Inet (h) line, they will come up blank the next time around.  So, the next time you run bootChange, you can type "chiara" for the host name, and "192.168.113.104" for the "host inet (h)".  If you run bootChange one more time, you'll see that the new values are in there, so that's good.
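
For illustration, a condensed sketch of the two passes described above (only the relevant prompts shown; typing '.' after the old value clears the field):

c1susaux > bootChange
host name            : linux1 .
host inet (h)        : 192.168.113.20 .
...
c1susaux > bootChange
host name            : chiara
host inet (h)        : 192.168.113.104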

However, when we then tried to reboot the computers, I think they weren't coming back after this point.  (Unfortunately, this is one of those things that I should have elogged back on Friday, since I don't remember precisely.)  Certainly whatever the effect was, it wasn't what I wanted, and I left with the machines that I had tried rebooting not running.

  10011   Mon Jun 9 12:19:17 2014   ericq   Update   CDS   Computer status

Quote:

The first nameserver on all of the workstation machines inside of the file /etc/resolv.conf has been changed to be 192.168.113.104, which is Chiara's IP address (it used to be 192.168.113.20, which was linux1).  This change has also been made on the framebuilder, and in the framebuilder's /diskless/root/etc/resolv.conf file, which is what all of the fast front ends look to. 

On the framebuilder, and in the /diskless place for the fast front ends, presumably we must have changed something to point at the new location for the shared drive, but I don't remember how we did that [ERIC, what did we do???]

In all of the fstabs, we're using chiara's IP instead of its name, so that if the nameserver part isn't working, we can still get the NFS mounts.

On control room computers, we mount the NFS through /etc/fstab having lines like:

192.168.113.104:/home/cds /cvs/cds nfs rw,bg 0 0
fb:/frames /frames nfs ro,bg 0 0

Then, things like /cvs/cds/foo are locally symlinked to /opt/foo
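
For example, a couple of the links might be made like this (a sketch; rtcds and rtapps are assumed here, and the full set of links may differ):

sudo ln -s /cvs/cds/rtcds  /opt/rtcds
sudo ln -s /cvs/cds/rtapps /opt/rtapps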

For the diskless machines, we edited the files in /diskless/root. On FB, /diskless/root/etc/fstab becomes

master:/diskless/root                   /         nfs     sync,hard,intr,rw,nolock,rsize=8192,wsize=8192    0 0
master:/usr                             /usr      nfs     sync,hard,intr,ro,nolock,rsize=8192,wsize=8192    0 0
master:/home                            /home     nfs     sync,hard,intr,rw,nolock,rsize=8192,wsize=8192    0 0
none                                    /proc     proc    defaults          0 0
none                                    /var/log        tmpfs   size=100m,rw    0 0
none                                    /var/lib/init.d tmpfs   size=100m,rw    0 0
none                                    /dev/pts        devpts  rw,nosuid,noexec,relatime,gid=5,mode=620        0 0
none                                    /sys            sysfs   defaults        0 0
master:/opt                             /opt      nfs    async,hard,intr,rw,nolock  0 0
192.168.113.104:/home/cds/rtcds         /opt/rtcds      nfs     nolock  0 0
192.168.113.104:/home/cds/rtapps        /opt/rtapps     nfs     nolock  0 0

("master" is defined in /diskless/root/etc/hosts to be 192.168.113.202, which is fb's IP)

and /diskless/root/etc/resolv.conf becomes:

search martian

nameserver 192.168.113.104 #Chiara

  10015   Mon Jun 9 22:26:44 2014   rana, zach   Update   CDS   SLOW controls recovery

All of the SLOW computers had been in limbo since the fileserver/nameserver change, but Zach and I brought them back.

One of the troubles was that we were unable to telnet into these computers once they failed to boot (due to not having a connection to their bootserver).

  1. Needed a special DB9-RJ45 cable to connect from (old) laptop serial ports to the Motorola VME162 machines (e.g. c1psl, c1iool0, c1aux, etc.); thanks to Dave Barker for sending me the details on how to make these. Tara found 2 of these that Frank or PeterK had left there and saved us a huge hassle. Most new laptops don't have a serial port, but in principle there's a way to do this by using one of our USB-Serial adapters. We didn't try this, but just used an old laptop. The RJ45 connector must go into the top connector of the bottom 4; it's labeled as 'console' on some of the VME computers. Thanks to K. Thorne for this very helpful hint and to Rolf for pointing me to KT.
  2. Installed 'minicom' on these machines to allow communication via the serial port.
  3. Had to install RSH on chiara to allow the VME computers to connect to it. Also added the names of all the slow machines to /etc/hosts.equiv to allow for password-less login; without this they were not able to load the vxWorks binary (see the sketch after this list). It was tricky to get RSH to work, since it's an insecure and deprecated service. 'rsh-server' doesn't work, but installing 'rsh-redone-server' did finally work for passwordless access. Must be that linux1 had RSH enabled, but of course, this was undocumented.
  4. Some of the SLOW machines didn't have their own target names or startup.cmd in their startup boot parameters (???). I fixed these.
  5. For C1VAC1, I have updated the boot parameters via bootChange, but I have not rebooted it. Waiting to do so when Koji and Steve are both around. We should make sure not to forget to do this on C1VAC2. Steve always tells us that it never works, but actually it does. It just crashes every so often.
  6. Leaving C1AUXEX and C1AUXEY for Q and Jacy to do, to see if this ELOG is good enough.
  7. The PSL crate still starts up with its SysFail light lit red, but that doesn't seem to bother the c1psl operation. We (Steve) should go around and put a label on all the crates where SysFail is lit during 'normal' operation. Misleading warning lights are a bad thing.
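
For item 3, the setup on chiara presumably amounted to something like the following sketch (the exact list of hostnames is an assumption):

sudo apt-get install rsh-redone-server
# one line per slow machine in /etc/hosts.equiv allows password-less rsh:
echo "c1psl"   | sudo tee -a /etc/hosts.equiv
echo "c1iool0" | sudo tee -a /etc/hosts.equiv
echo "c1aux"   | sudo tee -a /etc/hosts.equiv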

We still don't have complete control of the MC Servo board, so we need the morning crew to start checking that out.

An example session (using telnet, not the laptop/serial way) where we use bootChange to examine the correct c1aux config:

controls@pianosa|target> telnet c1aux
Trying 192.168.113.61...
Connected to c1aux.martian.
Escape character is '^]'.

c1aux > bootChange

'.' = clear field;  '-' = go to previous field;  ^D = quit

boot device          : ei
processor number     : 0
host name            : chiara
file name            : /cvs/cds/vw/mv162-262-16M/vxWorks
inet on ethernet (e) : 192.168.113.61:ffffff00
inet on backplane (b):
host inet (h)        : 192.168.113.104
gateway inet (g)     :
user (u)             : controls
ftp password (pw) (blank = use rsh):
flags (f)            : 0x0
target name (tn)     : c1aux
startup script (s)   : /cvs/cds/caltech/target/c1aux/startup.cmd
other (o)            :

value = 0 = 0x0
c1aux >

  10016   Mon Jun 9 22:40:36 2014   Jenne   Update   CDS   Fast front end computers up

Rana and I now seem to have the fast front end computers (c1lsc, c1sus, c1ioo, c1iscex and c1iscey) up and running!  Hooray!

It seemed that we needed to change the soft links back to hard links for rtcds and rtapps on the front end machines.  On c1ioo, we did:

cd /opt
sudo rm -rf rtcds
sudo rm -rf rtapps
sudo mkdir rtcds
sudo mkdir rtapps
sudo chown controls:1001 rtcds
sudo chown controls:1001 rtapps
mount /opt/rtcds
mount /opt/rtapps

At this time, the front end fstab had several other options in addition to "nolock" for both rtcds and rtapps: rw,bg,user,nolock.  This state still had some permissions problems.  (Later, we decided that perhaps our next step was unnecessary, since it still left us with (fewer) permissions problems. Taking out the rw,bg,user options from the front end fstab seems to have fixed all permissions issues, so maybe this next chmod step didn't need to be done.  But it was done, so I record it for completeness.)

On chiara, we did:

cd /home/cds/rtcds
sudo chmod -R 777 *

Then on c1iscex, I didn't have to deal with the soft links, but I did need to mount the rtcds and rtapps directories so that I could see files in them.  I just did the last 2 operations from the c1ioo list above (mount /opt/rtcds and mount /opt/rtapps). 

Since we were still seeing some (fewer) permissions problems, we took out the extra options in the front ends' fstab that Rana had added.  Rebooting c1iscex after this, everything came back as expected.  Nice!
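
For concreteness, the fix amounted to going from front-end fstab lines like the first of these to the second (a sketch; the final working fstab is recorded at the bottom of this entry):

192.168.113.104:/home/cds/rtcds  /opt/rtcds  nfs  rw,bg,user,nolock  0 0    # had permissions problems
192.168.113.104:/home/cds/rtcds  /opt/rtcds  nfs  nolock             0 0    # works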

I think that, at this point, remotely rebooting (sudo shutdown -r now) the other front ends made everything come back nicely. Since we had gotten the fstab situation correct, we didn't have to mount any directories by hand, and all of the models restarted on their own.  Finally!

For posterity, here are things that we'll want to remember:

Frame builder's fstab, in /etc/fstab (only the uncommented lines, since there are lots of comments):

/dev/sdb1               /               ext3            noatime         0 1
/swapfile               none            swap            sw              0 0
shm                     /dev/shm        tmpfs           nodev,nosuid,noexec     0 0
/dev/sda1               /frames         ext3    noatime         0 0
192.168.113.104:/home/cds/                      /cvs/cds        nfs     _netdev,auto,rw,bg,soft      0 0
192.168.113.104:/home/cds/rtcds                  /opt/rtcds     nfs     _netdev,auto,rw,bg,soft 0 0
192.168.113.104:/home/cds/rtapps                 /opt/rtapps    nfs     _netdev,auto,rw,bg,soft 0 0

Fast front end fstabs, which are on the framebuilder in /diskless/root/etc/fstab:

master:/diskless/root                   /               nfs     sync,hard,intr,rw,nolock,rsize=8192,wsize=8192    0 0
master:/usr                             /usr            nfs     sync,hard,intr,ro,nolock,rsize=8192,wsize=8192    0 0
master:/home                            /home           nfs     sync,hard,intr,rw,nolock,rsize=8192,wsize=8192    0 0
none                                    /proc           proc    defaults          0 0
none                                    /var/log        tmpfs   size=100m,rw    0 0
none                                    /var/lib/init.d tmpfs   size=100m,rw    0 0
none                                    /dev/pts        devpts  rw,nosuid,noexec,relatime,gid=5,mode=620        0 0
none                                    /sys            sysfs   defaults        0 0
master:/opt                             /opt            nfs     async,hard,intr,rw,nolock  0 0
192.168.113.104:/home/cds/rtcds         /opt/rtcds      nfs     nolock                     0 0
192.168.113.104:/home/cds/rtapps        /opt/rtapps     nfs     nolock                     0 0

  10018   Tue Jun 10 09:25:29 2014   Jamie   Update   CDS   Computer status: should not be changing names

I really think it's a bad idea to be making all these name changes.  You're making things much much harder for yourselves.

Instead of repointing everything to a new host, you should have just changed the DNS to point the name "linux1" to the IP address of the new server.  That way you wouldn't need to reconfigure all of the clients.  That's the whole point of  name service: use a name so that you don't need to point to a number.

Also, pointing to an IP address for this stuff is not a good idea.  If the IP address of the server changes, everything will break again.

Just point everything to linux1, and make the DNS entries for linux1 point to the IP address of chiara.  You're doing all this work for nothing!

RXA: Of course, I understand what DNS means. I wanted to make the changes to the startup to remove any misconfigurations or spaghetti mount situations (of which we found many). The way the VME162s are designed, changing the name doesn't make the fix - it uses the number instead. And, of course, the main issue was not the DNS, but just that we had to set up RSH on the new machine. This is all detailed in the ELOG entries we've made, but it might be difficult to understand remotely if you are not familiar with the 40m CDS system.

  10025   Wed Jun 11 14:36:57 2014   Jenne   Update   CDS   SLOW controls recovery

 

 I have brought back c1auxex and c1auxey.  Hopefully this elog will have some more details to add to Rana's elog 10015, so that in the end, we have the whole process documented.

The old Dell computer was already in a Minicom session, so I didn't have to start that up - hopefully it's just as easy as opening the program.  (Edit, JCD, 9July2014:  Start up a terminal session, and then type "minicom" and press enter to get a Minicom session.)

I plugged the DB9-RJ45 cable into the top of the RJ45 jacks on the computers.  Since the aux end station computers hadn't had their bootChanges done yet, the prompt was "VxWorks Boot".  For a computer that was already configured, for example the psl machine, the prompt was "c1psl", the name of the machine.  So, the indication that work needs to be done is either you get the Boot prompt, or the computer starts to hang while it's trying to load the operating system (since it's not where the computer expects it to be).  If the computer is hanging, key the crate again to power cycle it.  When it gets to the countdown that says "press any key to enter manual boot" or something like that, push some key.  This will get you to the "VxWorks Boot" prompt. 

Once you have this prompt, press "?" to get the boot help menu.  Press "p" to print the current boot parameters (the same list of things that you see with the bootChange command when you telnet in).  Press "c" to go line-by-line through the parameters, with the option to change them.  I discovered that you can just type what you want the parameter to be next to the old value, and that will change the value.  (ex.  "host name   : linux1   chiara"   will change the host name from the old value of linux1 to the new value of chiara that you just typed.)

After changing the appropriate parameters (as with all the other slow computers, just the [host name] and the [host inet] parameters needed changing), key the crate one more time and let it boot.  It should boot successfully, and when it has finished and given you the name for the prompt (ex. c1auxex), you can just pull out the RJ45 end of the cable from the computer, and move on to the next one.
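
A condensed sketch of the serial-console interaction described above (the '@' boot command is part of the standard VxWorks boot monitor; keying the crate works just as well):

[VxWorks Boot]: p                                   <-- print current boot parameters
[VxWorks Boot]: c                                   <-- change parameters line-by-line
host name            : linux1 chiara
host inet (h)        : 192.168.113.20 192.168.113.104
[VxWorks Boot]: @                                   <-- boot with the new parameters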

 

  10027   Wed Jun 11 15:57:18 2014   Jenne   Update   CDS   Note on cables for talking to slow computers

We now have 2 cables in the lab that are RJ45-DB9.  The gray one is LIGO-made, while the blue one is store-bought.

The gray LIGO-made one works, but the blue store-bought one does not.  I checked their pinouts, and they are completely different.  In the sketch below, the connectors are drawn as seen face-on, with the cables going out the back of the page.  The DB9 is female.

06111401.PDF

  10028   Wed Jun 11 16:01:31 2014   Steve   Update   CDS   c1Vac1 and c1vac2 rebooted

Quote:

 

 I have brought back c1auxex and c1auxey.  Hopefully this elog will have some more details to add to Rana's elog 10015, so that in the end, we have the whole process documented.

The old Dell computer was already in a Minicom session, so I didn't have to start that up - hopefully it's just as easy as opening the program.

I plugged the DB9-RJ45 cable into the top of the RJ45 jacks on the computers.  Since the aux end station computers hadn't had their bootChanges done yet, the prompt was "VxWorks Boot" (or something like that).  For a computer that was already configured, for example the psl machine, the prompt was "c1psl", the name of the machine.  So, the indication that work needs to be done is either you get the Boot prompt, or the computer starts to hang while it's trying to load the operating system (since it's not where the computer expects it to be).  If the computer is hanging, key the crate again to power cycle it.  When it gets to the countdown that says "press any key to enter manual boot" or something like that, push some key.  This will get you to the "VxWorks Boot" prompt. 

Once you have this prompt, press "?" to get the boot help menu.  Press "p" to print the current boot parameters (the same list of things that you see with the bootChange command when you telnet in).  Press "c" to go line-by-line through the parameters with the option to change parameters.  I discovered that you can just type what you want the parameter to be next to the old value, and that will change the value.  (ex.  "host name   : linux1   chiara"   will change the host name from the old value of linux1 to the new value that you just typed of chiara). 

After changing the appropriate parameters (as with all the other slow computers, just the [host name] and the [host inet] parameters needed changing), key the crate one more time and let it boot.  It should boot successfully, and when it has finished and given you the name for the prompt (ex. c1auxex), you can just pull out the RJ45 end of the cable from the computer, and move on to the next one.

 

 Koji, Jenne and Steve

 

Preparation to reboot:

1. Closed VA6 and V5, and disconnected the cables to the valves (closed all the annuli).

2. Closed V1, disconnected it, and stopped the Maglev rotation.

3. Closed V4 and disconnected its cable.

   See Atm1.  This setup insured that there could not be any accidental valve switching to vent the vacuum envelope if reboot chaos struck. [moving = disconnected]

4. RESET c1Vac1 and c1Vac2, one by one and together. They both went at once. We did NOT power cycle them.

    Jenne entered the new "carma" words on the old Dell laptop and checked that the answers were good. The reboot was done.

    Note: c1Vac1's green RUN indicator LED is yellow. It is fine as yellow.

5. Checked and TOGGLED valve positions to the correct values. (We did not correct the small turbo pumps' monitor positions, but they were alive.)

6. V4 was reconnected and opened. The Maglev was started.

7. The V1 cable was reconnected, and V1 opened at the full rotation speed of 560 Hz.

8. The V5 cable was reconnected and the valve opened; the VA6 cable was connected and VA6 opened.

9. The Vacuum Normal valve configuration was reached.

 

Attachment 1: beforeReboot.png
  10033   Thu Jun 12 15:31:47 2014   Jamie   Update   CDS   Note on cables for talking to slow computers

Quote:

We have (now) in the lab 2 cables that are RJ45-DB9.  The gray one is LIGO-made, while the blue one is store-bought.  

The gray LIGO-made one works, but the blue store-bought one does not.  I checked their pinouts, and they are completely different.  On the sketch below, the pictures of the connectors is me looking at them face-on, with the cables going out the back of the page.  The DB9 is female. 

 There are RJ45-DB9 adapters in the big spinny rack next to the linux1 rack that are for this exact purpose.  Just use a standard ethernet cable with them.

  10040   Sun Jun 15 14:26:30 2014   Jamie   Omnistructure   CDS   cdsutils re-installed

Quote:

 CDSUTILS is also gone from the path on all the workstations, so we need Jamie to tell us by ELOG how to set it up, or else we have to use ezcaread / ezcawrite forever.

It's in the elog already: http://nodus.ligo.caltech.edu:8080/40m/9922

But it seems like things still haven't fully recovered, or have recovered to an old state?  Why is the cdsutils install I previously did in /ligo/apps now missing?  It seems like other directories are missing as well.

There's also a user:group issue with the /home/cds mounts.  Everything in those mount points is owned by nobody:nogroup.

I also can't log into pianosa and rosalba.

  10041   Sun Jun 15 14:41:08 2014   Jamie   Omnistructure   CDS   cdsutils re-installed

Quote:

Quote:

 CDSUTILS is also gone from the path on all the workstations, so we need Jamie to tell us by ELOG how to set it up, or else we have to use ezcaread / ezcawrite forever.

It's in the elog already: http://nodus.ligo.caltech.edu:8080/40m/9922

But it seems like things still haven't fully recovered, or have recovered to an old state?  Why is the cdsutils install I previously did in /ligo/apps now missing?  It seems like other directories are missing as well.

There's also a user:group issue with the /home/cds mounts.  Everything in those mount points is owned nobody:nogroup.

I also can't log into pianosa and rosalba.

 I also still think it's a bad idea for everything to be mounting /home/cds from an IP address.  Just make a new DNS entry for linux1 and leave everything as it was.

  10057   Wed Jun 18 13:39:01 2014   rana   Update   CDS   cdsutils reverted to version 238

 

 After some email consultation with Jamie, I got cdsutils working again by reverting to v238. The newest versions are not compatible with our python 2.6 on Ubuntu 10, and our Ubuntu 12 machines do not have NDS2 clients that work yet.

The read/write commands work at the moment, but the NDS-based ones don't yet work on pianosa, due to some NDSSERVER flag/setup issue maybe, Jamie??

controls@pianosa|~ > z
usage: cdsutils <cmd> <args>

Advanced LIGO Control Room Utilites

Available commands:

  read         read EPICS channel value
  write        write EPICS channel value
  switch       switch buttons in standard LIGO filter module
  avg          average one or more NDS channels for a specified amount of seconds
  servo        servos channel with a simple integrator (pole at zero)
  trigservo    servos channel with a simple integrator (pole at zero)

  version      print version info and exit
  help         this help

Add '-h' after individual commands for command help.

controls@pianosa|~ > z read C1:LSC-DARM_GAIN
0.0
controls@pianosa|~ 2> z avg 3 C1:IOO-MC_F
Error in write(): Connection refused
Error in write(): Connection refused
NDSSERVER variable incorrectly defined, or no NDS servers available.
controls@pianosa|~ 1> echo $NDSSERVER
fb

  10067   Wed Jun 18 22:47:48 2014   ericq   Update   CDS   Raspberry pi added to martian network

I set up a raspberry pi on the martian network, to be hooked up to a frequency counter for tracking ALS beatnotes. 

The instructions at https://wiki-40m.ligo.caltech.edu/Martian_Host_Table are outdated; the name server configuration is now at /etc/bind/zones/martian.db. I need to remember to update the wiki soon.

In any case, the raspberry pi is called "domenica," is found at 192.168.113.107, and has the standard controls user, with /cvs/cds mounted in the same way as the control room machines. 
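
For reference, the A record for a host like this in /etc/bind/zones/martian.db should look something like the following (a sketch; the zone file's exact layout may differ):

domenica        IN      A       192.168.113.107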

Once I'm comfortable with the configuration of the pi, I'm going to take an image of the SD card that serves as its hard drive, so that we can just image new cards for future raspberry pis on the martian network if we ever want them. 

  10088   Mon Jun 23 20:58:38 2014   ericq   Update   CDS   Bootfest 2014!

This afternoon, I wanted to start the nominal alignment/adjustment steps for evening time locking, but got sucked into CDS frustrations. 

Primary symptom: TRX and TRY signals were not making it from C1:SUS-ETMX_TR[X,Y]_OUT to C1:LSC-TR[X,Y]_IN1. Various RFM bits were red on the CDS status page. 

 Secondary symptom: ITMX was randomly getting a good sized kick for no apparent reason. I still don't know what was behind this. 

First fix attempt: run sudo ntpdate -b -s -u pool.ntp.org on c1sus and c1lsc front ends, to see if NTP issues were responsible. No result.

Second fix attempt: Restart the c1lsc, c1sus and c1rfm models. No change.

Next fix attempt: Restart the c1lsc and c1sus frontend machines. The c1lsc models come back; the c1sus models fail to sync / time out; dmesg has some weird message about ADC channel hopping. At this point, c1ioo, c1iscey and c1iscex all have their models stop working due to sync problems.

I then ran the above ntp command on all front ends and the FB, and restarted everyone's models (except c1lsc, which stayed working from here on out), which didn't change anything. I command-line rebooted all front ends (except c1lsc) and the FB (which had some dmesg messages about daqd segfaulting, but daqd issues weren't the problem). Still nothing.

Finally, Koji came along and relieved me from my agony by hard rebooting all of the front ends; pulling out their power cables and seeing the life in their lights fade away... He did this first with the end station machines (c1iscey and c1iscex), and we saw them come back up perfectly happy, and then c1ioo and c1sus followed. At this point, all models came back; green RFM bits abounding, and TR[X,Y] signals propagating through as desired. 

Then, we tried turning the damping/watchdogs back on, which for some strange reason started shaking the hell out of everyone except the ETMs and ITMX. We restarted c1sus and c1mcs, and then damping worked again. Maybe a bad BURT restore was to blame?

At this point, all models were happy, all optics were damped, and the mode cleaner + WFS locked happily, but no beams were to be seen in the IFO.

The Yarm green would lock fine though, so tip-tilt alignment is probably to blame. I then left the interferometer to Jenne and Koji. 

  10109   Fri Jun 27 20:52:30 2014   Koji   Update   CDS   OTTAVIA was not on network

I came into the lab and found a bunch of white EPICS boxes on ottavia.
It turned out that only ottavia had been kicked off the network.

After some struggle, I figured out that ottavia needs its ethernet cable unplugged and replugged
to connect (or reconnect) to the network.

For some unknown reason, ottavia was isolated from the martian network and couldn't come back.
This left the MC autolocker frozen.

I logged in to megatron from ottavia, and in .../scripts/MC ran:

nohup ./AutoLockMC.csh &

Now the MC is happy.

  10135   Mon Jul 7 13:44:21 2014   Jenne   Update   CDS   c1sus - bad fb connection

Quote:

 

I managed to recover c1sus.  It required stopping all the models, and then restarting them one-by-one:

$ rtcds stop all     # <-- this does the right thing to stop all the models, with the IOP stopped last, so they will all unload properly.

$ rtcds start iop

$ rtcds start c1sus c1mcs c1rfm

I have no idea why the c1sus models got wedged, or why restarting them in this way fixed the issue.

 In addition to needing obnoxiously regular mxstream restarts, this afternoon the sus machine was doing something slightly different.  Only 1 fb block per core was red (the mxstream symptom is 3 fb-related blocks red per core), and restarting the mxstream didn't help.  Anyhow, I was searching through the elog, and this entry to which I'm replying had similar symptoms.  However, by the time I went back to the CDS FE screen, c1sus had regular mxstream symptoms, and an mxstream restart fixed things right up.

So, I don't know what the issue is or was, nor do I know why it is fixed, but it's fine for now; I wanted to make a note for the future.

  10165   Wed Jul 9 17:08:29 2014   Jenne   Update   CDS   c1auxex "Unknown Host"

c1auxex has forgotten who it is.  Slow sliders for the QPD head were not responding, so I did a soft reboot from telnet. The machine didn't come back, so I plugged the RJ45-DB9 cable into the machine and looked at it through a minicom session.  When I key the crate, it gives me an error that it can't load a file, with the error code 0x320001.  Looking that up in a list of VxWorks error codes, I see that it is:    S_hostLib_UNKNOWN_HOST (3276801 or 0x320001)

I'm not sure how this happened.  I unplugged and replugged in the ethernet cable on the computer, but that didn't help.  Rana is going in to wiggle the other end of the ethernet cable, in case that's the problem.  EDIT:  Replacing the ethernet cable did not help.

Former elogs that are useful:  10025, 10015

EDIT:  The actual error message is:

boot device          : ei                                                      
processor number     : 0                                                       
host name            : chiara                                                  
file name            : /cvs/cds/vw/mv162-262-16M/vxWorks                       
inet on ethernet (e) : 192.168.113.59:ffffff00                                 
host inet (h)        : 192.168.113.104                                         
user (u)             : controls                                                
flags (f)            : 0x0                                                     
target name (tn)     : c1auxex                                                 
startup script (s)   : /cvs/cds/caltech/target/c1auxex/startup.cmd             
                                                                               
Attaching network interface ei0... done.                                       
Attaching network interface lo0... done.                                       
Loading...                                                                     
Error loading file: errno = 0x320001.                                          
Can't load boot file!! 

  10181   Fri Jul 11 08:11:56 2014   Steve   Update   CDS   ETMX needs help
Attachment 1: ETMX.png
  10188   Fri Jul 11 22:02:52 2014   Jamie, Chris   Omnistructure   CDS   cdsutils: multifarious upgrades

To make the latest cdsutils available in the control room, we've done the following:

Upgrade pianosa to Ubuntu 12 (cdsutils depends on python2.7, not found in the previous release)

  • Enable distribution upgrades in the Ubuntu Software Center prefs
  • Check for updates in the Update Manager and click the big "Upgrade" button

Note that rossa remains on Ubuntu 10 for now.

Upgrade cdsutils to r260

  • Instructions here
  • cdsutils-238 was left as the default pointed to by the cdsutils symlink, for rossa's sake

Built and installed the nds2-client (a cdsutils dependency)

  • Checked out the source tree from svn into /ligo/svncommon/nds2
  • Built tags/nds2_client_0_10_5 (install instructions are here; build dependencies were installed by apt-get on chiara)
  • ./configure --prefix=/ligo/apps/ubuntu12/nds2-client-0.10.5; make; make install
  • In /ligo/apps/ubuntu12: ln -s nds2-client-0.10.5 nds2-client

nds2-client was apparently installed locally as a deb in the past, but the version in lscsoft seems broken currently (unknown symbols?). We should revisit this.

Built and installed pyepics (a cdsutils dependency)

  • Download link to ~/src on chiara
  • python setup.py build; python setup.py install --prefix=/ligo/apps/ubuntu12/pyepics-3.2.3
  • In /ligo/apps/ubuntu12: ln -s pyepics-3.2.3 pyepics

pyepics was also installed as a deb before; should revisit when Jamie gets back.

Added the gqrx ppa and installed gnuradio (dependency of the waterfall plotter)

Added a test in /ligo/apps/ligoapps-user-env.sh to load the new cdsutils only on Ubuntu 12.
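
The test presumably looks something like the following sketch (the variable names and install path here are assumptions, not the actual file contents):

# in /ligo/apps/ligoapps-user-env.sh: only use the new cdsutils on Ubuntu 12
if lsb_release -rs 2>/dev/null | grep -q '^12'; then
    export PATH=/ligo/apps/ubuntu12/cdsutils/bin:${PATH}
fi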

The end result:

controls@chiara|~ > z
usage: cdsutils <cmd> <args>

Advanced LIGO Control Room Utilites

Available commands:

  read         read EPICS channel value
  write        write EPICS channel value
  switch       switch buttons in standard LIGO filter module
  avg          average one or more NDS channels for a specified amount of seconds
  servo        servos channel with a simple integrator (pole at zero)
  trigservo    servos channel with a simple integrator (pole at zero)
  audio        Play channel as audio stream.
  dv           Plot time series of channels from NDS.
  water        Live waterfall plotter for LIGO data

  version      print version info and exit
  help         this help

Add '-h' after individual commands for command help.
 
 
  10189   Fri Jul 11 22:28:34 2014   Chris   Update   CDS   c1auxex "Unknown Host"

Quote:

c1auxex has forgotten who it is.  Slow sliders for the QPD head were not responding, so I did a soft reboot from telnet. The machine didn't come back, so I plugged the RJ45-DB9 cable into the machine and looked at it through a minicom session.  When I key the crate, it gives me an error that it can't load a file, with the error code 0x320001.  Looking that up on a List of VxWorks error codes, I see that it is:    S_hostLib_UNKNOWN_HOST (3276801 or 0x320001)

I'm not sure how this happened.  I unplugged and replugged in the ethernet cable on the computer, but that didn't help.  Rana is going in to wiggle the other end of the ethernet cable, in case that's the problem.  EDIT:  Replacing the ethernet cable did not help.

Former elogs that are useful:  10025, 10015

EDIT:  The actual error message is:

boot device          : ei                                                      
processor number     : 0                                                       
host name            : chiara                                                  
file name            : /cvs/cds/vw/mv162-262-16M/vxWorks                       
inet on ethernet (e) : 192.168.113.59:ffffff00                                 
host inet (h)        : 192.168.113.104                                         
user (u)             : controls                                                
flags (f)            : 0x0                                                     
target name (tn)     : c1auxex                                                 
startup script (s)   : /cvs/cds/caltech/target/c1auxex/startup.cmd             
                                                                               
Attaching network interface ei0... done.                                       
Attaching network interface lo0... done.                                       
Loading...                                                                     
Error loading file: errno = 0x320001.                                          
Can't load boot file!! 

 We fixed this problem (at least for now) by adding c1auxex to the /etc/hosts file on chiara (following a hint from this page). The DNS setup might be the culprit here.
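
That is, an /etc/hosts line along these lines (the address comes from c1auxex's boot parameters above):

192.168.113.59    c1auxex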

  10339   Wed Aug 6 13:17:21 2014   ericq   Omnistructure   CDS   cdsutils: multifarious upgrades

I've checked out cdsutils-274 to /opt/rtcds/cdsutils, and updated the /ligo/apps/ligoapps-user-env.sh to have the newer machines use it by default. This was to gain access to the cdsutils.Step methods for use in the smooth ASS handoffs script. 

  10426   Fri Aug 22 18:00:08 2014   jamie   Omnistructure   CDS   ubuntu12 awgstream installed

I installed awgstream-2.16.14 in /ligo/apps/ubuntu12.  As with all the ubuntu12 "packages", you need to source the ubuntu12 ligoapps environment script:

controls@pianosa|~ > . /ligo/apps/ubuntu12/ligoapps-user-env.sh
controls@pianosa|~ > which awgstream
/ligo/apps/ubuntu12/awgstream-2.16.14/bin/awgstream
controls@pianosa|~ > 

I tested it on the SRM LSC filter bank.  In one terminal I opened a camonitor on C1:SUS-SRM_LSC_OUTMON (output below).  In another terminal I ran the following:

controls@pianosa|~ > seq 0 .1 16384  | awgstream C1:SUS-SRM_LSC_EXC 16384 -
Channel = C1:SUS-SRM_LSC_EXC
File    = -
Scale   =          1.000000
Start   = 1092790384.000000
controls@pianosa|~ > 

The camonitor output was:

controls@pianosa|~ > camonitor C1:SUS-SRM_LSC_OUTMON
C1:SUS-SRM_LSC_OUTMON          2014-08-22 17:44:50.997418 0  
C1:SUS-SRM_LSC_OUTMON          2014-08-22 17:52:49.155525 218.8  
C1:SUS-SRM_LSC_OUTMON          2014-08-22 17:52:49.393404 628.4  
C1:SUS-SRM_LSC_OUTMON          2014-08-22 17:52:49.629822 935.6  
...
C1:SUS-SRM_LSC_OUTMON          2014-08-22 17:52:58.210810 15066.8  
C1:SUS-SRM_LSC_OUTMON          2014-08-22 17:52:58.489501 15476.4  
C1:SUS-SRM_LSC_OUTMON          2014-08-22 17:52:58.747095 15886  
C1:SUS-SRM_LSC_OUTMON          2014-08-22 17:52:59.011415 0 

In other words, it seems to work.

  10480   Tue Sep 9 23:05:01 2014   Koji   Update   CDS   OTTAVIA lost network connection

Today the network connection of OTTAVIA was sporadic.

Then in the evening OTTAVIA lost it completely. I tried jiggling the cables to recover it, but in vain.

We wonder if the network card (the on-board one) has an issue.

  10483   Wed Sep 10 02:35:55 2014   rana   Update   CDS   OTTAVIA lost network connection

Quote:

Today the network connection of OTTAVIA was sporadic.

Then in the evening OTTAVIA lost completely it. I tried jiggle the cables to recover it, but in vain.

We wonder if the network card (on-board one) has an issue.

 I would also suspect IP conflicts; I had temporarily put the iMac on the Ottavia IP wire a few weeks ago. Hopefully it's not back on there.

  10484   Wed Sep 10 02:52:05 2014   ericq   Update   CDS   OTTAVIA lost network connection

I checked chiara's tables, and all seemed fine. I swapped the ethernet cable: the black one labelled "allegra," which seemed maybe fragile, for the teal one that may have been chiara's old ethernet cable. It's back on the network now; hopefully it lasts.

  10585   Wed Oct 8 15:31:31 2014   Jenne   Update   CDS   Computer status

After the Great Computer Meltdown of 2014, we forgot about poor c0rga, which is why the RGA hasn't been recording scans for the past several months (as Steve noted in elog 10548).

Q helped me remember how to fix it.  We added 3 lines to its /etc/fstab file, so that it knows to mount from Chiara and not Linux1.  We changed the resolv.conf file, and Q made some symlinks.
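
The 3 lines were presumably the same NFS mounts used elsewhere, e.g. (a sketch based on the framebuilder fstab quoted below; c0rga's exact mount options may differ):

192.168.113.104:/home/cds          /cvs/cds       nfs    rw,bg    0 0
192.168.113.104:/home/cds/rtcds    /opt/rtcds     nfs    rw,bg    0 0
192.168.113.104:/home/cds/rtapps   /opt/rtapps    nfs    rw,bg    0 0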

Steve and I ran ..../scripts/RGA/RGAset.py on c0rga to set up the RGA's settings after the power outage. We're checking to make sure that the RGA will run right now; then we'll set it back to the usual daily 4am run via cron.

EDIT, JCD:  Ran ..../scripts/RGA/RGAlogger.py, saw that it works and logs data again.  Also, c0rga had a slightly off time, so I ran sudo ntpdate -b -s -u pool.ntp.org, and that fixed it.
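
The crontab entry for the daily 4am run would look something like this (a sketch; the elided path is as above):

0 4 * * *    .../scripts/RGA/RGAlogger.py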

Quote:

 

In all of the fstabs, we're using chiara's IP instead of name, so that if the nameserver part isn't working, we can still get the NFS mounts.

On control room computers, we mount the NFS through /etc/fstab having lines like:

192.168.113.104:/home/cds /cvs/cds nfs rw,bg 0 0
fb:/frames /frames nfs ro,bg 0 0

Then, things like /cvs/cds/foo are locally symlinked to /opt/foo

For the diskless machines, we edited the files in /diskless/root. On FB, /diskless/root/etc/fstab becomes

master:/diskless/root                   /         nfs     sync,hard,intr,rw,nolock,rsize=8192,wsize=8192    0 0
master:/usr                             /usr      nfs     sync,hard,intr,ro,nolock,rsize=8192,wsize=8192    0 0
master:/home                            /home     nfs     sync,hard,intr,rw,nolock,rsize=8192,wsize=8192    0 0
none                                    /proc     proc    defaults          0 0
none                                    /var/log        tmpfs   size=100m,rw    0 0
none                                    /var/lib/init.d tmpfs   size=100m,rw    0 0
none                                    /dev/pts        devpts  rw,nosuid,noexec,relatime,gid=5,mode=620        0 0
none                                    /sys            sysfs   defaults        0 0
master:/opt                             /opt      nfs    async,hard,intr,rw,nolock  0 0
192.168.113.104:/home/cds/rtcds         /opt/rtcds      nfs     nolock  0 0
192.168.113.104:/home/cds/rtapps        /opt/rtapps     nfs     nolock  0 0

("master" is defined in /diskless/root/etc/hosts to be 192.168.113.202, which is fb's IP)

and /diskless/root/etc/resolv.conf becomes:

search martian

nameserver 192.168.113.104 #Chiara

 

 

 

  10596   Fri Oct 10 14:27:44 2014   Steve   Update   CDS   c1Vac1 and c1vac2 reboot was a failure

Quote:

Quote:

 

 I have brought back c1auxex and c1auxey.  Hopefully this elog will have some more details to add to Rana's elog 10015, so that in the end, we have the whole process documented.

The old Dell computer was already in a Minicom session, so I didn't have to start that up - hopefully it's just as easy as opening the program.

I plugged the DB9-RJ45 cable into the top of the RJ45 jacks on the computers.  Since the aux end station computers hadn't had their bootChanges done yet, the prompt was "VxWorks Boot" (or something like that).  For a computer that was already configured, for example the psl machine, the prompt was "c1psl", the name of the machine.  So, the indication that work needs to be done is either you get the Boot prompt, or the computer starts to hang while it's trying to load the operating system (since it's not where the computer expects it to be).  If the computer is hanging, key the crate again to power cycle it.  When it gets to the countdown that says "press any key to enter manual boot" or something like that, push some key.  This will get you to the "VxWorks Boot" prompt. 

Once you have this prompt, press "?" to get the boot help menu.  Press "p" to print the current boot parameters (the same list of things that you see with the bootChange command when you telnet in).  Press "c" to go line-by-line through the parameters with the option to change parameters.  I discovered that you can just type what you want the parameter to be next to the old value, and that will change the value.  (ex.  "host name   : linux1   chiara"   will change the host name from the old value of linux1 to the new value that you just typed of chiara). 

After changing the appropriate parameters (as with all the other slow computers, just the [host name] and the [host inet] parameters needed changing), key the crate one more time and let it boot.  It should boot successfully, and when it has finished and given you the name for the prompt (ex. c1auxex), you can just pull out the RJ45 end of the cable from the computer, and move on to the next one.

 

 Koji, Jenne and Steve

 

Preparation to reboot:

1, closed VA6, V5 disconnected cable to valves ( closed all annuloses )

2, closed V1, disconnected it and stopped Maglev rotation

3, closed V4, disconnected its cable

   See Atm1,  This set up is insured us so there can not be any accidental valve switching to vent the vacuum envelope if reboot-caos strikes.[moving=disconnected]

4, RESET c1Vac1 and c1Vac2 one by one and together. They both went at once. We did NOT power recycled.

    Jenne entered the new "carma" words on  the old Dell laptop and checked the good answers. The reboot was done.

    Note: c1Vac1 green-RUN indicator LED is yellow. It is fine as yellow.

5, Checked and TOGGLED valve positions to be correct value ( We did not correct the the small turbo pumps monitor positions, but they  were alive )

6,  V4 was reconnected and opened. Maglev was started.

7,  V1 cable reconnected and opened at full rotation speed of 560 Hz

8,  V5 cable reconnected,  valve opened..............VA6 cable connected and opened........

9,   Vacuum Normal valve configuration was reached.

 

 Yesterday's  reboot was prepared as stated above with one difference. 

  c1Vac1 and c1Vac2 were DOWN before reset. The disconnected valves stayed closed (plus VC1). This saved us, so the main volume was not vented.

  All others OPENED. The PR1 and PR2 roughing pumps turned ON. The ion pumps' gate valve opened too. The ion pumps did not matter either, because they were pumped down recently.

  We'll have to rewrite the vacuum reboot procedure.

Attachment 1: c1vac1resetReboot.png
Attachment 2: c1Vac1&2down.jpg
  10600   Mon Oct 13 16:08:49 2014   Jenne   Update   CDS   Frame builder is mad

I think the daqd process isn't running on the frame builder. 

Daqd_problem_maybe.png

I tried telnetting to fb's port 8087 (telnet fb 8087) and typing "shutdown", but so far that is hanging and hasn't returned a prompt to me in the last few minutes.  Also, if I do a "ps -ef | grep daqd" in another terminal, it hangs.

I wasn't sure if this was an ntp problem (although that has been indicated in the past by 1 red block, not 2 red blocks and a white one), so I did "sudo /etc/init.d/ntp-client restart", but that didn't make any change.  I also did an mxstream restart just in case, but that didn't help either. 

I can ssh to the frame builder, but I can't do another telnet (the first one is still hung).  I get an error "telnet: Unable to connect to remote host: Invalid argument"

Thoughts and suggestions are welcome!

  10601   Mon Oct 13 16:57:26 2014   Koji   Update   CDS   Frame builder is mad

CPU load seems extremely high. You need to reboot it, I think.

controls@fb /proc 0$ cat loadavg
36.85 30.52 22.66 1/163 19295

  10602   Mon Oct 13 17:09:38 2014   ericq   Update   CDS   Frame builder is mad

 

This CPU load may have been me deleting some old frame files, to see if that would allow daqd to come back to life. 

Daqd was segfaulting, and behaving in a manner similar to what is described here: (stack exchange link). However, I couldn't kill or revive daqd, so I rebooted the FB. 

Things seem ok for now...

 

  10616   Thu Oct 16 03:18:48 2014   Jenne   Update   CDS   Daqd segfaulting again

 The daqd process on the frame builder looks like it is segfaulting again.  It restarts itself every few minutes.  

The symptoms remind me of elog 9530, but /frames is only 93% full, so the cause must be different.  

Did anyone do anything to the fb today?  If you did, please post an elog to help point us in a direction for diagnostics.

Q!!!!  Can you please help?  I looked at the log files, but they are kind of mysterious to me - I can't really tell the difference between a current (bad) log file and an old (presumably fine) log file.  (I looked at 3 or 4 random, old log files, and they're all different in some ways, so I don't know which errors and warnings are real, and which are to be ignored).

  10617   Thu Oct 16 12:22:43 2014   ericq   Update   CDS   Daqd segfaulting again

I've been trying to figure out why daqd keeps crashing, but nothing is fixed yet. 

I commented out the line in /etc/inittab that runs daqd automatically, so I could run it manually. Each time I run it (with ./daqd -c ./daqdrc while in c1/target/fb), it churns along fine for a little while, but eventually spits out something like:

[Thu Oct 16 12:07:23 2014] main profiler warning: 1 empty blocks in the buffer
[Thu Oct 16 12:07:24 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 12:07:25 2014] main profiler warning: 0 empty blocks in the buffer
GPS time jumped from 1097521658 to 1097521660
Segmentation fault
 
Or:
 
[Thu Oct 16 11:43:54 2014] main profiler warning: 1 empty blocks in the buffer
[Thu Oct 16 11:43:55 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:43:56 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:43:57 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:43:58 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:43:59 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:44:00 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:44:01 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:44:02 2014] main profiler warning: 0 empty blocks in the buffer
GPS time jumped from 1097520250 to 1097520257
FATAL: exception not rethrown
Aborted

I looked for time disagreements between the FB and the frontends, but they all seem fine. Running ntpdate only corrected things by 5ms. However, looking through /var/log/messages on FB, I found that ntp claims to have corrected the FB's time by ~111600 seconds (~31 hours) when I rebooted it on Monday.
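
A non-destructive way to keep checking this (ntpdate -q only queries and reports the offset, without stepping the clock):

ntpdate -q pool.ntp.org
grep ntpd /var/log/messages | tail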

Maybe this has something to do with the timing that the FB is getting? The FE IOPs seem happy with their sync status, but I'm not personally aware of how the FB timing is set up.


Addendum:

On Monday, Jamie suggested checking out the situation with FB's RAID. Searching the elog for "empty blocks in the buffer" also brought up posts that mentioned problems with the RAID. 

I went to the JetStor RAID web interface at http://192.168.113.119, and it reports everything as healthy; no major errors in the log. Looking at the SMART status of a few of the drives shows nothing out of the ordinary. The RAID is not mounted in read-only mode either, as was the problem mentioned in previous elogs. 

  10623   Fri Oct 17 15:17:31 2014   jamie   Update   CDS   Daqd "fixed"?

I very tentatively declare that this particular daqd crapfest is "resolved" after Jenne rebooted fb and daqd has been running for about 40 minutes now without crapping itself.  Wee hoo.

I spent a while yesterday trying to figure out what could have been going on.  I couldn't find anything.  I found an elog that said a previous daqd crapfest was finally only resolved by rebooting fb after a similar situation, i.e. there had been an issue that was resolved, daqd was still crapping itself, we couldn't figure out why so we just rebooted, daqd started working again.

So, in summary, totally unclear what the issue was, or why a reboot solved it, but there you go.

  10624   Fri Oct 17 16:54:11 2014   jamie   Update   CDS   Daqd "fixed"?

Quote:

I very tentatively declare that this particular daqd crapfest is "resolved" after Jenne rebooted fb and daqd has been running for about 40 minutes now without crapping itself.  Wee hoo.

I spent a while yesterday trying to figure out what could have been going on.  I couldn't find anything.  I found an elog that said a previous daqd crapfest was finally only resolved by rebooting fb after a similar situation, i.e. there had been an issue that was resolved, daqd was still crapping itself, we couldn't figure out why so we just rebooted, daqd started working again.

So, in summary, totally unclear what the issue was, or why a reboot solved it, but there you go.

Looks like I spoke too soon.  daqd seems to be crapping itself again:

controls@fb /opt/rtcds/caltech/c1/target/fb 0$ ls -ltr logs/old/ | tail -n 20
-rw-r--r-- 1 4294967294 4294967294    11244 Oct 17 11:34 daqd.log.1413570846
-rw-r--r-- 1 4294967294 4294967294    11086 Oct 17 11:36 daqd.log.1413570988
-rw-r--r-- 1 4294967294 4294967294    11244 Oct 17 11:38 daqd.log.1413571087
-rw-r--r-- 1 4294967294 4294967294    13377 Oct 17 11:43 daqd.log.1413571386
-rw-r--r-- 1 4294967294 4294967294    11481 Oct 17 11:45 daqd.log.1413571519
-rw-r--r-- 1 4294967294 4294967294    11985 Oct 17 11:47 daqd.log.1413571655
-rw-r--r-- 1 4294967294 4294967294    13219 Oct 17 13:00 daqd.log.1413576037
-rw-r--r-- 1 4294967294 4294967294    11150 Oct 17 14:00 daqd.log.1413579614
-rw-r--r-- 1 4294967294 4294967294     5127 Oct 17 14:07 daqd.log.1413580231
-rw-r--r-- 1 4294967294 4294967294    11165 Oct 17 14:13 daqd.log.1413580397
-rw-r--r-- 1 4294967294 4294967294     5440 Oct 17 14:20 daqd.log.1413580845
-rw-r--r-- 1 4294967294 4294967294    11352 Oct 17 14:25 daqd.log.1413581103
-rw-r--r-- 1 4294967294 4294967294    11359 Oct 17 14:28 daqd.log.1413581311
-rw-r--r-- 1 4294967294 4294967294    11195 Oct 17 14:31 daqd.log.1413581470
-rw-r--r-- 1 4294967294 4294967294    10852 Oct 17 15:45 daqd.log.1413585932
-rw-r--r-- 1 4294967294 4294967294    12696 Oct 17 16:00 daqd.log.1413586831
-rw-r--r-- 1 4294967294 4294967294    11086 Oct 17 16:02 daqd.log.1413586924
-rw-r--r-- 1 4294967294 4294967294    11165 Oct 17 16:05 daqd.log.1413587101
-rw-r--r-- 1 4294967294 4294967294    11086 Oct 17 16:21 daqd.log.1413588108
-rw-r--r-- 1 4294967294 4294967294    11097 Oct 17 16:25 daqd.log.1413588301
controls@fb /opt/rtcds/caltech/c1/target/fb 0$

The times all indicate when the daqd log was rotated, which happens every time the process restarts.  It doesn't seem to be happening so consistently, though.  It's been 30 minutes since the last one.  I wonder if it's somehow correlated with actual interaction with the NDS process.  Does some sort of data request cause it to crash?

 

  10633   Thu Oct 23 01:39:34 2014   Jenne   Update   CDS   Daqd "fixed"?

Merging of threads. 

ChrisW figured out that it looks like the problem with the frame builder is that it's having to wait for disk access.  He has tweaked some things, and life has been soooo much better for Q and me this evening!  See Chris' elog at elog 10632.

In the last few hours we've had 2 or maybe 3 times that I've had to reconnect Dataviewer to the framebuilder, which is a significant improvement over having to do it every few minutes.

Also, Rossa is having trouble with DTT today, starting sometime around dinnertime.  Ottavia and Pianosa can do DTT things, but Rossa keeps getting "test timed out". 

  10642   Mon Oct 27 22:19:17 2014   Jenne   Update   CDS   c1iscex restarted

I'm not sure why, but c1iscex did not want to do an mxstream restart.  It would complain at me that  "* ERROR:  mx_stream is already stopping."
Koji suggested that I reboot the machine, so I did.  I turned off the ETMX watchdog, and then did a remote reboot.  Everything came back nicely, and the mx_stream process seems to be running.

  10715   Fri Nov 14 11:07:59 2014   manasa   Update   CDS   Not able to run models on FE machines

Quote:

PRM, SRM and the ENDs are kicking up.  Computers are down.  PMC slider is stuck at low voltage.

Still not able to resolve the issue.

Except for c1lsc, the models are not running on any of the FE machines. I can ssh into all the machines, but could not restart the models on the FEs using the usual rtcds restart <modelname>.

Something happened around 4AM (inferring from the Striptool on the wall) and the models have not been running since then.

 

Attachment 1: CDSissue.png
  10717   Fri Nov 14 15:45:34 2014   Jenne   Update   CDS   Computers back up after reboot

[Jenne, Q]

Everything seems to be back up and running.

The computers weren't such a big problem (or at least didn't seem to be).  I turned off the watchdogs, and remotely rebooted all of the computers (except for c1lsc, which Manasa already had gotten working).  After this, I also ssh-ed to c1lsc and restarted all of the models, since half of them froze or something while the other computers were being power cycled. 

However, this power cycling somehow completely screwed up the vertex suspensions.  The MC suspensions were fine, and SRM was fine, but the ITMs, BS and PRM were not damping.  To get them to kind of damp rather than ring up, we had to flip the signs on the pos and pit gains.  Also, we were a little suspicious of potential channel-hopping, since touching one optic was occasionally time-coincident with another optic ringing up.  So, no hard evidence on the channel hopping, but suspicions.

Anyhow, at some point I was concerned about the suspension slow computer, since the watchdogs weren't tripping even though the OSEM sensor RMSes were well over the thresholds, so I keyed that crate.  After this, the watchdogs tripped as expected when we enabled damping while the RMS was higher than the threshold.

I eventually remotely rebooted c1sus again.  This totally fixed everything.  We put all of the local damping gains back to the values at which we found them (in particular, undoing our sign flips), and everything seems good again.  I don't know what happened, but we're back online now.

Q notes that the bounce mode for at least ITMX (we haven't checked the others) is rung up.  We should check in a few hours whether it is starting to go down.

Also, the FSS slow servo was not running; we restarted it on op340m.

  10756   Thu Dec 4 23:45:30 2014 JenneUpdateCDSFrozen?

[Jenne, Q, Diego]

I don't know why, but everything in EPICS-land froze for a few minutes just now.  I saw it happen yesterday too, but I was bad and didn't elog it.

Anyhow, the arms stayed locked (on IR) for the whole time it was frozen, so the fast things must have still been working.  We didn't see anything funny going on on the frame builder, although that shouldn't have much to do with the EPICS service.  The seismic rainbow on the wall went to zeros during the freeze, although the MC and PSL strip charts were still fine.

After a few minutes, while we were still trying to think of things to check, things went back to normal.  We're going to just keep locking for now....
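Next time, a cheap way to catch a freeze in the act, assuming the standard EPICS command-line tools are on the path (the channel name is just an example):

    # log a timestamp and how long a single EPICS read takes, every 10 s;
    # during a freeze the caget should hang or time out
    while true; do
        date
        time caget C1:LSC-TRY_OUTPUT
        sleep 10
    done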

  10769   Tue Dec 9 03:41:06 2014 JenneUpdateCDSEPICS running slow - network issue?

[Jamie, EricQ, Jenne, Diego]

This is something that we discussed late Friday afternoon, but none of us remembered to elog it.

We have been noticing that EPICS seems to run pretty slowly, and in fact it froze twice last week for ~2 minutes (elog 10756).

On Friday, we plotted several traces in StripTool, such as C1:SUS-ETMY_QPD_SUM_OUTPUT, C1:SUS-ETMY_TRY_OUTPUT and C1:LSC-TRY_OUTPUT, to see if channels carrying (supposedly) the same signal were seeing the same sample-holding.  They were.  The issue seems to be pretty widespread, over all of the fast front ends.  However, EPICS channels provided by the old slow machines do not seem to have this issue.
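camonitor makes the same comparison more quantitative, since it timestamps every value update.  A sketch, where the second channel name is only a guess at a slow-machine channel:

    # fast front-end EPICS channel: updates should arrive continuously
    camonitor C1:LSC-TRY_OUTPUT
    # in another terminal, a channel served by one of the old slow machines,
    # for comparison:
    camonitor C1:PSL-FSS_SLOWDC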

So, Jamie posits that perhaps we have a network switch somewhere that is connected to all of the fast front end computers, but not the old slow machines, and that this switch is starting to fail. 

My understanding is that the boys are on top of this, and are going to figure it out and fix it.

  10821   Fri Dec 19 18:08:46 2014 JenneUpdateCDSSOS!!! HELP!! EPICS freeze 45min+ so far!

[Jenne, Diego]

The EPICS freeze that we had noticed a few weeks ago (and several times since) has happened again, but this time it has not come back on its own.  It has been down for almost an hour so far. 

 So far, we have reset the Martian network's switch that is in the rack by the printer.  We have also power cycled the NAT router.  We have moved the NAT router from the old GC network switch to the new faster switch, and reset the Martian network's switch again after that.

We have reset the network switch that is in 1X6.

We have reset what we think is the DAQ network switch at the very top of 1X7.

So far, nothing is working.  EPICS is still frozen, we can't ping any computers from the control room, and new terminal windows won't give you a prompt (so perhaps we aren't able to mount the NFS share, which is required for the bashrc).

We need help please!
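For the next person: a few stock commands that help tell "NFS server unreachable" apart from "network completely dead" (server name and mount point as described above):

    ping -c 3 chiara      # basic reachability of the NFS server
    showmount -e chiara   # ask the server what it is exporting
    mount | grep nfs      # confirm what we think is NFS-mounted
    df -h /opt/rtcds      # this will hang if the server has gone away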

  10822   Fri Dec 19 19:21:04 2014 diegoUpdateCDSSOS!!! HELP!! EPICS freeze 45min+ so far!

Quote:

[Jenne, Diego]

The EPICS freeze that we had noticed a few weeks ago (and several times since) has happened again, but this time it has not come back on its own.  It has been down for almost an hour so far. 

 So far, we have reset the Martian network's switch that is in the rack by the printer.  We have also power cycled the NAT router.  We have moved the NAT router from the old GC network switch to the new faster switch, and reset the Martian network's switch again after that.

We have reset the network switch that is in 1X6.

We have reset what we think is the DAQ network switch at the very top of 1X7.

So far, nothing is working.  EPICS is still frozen, we can't ping any computers from the control room, and new terminal windows won't give you the prompt (so perhaps we aren't able to mount the nfs, which is required for the bashrc).

We need help please!

[EricQ]

 

EricQ suggested it may be an NFS-related issue: if something, maybe one of the control room computers, is making too many requests of chiara, then all the other machines accessing chiara will slow down, and this could escalate and lead to the Big Bad Freeze.  As a matter of fact, chiara's dmesg showed its eth0 interface being brought up constantly, as if something were repeatedly taking it down.  Anyhow, after shutting down all the computers in the control room, we rebooted chiara, megatron and the fb.
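A sketch of the kinds of checks involved, run on chiara itself (standard Linux tools; the exact log wording varies by driver):

    dmesg | grep -i eth0 | tail -n 20   # is the link still bouncing?
    nfsstat -s                          # server-side NFS op counters; run
                                        # twice and compare to gauge load
    netstat -tn | grep ':2049'          # which clients hold NFS connections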

 

[Diego]

Then I rebooted pianosa, and most of the issues seem gone so far; I had to "mxstream restart" all the front ends from medm, and every one of them but c1scy seems to behave properly. I will now bring the other machines back to life and see what happens next.

  10823   Fri Dec 19 20:32:11 2014 diegoUpdateCDSSOS!!! HELP!! EPICS freeze 45min+ so far!

[Diego, Jenne]

 

Everything seems reasonably back to normal.

Notes:

  • the machines in the control room have been rebooted;
  • the c1iscey frontend now behaves;
  • I saw on nodus, which remained up and running the whole time, a bunch of   nfs: server chiara is not responding, timed out  messages from around the time of the freeze; it may be that the sync option on the NFS share is too resource-demanding (see the sketch after this list), or there may be some other network issue;
  • the FSS was doing strange stuff and the MC couldn't recover its lock; the MCautolocker script wasn't running because of the MC lock loss and the lack of communication between the machines, so we did a sudo initctl start MCautolocker on megatron and recovered the MC too.
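If the sync hypothesis gets tested, the knob lives in /etc/exports on chiara.  A minimal sketch with an illustrative export path and client range; note that async lets the server acknowledge writes before they hit disk, so it trades data safety for speed and is not obviously the right fix:

    # /etc/exports on the NFS server (illustrative line)
    /home/cds  192.168.113.0/24(rw,async,no_subtree_check)
    # reload the export table after editing:
    sudo exportfs -ra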
  10835   Tue Dec 23 14:27:16 2014 ericqUpdateCDSChiara moved to UPS

 Steve and I switched chiara over to the UPS we bought for it, after ensuring the vacuum system was in a safe state. Everything went without a hitch. 

Also, Diego and I have been working on getting some of the new computers up and running.  Zita (the StripTool-projecting machine) has been replaced.  One ThinkPad laptop is missing an HD and battery, but the other one is fine.  Diego has been working on a Dell laptop, too.  I was having problems editing the MAC address rules on the martian wifi router, but the working ThinkPad's MAC was already listed. 

  10836   Tue Dec 23 14:30:11 2014 ericqUpdateCDSChiara moved to UPS

Quote:

 Steve and I switched chiara over to the UPS we bought for it, after ensuring the vacuum system was in a safe state. Everything went without a hitch. 

Also, Diego and I have been working on getting some of the new computers up and running. Zita (the striptool projecting machine) has been replaced. One think pad laptop is missing an HD and battery, but the other one is fine. Diego has been working on a dell laptop, too. I was having problems editing the MAC address rules on the martian wifi router, but the working thinkpad's MAC was already listed. 

 Turns out that, as the martian wifi router is quite old, it doesn't like Chrome; Firefox worked like a charm, and now giada (the Dell laptop) is also on 40MARS.

  10853   Mon Jan 5 18:15:18 2015 JenneUpdateCDSiscex reboot

Rana noted last week that TRX's value was stuck, not getting from iscex to the LSC.  I tried restarting the individual models scx, lsc and even scy (since scy had an extra red RFM light), to no avail.  I then did sudo shutdown -r now on iscex, and when it came back, the problem was gone.  Also, I then did a diag reset, which cleared all of the unusual red RFM lights.

Things seem fine now, ready to lock all the things.

  10854   Mon Jan 5 20:17:26 2015 jamieConfigurationCDSGDS upgraded to 2.16.14

I upgraded the GDS and ROOT installations in /ligo/apps/ubuntu12 on the control room workstations:

  • GDS 2.16.14
  • ROOT 5.34.18 (dependency of GDS)

My cursory tests indicate that they are working:

2015-01-05-200025_1694x996_scrot.png

Now that the control room environment has become somewhat uniform at Ubuntu 12, I modified the /ligo/cdscfg/workstationrc.sh file to source the ubuntu12 configuration:

controls@nodus|apps > cat /ligo/cdscfg/workstationrc.sh
# CDS WORKSTATION ENVIRONMENT
source /ligo/apps/ligoapps-user-env.sh
source /ligo/apps/ubuntu12/ligoapps-user-env.sh
source /opt/rtcds/rtcds-user-env.sh
controls@nodus|apps > 

This should make all the newer versions available everywhere on login.
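A quick sanity check that a fresh login really picks the new versions up (assuming root-config and diaggui end up on the resulting path):

    source /ligo/cdscfg/workstationrc.sh
    root-config --version   # should report 5.34/18
    which diaggui           # should resolve inside /ligo/apps/ubuntu12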

  10859   Tue Jan 6 17:41:20 2015 JenneConfigurationCDSDTT doesn't do envelopes??

[Jenne, Diego]

We are working on trying out the UGF servos, and wanted to take loop measurements with and without the servo to prove that it is working as expected.  However, it seems that the new DTT is not following the envelopes we give it. 

If we uncheck the "user" box, then DTT uses the amplitude given on the excitation tab.  But if we check "user" and select envelope, the excitation amplitude is always whatever number appears first in the envelope file.  If we change the first amplitude in the envelope, DTT uses that number as the new amplitude, so it is reading the file, but it is not stepping through the whole envelope correctly.

Thoughts?  Is this a bug in new DTT, or a PEBKAC issue?

  10890   Tue Jan 13 03:47:46 2015 ericqUpdateCDSInstalled new ethernet driver on Chiara

Chiara threw another network hissy fit.  Dmesg was being spammed with rapidly repeating messages like eth0: link up. 

Some googling indicated that this symptom, with the particular ethernet board and driver chiara was using, could be fixed by installing the appropriate driver from the manufacturer.

In essence, I followed steps 1-7 from here: http://ubuntuforums.org/showthread.php?t=1661489
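For future reference, those steps boil down to building the manufacturer's out-of-tree module and blacklisting the in-kernel one.  A sketch with illustrative module names (r8169 in-kernel, r8168 from the vendor), since I haven't recorded the exact chip here:

    # in the unpacked vendor driver source tree:
    make
    sudo make install
    # keep the old in-kernel module from grabbing the card:
    echo "blacklist r8169" | sudo tee /etc/modprobe.d/blacklist-r8169.conf
    sudo depmod -a
    sudo update-initramfs -u
    sudo modprobe r8168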

So far, so good. We'll keep an eye out to see how it works...
