40m QIL Cryo_Lab CTN SUS_Lab CAML OMC_Lab CRIME_Lab FEA ENG_Labs OptContFac Mariner WBEEShop
  40m Log, Page 99 of 357  Not logged in ELOG logo
ID Date Author Type Categoryup Subject
  8036   Fri Feb 8 12:43:26 2013 yutaUpdateComputersvideocapture.py now supports movie capturing

I updated /opt/rtcds/caltech/c1/scripts/general/videoscripts.py so that it supports movie capturing. It saves captured images (bmp) and movies (mp4) in /users/sensoray/SensorayCaptures/ directory.
I also updated /opt/rtcds/caltech/c1/scripts/pylibs/pyndslib.py because /usr/bin/lalapps_tconvert is not working and now /usr/bin/tconvert works.
However, tconvert doesn't run on ottavia, so I need Jamie to fix it.

videocapture.py -h:
Usage:
    videocapture.py [cameraname] [options]

Example usage:
    videocapture.py MC2F -s 320x240 -t off
       (Camptures image of MC2F with the size of 320x240, without timestamp on the image. MUST RUN ON PIANOSA!)
    videocapture.py AS -m 10
       (Camptures 10 sec movie of AS with the size of 720x480. MUST RUN ON PIANOSA!)


Options:
  -h, --help          show this help message and exit
  -s SIZE             specify image size [default: 720x480]
  -t TIMESTAMP_ONOFF  timestamp on or off [default: on]
  -m MOVLENGTH        specity movie length (in sec; takes movie if specified) [default: 0]

  8062   Mon Feb 11 18:44:34 2013 JamieUpdateComputerspasswerdz changed

Quote:

Be Prepared

http://xkcd.com/936/

Password for nodus and all control room workstations has been changed.  Look for new one in usual place.

We will try to change the password on all the RTS machines soon.  For the moment, though, they remain with the old passwerd.

  8088   Fri Feb 15 15:21:07 2013 JamieUpdateComputersc1iscex IO-chassis dead

I appears that the c1iscex IO-chassis is either dead or in a very bad state.  The PCIe interface card in the IO-chassis is showing four red lights, where it's supposed to be showing a dozen or so green lights.  Obviously this is going to prevent anything from running.

We've had power issues with this chassis before, so possibly that's what we're running into now.  I'll pull the chassis and diagnose asap.

 

  8140   Fri Feb 22 20:28:17 2013 JamieUpdateComputerslinux1 dead, then undead

At around 2:30pm today something brought down most of the martian network.  All control room workstations, nodus, etc. were unresponsive.  After poking around for a bit I finally figured it had to be linux1, which serves the NFS filesystem for all the important CDS stuff.  linux1 was indeed completely unresponsive.

Looking closer I noticed that the Fibrenetix FX-606-U4 SCSI hardware RAID device connected to linux1 (see #1901), which holds cds network filesystem, was showing "IDE Channel #4 Error Reading" on it's little LCD display.  I assumed this was the cause of the linux1 crash.

I hard shutdown linux1, and powered off the Fibrenetix device.  I pulled the disk from slot 4 and replaced it with one of the spares we had in the control room cabinets.  I powered the device back up and it beeped for a while.  Unfortunately the device was requiring a password to access it from the front panel, and I could find no manual for the device in the lab, nor does the manufacturer offer the manual on it's web site.

Eventually I was able to get linux1 fully rebooted (after some fscks) and it seemed to mount the hardware RAID (as /dev/sdc1) fine.  The brought the NFS back.  I had to reboot nodus to get it recovered, but all the control room and front-end linux machines seemed to recover on their own (although the front-ends did need an mxstream restart).

The remaining problem is that the linux1 hardware RAID device is still currently unaccessible, and it's not clear to me that it's actually synced the new disk that I put in it.  In other words I have very little confidence that we actually have an operational RAID for /opt/rtcds.  I've contacted the LDAS guys (ie. Dan Kozak) who are managing the 40m backup to confirm that the backup is legit.  In the mean time I'm going to spec out some replacement disks onto which to copy /opt/rtcds, and also so that we can get rid of this old SCSI RAID thing.

Attachment 1: FX-606-U4_1205.pdf
FX-606-U4_1205.pdf
  8141   Sat Feb 23 00:34:28 2013 yutaUpdateComputerscrontab in op340m deleted and restored (maybe)

I accidentally overwrote crontab in op340m with an empty file.
By checking /var/cron in op340m, I think I restored it.

But somehow, autolockMCmain40m does not work in cron job, so it is currently running by nohup.

What I did:
  1. I ssh-ed op340m to edit crontab to change MC autolocker to usual power mode. I used "crontab -e", but it did not show anything. I exited emacs and op340m.
  2. Rana found that the file size of crontab went 0 when I did "crontab -e".
  3. I found my elog #6899 and added one line to crontab

55 * * * *  /opt/rtcds/caltech/c1/scripts/general/scripto_cron /opt/rtcds/caltech/c1/scripts/MC/autolockMCmain40m >/cvs/cds/caltech/logs/scripts/mclock.cronlog 2>&1

  4. It didn't run correctly, so Rana used his hidden power "nohup" to run autolockMCmain40m in background.
  5. Koji's hidden magic "/var/cron/log" gave me inspiration about what was in crontab. So, I made a new crontab in op340m like this;

34 * * * *  /opt/rtcds/caltech/c1/scripts/general/scripto_cron /opt/rtcds/caltech/c1/scripts/MC/autolockMCmain40m >/cvs/cds/caltech/logs/scripts/mclock.cronlog 2>&1
55 * * * * /opt/rtcds/caltech/c1/scripts/general/scripto_cron /opt/rtcds/caltech/c1/scripts/PSL/FSS/RCthermalPID.pl >/cvs/cds/caltech/logs/scripts/RCthermalPID.cronlog 2>&1
07 * * * * /opt/rtcds/caltech/c1/scripts/general/scripto_cron /opt/rtcds/caltech/c1/scripts/PSL/FSS/FSSSlowServo >/cvs/cds/caltech/logs/scripts/FSSslow.cronlog 2>&1
00 * * * * /opt/rtcds/caltech/c1/burt/autoburt/burt.cron >> /opt/rtcds/caltech/c1/burt/burtcron.log
13 * * * * /cvs/cds/caltech/conlog/bin/check_conlogger_and_restart_if_dead
14,44 * * * * /opt/rtcds/caltech/c1/scripts/SUS/rampdown.pl > /dev/null 2>&1


  6. It looks like some of them started running, but I haven't checked if they are working or not. We need to look into them.

Moral of the story:
  crontab needs backup.

  8144   Sat Feb 23 14:04:07 2013 KojiUpdateComputersapache retarted (Re: linux1 dead, then undead)

apache has been restarted.
How to: search "apache" on the 40m wiki

Quote:

I had to reboot nodus to get it recovered

 

  8146   Sat Feb 23 15:26:26 2013 yutaUpdateComputerscrontab in op340m updated

I found some daily cron jobs for op340m I missed last night. Also, I edited timings of hourly jobs to maintain consistency with the past. Some of them looks old, but I will leave as it is for now.
At least, burt, FSSSlowServo and autolockMCmain40m seems like they are working now.
If you notice something is missing, please add it to crontab.

07 * * * * /opt/rtcds/caltech/c1/burt/autoburt/burt.cron >> /opt/rtcds/caltech/c1/burt/burtcron.log
13 * * * * /opt/rtcds/caltech/c1/scripts/general/scripto_cron /opt/rtcds/caltech/c1/scripts/PSL/FSS/FSSSlowServo >/cvs/cds/caltech/logs/scripts/FSSslow.cronlog 2>&1
14,44 * * * * /cvs/cds/caltech/conlog/bin/check_conlogger_and_restart_if_dead
15,45 * * * * /opt/rtcds/caltech/c1/scripts/SUS/rampdown.pl > /dev/null 2>&1
55 * * * *  /opt/rtcds/caltech/c1/scripts/general/scripto_cron /opt/rtcds/caltech/c1/scripts/MC/autolockMCmain40m >/cvs/cds/caltech/logs/scripts/mclock.cronlog 2>&1
59 * * * * /opt/rtcds/caltech/c1/scripts/general/scripto_cron /opt/rtcds/caltech/c1/scripts/PSL/FSS/RCthermalPID.pl >/cvs/cds/caltech/logs/scripts/RCthermalPID.cronlog 2>&1

00 0 * * * /var/scripts/ntp.sh > /dev/null 2>&1
00 4 * * * /opt/rtcds/caltech/c1/scripts/RGA/RGAlogger.cron >> /cvs/cds/caltech/users/rward/RGA/RGAcron.out 2>&1
00 6 * * * /cvs/cds/scripts/backupScripts.pl
00 7 * * * /opt/rtcds/caltech/c1/scripts/AutoUpdate/update_conlog.cron

  8147   Sat Feb 23 15:46:16 2013 ranaUpdateComputerscrontab in op340m updated

According to Google, you can add a line in the crontab to backup the crontab by having the cronback.py script be in the scripts/ directory. It needs to save multiple copies, or else when someone makes the file size zero it will just write a zero size file onto the old backup.

  8181   Wed Feb 27 11:22:54 2013 yutaUpdateComputersbackup crontab

I made a simple script to backup crontab (/opt/rtcds/caltech/c1/scripts/crontab/backupCrontab).

#!/bin/bash

crontab -l > /opt/rtcds/caltech/c1/scripts/crontab/crontab_$(hostname).$(date '+%Y%m%d%H%M%S')


I put this script into op340m crontab.

00 8 * * * /opt/rtcds/caltech/c1/scripts/crontab/backupCrontab

It took me 30 minutes to write and check this one line script. I hate shell scripts.

  8266   Mon Mar 11 10:20:36 2013 Max HortonSummaryComputersAttempted Smart UPS 2200 Battery Replacement

Attempted Battery Replacement on Backup Power Supply in the Control Room:

I tried to replace the batteries in the Smart UPS 2200 with new batteries purchased by Steve.  However, the power port wasn't compatible with the batteries.  The battery cable's plug was too tall to fit properly into the Smart UPS port.  New batteries must be acquired.  Steve has pictures of the original battery (gray) and the new battery (blue) plugs, which look quite different (even though the company said the battery would fit).

The Correct battery connector is GRAY : APC RBC55

Attachment 1: upsB.jpg
upsB.jpg
Attachment 2: upsBa.jpg
upsBa.jpg
  8274   Tue Mar 12 00:35:56 2013 JenneUpdateComputersFB's RAID is beeping

[Manasa, Jenne]

Manasa just went inside to recenter the AS beam on the camera after our Yarm spot centering exercises of the evening, and heard a loud beeping.  We determined that it is the RAID attached to the framebuilder, which holds all of our frame data that is beeping incessantly.  The top center power switch on the back (there are FOUR power switches, and 3 power cables, btw.  That's a lot) had a red light next to it, so I power cycled the box.  After the box came back up, it started beeping again, with the same front panel message:

H/W monitor power #1 failed.

Right now the fb is trying to stay connected to things, and we can kind of use dataviewer, but we lose our connection to the framebuilder every ~30 seconds or so.  This rough timing estimate comes from how often we see the fb-related lights on the frontend status screen cycle from green to white to red back to green (or, how long do the lights stay green before going white again).  We weren't having trouble before the RAID went down a few minutes ago, so I'm hopeful that once that's fixed, the fb will be fine. 

In other news, just to make Jamie's day a little bit better, Dataviewer does not open on Pianosa or Rosalba.  The window opens, but it stays a blank grey box.  This has been going on for Pianosa for a few days, but it's new (to me at least) on Rosalba.  This is different from the lack of ability to connect to the fb that Rossa and Ottavia are seeing.

  8278   Tue Mar 12 12:06:22 2013 JamieUpdateComputersFB recovered, RAID power supply #1 dead

The framebuilder RAID is back online.  The disk had been mounted read-only (see below) so daqd couldn't write frames, which was in turn causing it to segfault immediately, so it was constantly restarting.

The jetstor RAID unit itself has a dead power supply.  This is not fatal, since it has three.  It has three so it can continue to function if one fails.  I have removed the bad supply and gave it to Steve so he can get a suitable replacement.

Some recovery had to be done on fb to get everything back up and running again.  I ran into issues trying to do it on the fly, so I eventually just rebooted.  It seemed to come back ok, except for something going on with daqd.  It was reporting the following error upon restart:

[Tue Mar 12 11:43:54 2013] main profiler warning: 0 empty blocks in the buffer

It was spitting out this message about once a second, until eventually the daqd died.  When it restarted it seemed to come back up fine.  I'm not exactly clear what those messages were about, but I think it has something to do with not being able to dump it's data buffers to disk.  I'm guessing that this was a residual problem from the umounted /frames, which somehow cleared on it's own.  Everything seems to be ok now.

Quote:

Manasa just went inside to recenter the AS beam on the camera after our Yarm spot centering exercises of the evening, and heard a loud beeping.  We determined that it is the RAID attached to the framebuilder, which holds all of our frame data that is beeping incessantly.  The top center power switch on the back (there are FOUR power switches, and 3 power cables, btw.  That's a lot) had a red light next to it, so I power cycled the box.  After the box came back up, it started beeping again, with the same front panel message:

H/W monitor power #1 failed.

DO NOT DO THIS.  This is what caused all the problems.  The unit has three redundant power supplies, for just this reason.  It was probably continuing to function fine.  The beeping was just to tell you that there was something that needed attention.  Rebooting the device does nothing to solve the problem.  Rebooting in an attempt to silence beeping is not a solution.  Shutting of the RAID unit is basically the equivalent of ripping out a mounted external USB drive.  You can damage the filesystem that way.  The disk was still functioning properly.  As far as I understand it the only problem was the beeping, and there were no other issues.  After you hard rebooted the device, fb lost it's mounted disk and then went into emergency mode, which was to remount the disk read-only.  It didn't understand what was going on, only that the disk seemed to disappear and the reappear.  This was then what caused the problems.  It was not the beeping, it was the restarting the RAID that was mounted on fb.

Computers are not like regular pieces of hardware.  You can't just yank the power on them.  Worse yet is yanking the power on a device that is connected to a computer.  DON"T DO THIS UNLESS YOU KNOW WHAT YOU"RE DOING.  If the device is a disk drive, then doing this is a sure-fire way to damage data on disk.

 

  8280   Tue Mar 12 14:51:00 2013 SteveUpdateComputersbuy warranty or not ?

 Details of the warranties are posted on wiki power supply cost, warranty described, cost

.......I’ve also attached a warranty renewal quote.  A 1 year warranty renewal is usually $.... per year, but we gave you special pricing of $.... / year if you renew both units.  This pricing is also special due to the fact that both warranties expired awhile ago.  We usually require that the warranty renewal begin on the date of expiration, but we will waive this for you this time if both are renewed.

 

JetStor SATA 416S, SN: SB09040111A3 – expired 04/24/2012 (3 years old)

 

JetStor SATA 516F, SN: SB09080016P – expired on 08/21/2012........

 

. Are we keep it for an other 2 years? buy warranty or buy better storage.

 

  8324   Thu Mar 21 10:29:12 2013 ManasaUpdateComputersComputers down since last night

I'm trying to figure out what went wrong last night. But the morning status...the computers are down.

 

down.png

Attachment 1: down.png
down.png
  8325   Thu Mar 21 12:04:05 2013 ManasaUpdateComputersFixed

All FE computers are back.

Restart procedure:

0a. Restart frame builder: telnet fb 8087 & type shutdown

0b. Restart mx_stream from the FE overview screen

1. I ssh ed to the computer. (c1lsc, c1ioo, c1iscex, c1isey)

2. I used 'sudo shutdown -r (computername)'. They came back ON.

3. While rebooting c1ioo, c1sus shutdown (for reasons I don't know). I could not ping or ssh c1SUS after this.

4. I went in and switched c1SUS computer OFF and back ON after which I could ssh to it.

5. I did the same reboot procedure for c1SUS.

6. I had to restart some of the models individually.
    (i) ssh to the computer running the model
    (ii) rtcds restart 'model name'

7. All computers are back now.

up.png

  8326   Thu Mar 21 12:33:51 2013 ranaUpdateComputersFixed

Please stop power cycling computers. This is not an acceptable operation (as Jamie already wrote before). When you don't know what to do besides power cycling the computer, just stop and do something else or call someone who knows more. Every time you kill the power to a computer you are taking a chance on damaging it or corrupting some hard drive.

In this case, the right thing to do would be to hook up the external keyboard and monitor directly to c1sus to diagnose things.

NO MORE TOUCHING THE POWER BUTTON.

  8334   Mon Mar 25 09:52:22 2013 JenneUpdateComputersc1lsc mxstream won't restart

Most of the front ends' mx streams weren't running, so I did the old mxstreamrestart on all machines (see elog 6574....the dmesg on c1lsc right now, at the top, has similar messages).  Usually this mxstream restart works flawlessly, but today c1lsc isn't working.  Usually to the right side of the terminal window I get an [ok] when things work.  For the lsc machine today, I get [!!] instead. 

After having learned from recent lessons, I'm waiting to hear from Jamie.

  8335   Mon Mar 25 11:42:45 2013 JamieUpdateComputersc1lsc mx_stream ok

I'm not exactly sure what the problem was here, but I think it had to do with a stuck mx_stream process that wasn't being killed properly.  I manually killed the process and it seemed to come up fine after that.  The regular restart mechanisms should work now.

No idea what caused the process to hang in the first place, although I know the newer RCG (2.6) is supposed to address some of these mx_stream issues.

  8366   Thu Mar 28 10:44:30 2013 ManasaUpdateComputersc1lsc down

c1lsc was down this morning.

I restarted fb and c1lsc based on elog

Everything but c1oaf came back. I tried to restart c1oaf individually; but it didn't work.

Before:

cds_FE.png

After:

cds_fe1.png

  8367   Thu Mar 28 12:50:52 2013 JenneUpdateComputersc1lsc is fine

 Manasa told me that she did things in a different order than her old elog. 

She had

(1) ssh'ed to c1lsc and did a remote shutdown / restart,

(2) restarted fb,

(3) restarted the mxstream on c1lsc,

(4) restarted each model individually in some order that I forgot to ask.

However, with the situation as in her "before" screenshot, all that needed to be done was restart the mxstream process on c1lsc. 

Anyhow, when I looked at the OAF model, it was complaining of "no sync", so I restarted the model, and it came back up fine.  All is well again.

  8374   Fri Mar 29 17:24:43 2013 JamieUpdateComputersFB RAID power supply replaced

Steve ordered a replacement power supply for the FB JetStor power supply that failed a couple weeks ago.  I just installed it and it looks fine.

  8394   Tue Apr 2 20:52:35 2013 ranaUpdateComputersiMac bashed

 I changed the default shell on our control room iMac to bash. Since we're really, really using bash as the shell for LIGO, we might as well get used to it. As we do this for the workstations, some things will fail, but we can adopt Jamie's private .bashrc to get started and then fix it up later.

  8398   Wed Apr 3 01:32:04 2013 JenneUpdateComputersupdated EPICS database (channels selected for saving)

I modified /opt/rtcds/caltech/c1/chans/daq/C0EDCU.ini to include the C1:LSC-DegreeOfFreedom_TRIG_MON channels.  These are the same channel that cause the LSC screen trigger indicators to light up. 

I vaguely followed Koji's directions in elog 5991, although I didn't add new grecords, since these channels are already included in the .db file as a result of EpicsOut blocks in the simulink model.  So really, I only did Step 2.  I still need to restart the framebuilder, but locking (attempt at locking) is happening.

The idea here is that we should be able to search through this channel, and when we get a trigger, we can go back and plot useful signals (PDs, error signals, cotrol signals,....), and try to figure out why we're losing lock. 

Rana tells me that this is similar to an old LockAcq script that would run DTT and get data.

EDIT:  I restarted the daqd on the fb, and I now see the channel in dataviewer, but I can only get live data, no past data, even though it says that it is (16,float).  Here's what Dataviewer is telling me:

Connecting to NDS Server fb (TCP port 8088)
Connecting.... done
read(); errno=0
LONG: DataRead = -1
No data found

read(); errno=9
read(); errno=9
T0=13-03-29-08-59-43; Length=432010 (s)
No data output.

 

  8400   Wed Apr 3 14:45:34 2013 JamieUpdateComputersupdated EPICS database (channels selected for saving)

Quote:

I modified /opt/rtcds/caltech/c1/chans/daq/C0EDCU.ini to include the C1:LSC-DegreeOfFreedom_TRIG_MON channels.  These are the same channel that cause the LSC screen trigger indicators to light up. 

I vaguely followed Koji's directions in elog 5991, although I didn't add new grecords, since these channels are already included in the .db file as a result of EpicsOut blocks in the simulink model.  So really, I only did Step 2.  I still need to restart the framebuilder, but locking (attempt at locking) is happening.

The idea here is that we should be able to search through this channel, and when we get a trigger, we can go back and plot useful signals (PDs, error signals, cotrol signals,....), and try to figure out why we're losing lock. 

Rana tells me that this is similar to an old LockAcq script that would run DTT and get data.

EDIT:  I restarted the daqd on the fb, and I now see the channel in dataviewer, but I can only get live data, no past data, even though it says that it is (16,float).  Here's what Dataviewer is telling me:

Connecting to NDS Server fb (TCP port 8088)
Connecting.... done
read(); errno=0
LONG: DataRead = -1
No data found

read(); errno=9
read(); errno=9
T0=13-03-29-08-59-43; Length=432010 (s)
No data output.
 

I seem to be able to retrieve these channels ok from the past:

controls@pianosa:/opt/rtcds/caltech/c1/scripts 0$ tconvert 1049050000
Apr 03 2013 18:46:24 UTC
controls@pianosa:/opt/rtcds/caltech/c1/scripts 0$ ./general/getdata -s 1049050000 -d 10 --noplot C1:LSC-PRCL_TRIG_MON
Connecting to server fb:8088 ...
nds_logging_init: Entrynds_logging_init: Exit
fetching... 1049050000.0
Hit any key to exit: 
controls@pianosa:/opt/rtcds/caltech/c1/scripts 0$ 

Maybe DTT just needed to be reloaded/restarted?

  8444   Thu Apr 11 11:58:21 2013 JenneUpdateComputersLSC whitening c-code ready

The big hold-up with getting the LSC whitening triggering ready has been a problem with running the c-code on the front end models.  That problem has now been solved (Thanks Alex!), so I can move forward.

The background:

We want the RFPD whitening filters to be OFF while in acquisition mode, but after we lock, we want to turn the analog whitening (and the digital compensation) ON.  The difference between this and the other DoF and filter module triggers is that we must parse the input matrix to see which PD is being used for locking at that time.  It is the c-code that parses this matrix that has been causing trouble.  I have been testing this code on the c1tst.mdl, which runs on the Y-end computer.  Every time I tried to compile and run the c1tst model, the entire Y-end computer would crash.

The solution:

Alex came over to look at things with Jamie and me.  In the 2.5 version of the RCG (which we are still using), there is an optimization flag "-O3" in the make file.  This optimization, while it can make models run a little faster, has been known in the past to cause problems.  Here at the 40m, our make files had an if-statement, so that the c1pem model would compile using the "-O" optimization flag instead, so clearly we had seen the problem here before, probably when Masha was here and running the neural network code on the pem model.  In the RCG 2.6 release, all models are compiled using the "-O" flag.  We tried compiling the c1tst model with this "-O" optimization, and the model started and the computer is just fine.  This solved the problem.

Since we are going to upgrade to RCG 2.6 in the near-ish future anyway, Alex changed our make files so that all models will now compile with the "-O" flagWe should monitor other models when we recompile them, to make sure none of them start running long with the different optimization. 

The future:

Implement LSC whitening triggering!

  8479   Tue Apr 23 22:10:54 2013 ranaUpdateComputersNancy

controls@rosalba:/users/rana/docs 0$ svn resolve --accept working nancy
Resolved conflicted state of 'nancy'

  8529   Sat May 4 00:21:00 2013 ranaConfigurationComputersworkstation updates

 Koji and I went into "Update Manager" on several of the Ubuntu workstations and unselected the "check for updates" button. This is to prevent the machines from asking to be upgraded so frequently - I am concerned that someone might be tempted to upgrade the workstations to Ubuntu 12.

We didn't catch them all, so please take a moment to check that this is the case on all the laptops you are using and make it so. We can then apply the updates in a controlled manner once every few months.

  8540   Tue May 7 17:43:51 2013 JamieUpdateComputers40MARS wireless network problems

I'm not sure what's going on today but we're seeing ~80% packet loss on the 40MARS wireless network.  This is obviously causing big problems for all of our wirelessly connected machines.  The wired network seems to be fine.

I've tried power cycling the wireless router but it didn't seem to help.  Not sure what's going on, or how it got this way.  Investigating...

  8541   Tue May 7 18:16:37 2013 JamieUpdateComputers40MARS wireless network problems

Here's an example of the total horribleness of what's happening right now:

controls@rossa:~ 0$ ping 192.168.113.222
PING 192.168.113.222 (192.168.113.222) 56(84) bytes of data.
From 192.168.113.215 icmp_seq=2 Destination Host Unreachable
From 192.168.113.215 icmp_seq=3 Destination Host Unreachable
From 192.168.113.215 icmp_seq=4 Destination Host Unreachable
From 192.168.113.215 icmp_seq=5 Destination Host Unreachable
From 192.168.113.215 icmp_seq=6 Destination Host Unreachable
From 192.168.113.215 icmp_seq=7 Destination Host Unreachable
From 192.168.113.215 icmp_seq=9 Destination Host Unreachable
From 192.168.113.215 icmp_seq=10 Destination Host Unreachable
From 192.168.113.215 icmp_seq=11 Destination Host Unreachable
64 bytes from 192.168.113.222: icmp_seq=12 ttl=64 time=10341 ms
64 bytes from 192.168.113.222: icmp_seq=13 ttl=64 time=10335 ms
^C
--- 192.168.113.222 ping statistics ---
35 packets transmitted, 2 received, +9 errors, 94% packet loss, time 34021ms
rtt min/avg/max/mdev = 10335.309/10338.322/10341.336/4.406 ms, pipe 11
controls@rossa:~ 0$ 

Note that 10 SECOND round trip time and 94% packet loss.  That's just beyond stupid.  I have no idea what's going on.

  12721   Mon Jan 16 12:49:06 2017 ranaConfigurationComputersMegatron update

The "apt-get update" was failing on some machines because it couldn't find the 'Debian squeeze' repos, so I made some changes so that Megatron could be upgraded.

I think Jamie set this up for us a long time ago, but now the LSC has stopped supporting these versions of the software. We're running Ubuntu12 and 'squeeze' is meant to support Ubuntu10. Ubuntu12 (which is what LLO is running) corresponds to 'Debian-wheezy' and Ubuntu14 to 'Debian-Jessie' and Ubuntu16 to 'debian-stretch'.

We should consider upgrading a few of our workstations to Ubuntu 14 LTS to see how painful it is to run our scripts and DTT and DV. Better to upgrade a bit before we are forced to by circumstance.

I followed the instructions from software.ligo.org (https://wiki.ligo.org/DASWG/DebianWheezy) and put the recommended lines into the /etc/apt/sources.list.d/lsc-debian.list file.

but I still got 1 error (previously there were ~7 errors):

W: Failed to fetch http://software.ligo.org/lscsoft/debian/dists/wheezy/Release  Unable to find expected entry 'contrib/binary-i386/Packages' in Release file (Wrong sources.list entry or malformed file)

Restarting now to see if things work. If its OK, we ought to change our squeeze lines into wheezy for all workstations so that our LSC software can be upgraded.

  12724   Mon Jan 16 22:03:30 2017 jamieConfigurationComputersMegatron update
Quote:
 

We should consider upgrading a few of our workstations to Ubuntu 14 LTS to see how painful it is to run our scripts and DTT and DV. Better to upgrade a bit before we are forced to by circumstance.

I would recommend upgrading the workstations to one of the reference operating systems, either SL7 or Debian squeeze, since that's what the sites are moving towards.  If you do that you can just install all the control room software from the supported repos, and not worry about having to compile things from source anymore.

  12849   Thu Feb 23 15:48:43 2017 johannesUpdateComputersc1psl un-bootable

Using the PDA520 detector on the AS port I tried to get some better estimates for the round-trip loss in both arms. While setting up the measurement I noticed some strange output on the scope I'm using to measure the amount of reflected light.

The interferometer was aligned using the dither scripts for both arms. Then, ITMY was majorly misaligned in pitch AND yaw such that the PD reading did not change anymore. Thus, only light reflected from the XARM was incident of the AS PD. The scope was showing strange oscillations (Channel 2 is the AS PD signal):

For the measurement we compare the DC level of the reflection with the ETM aligned (and the arm locked) vs a misaligned ETM (only ITM reflection). This ringing could be observed in both states, and was qualitatively reproducible with the other arm. It did not show up in the MC or ARM transmission. I found that changing the pitch of the 'active' ITM (=of the arm under investigation) either way by just a couple of ticks made it go away and settle roughly at the lower bound of the oscillation:

In this configuration the PD output follows the mode cleaner transmission (Channel 3 in the screen caps) quite well, but we can't take the differential measurement like this, because it is impossible to align and lock the arm but them misalign the ITM. Moving the respective other ITM for potential secondary beams did not seem to have an obvious effect, although I do suspect a ghost/secondary beam to be the culprit for this. I moved the PDA520 on the optical table but didn't see a change in the ringing amplitude. I do need to check the PD reflection though.

Obviously it will be hard to determine the arm loss this way, but for now I used the averaging function of the scope to get rid of the ringing. What this gave me was:
(16 +/- 9) ppm losses in the x-arm and (-18+/-8) ppm losses in the y-arm

The negative loss obviously makes little sense, and even the x-arm number seems a little too low to be true. I strongly suspect the ringing is responsible and wanted to investigate this further today, but a problem with c1psl came up that shut down all work on this until it is fixed:

I found the PMC unlocked this morning and c1psl (amongst other slow machines) was unresponsive, so I power-cycled them. All except c1psl came back to normal operation. The PMC transmission, as recorded by c1psl,  shows that it has been down for several days:

Repeated attempts to reset and/or power-cycle it by Gautam and myself could not bring it back. The fail indicator LED of a single daughter card (the DOUT XVME-212) turns off after reboot, all others stay lit. The sysfail LED on the crate is also on, but according to elog 10015 this is 'normal'. I'm following up that post's elog tree to monitor the startup of c1psl through its system console via a serial connection to find out what is wrong.

  12850   Thu Feb 23 18:52:53 2017 ranaUpdateComputersc1psl un-bootable

The fringes seen on the oscope are mostly likely due to the interference from multiple light beams. If there are laser beams hitting mirrors which are moving, the resultant interference signal could be modulated at several Hertz, if, for example, one of the mirrors had its local damping disabled.

  12851   Thu Feb 23 19:44:48 2017 johannesUpdateComputersc1psl un-bootable

Yes, that was one of the things that I wanted to look into. One thing Gautam and I did that I didn't mention was to reconnect the SRM satellite box and move the optic around a bit, which didn't change anything. Once the c1psl problem is fixed we'll resume with that.

Quote:

The fringes seen on the oscope are mostly likely due to the interference from multiple light beams. If there are laser beams hitting mirrors which are moving, the resultant interference signal could be modulated at several Hertz, if, for example, one of the mirrors had its local damping disabled.

 

Speaking of which:

Using one of the grey RJ45 to D-Sub cables with an RS232 to USB adapter I was able to capture the startup log of c1psl (using the usb camera windows laptop). I also logged the startup of the "healthy" c1aux, both are attached. c1psl stalls at a point were c1aux starts testing for present vme modules and doesn't continue, however is not strictly hung up, as it still registers to the logger when external login attempts via telnet occur. The telnet client simply reports that the "shell is locked" and exits. It is possible that one of the daughter cards causes this. This seems to happen after iocInit is called by the startup script at /cvs/cds/caltech/target/c1psl/startup.cmd, as it never gets to the next item "coreRelease()". Gautam and I were trying to find out what happends inside iocInit, but it's not clear to us at this point from where it is even called. iocInit.c and compiled binaries exist in several places on the shared drive. However, all belong to R3.14.x epics releases, while the logfile states that the R3.12.2 epics core is used when iocInit is called.

Next we'll interrupt the autoboot procedure and try to work with the machine directly.

Attachment 1: slow_startup_logs.tar.gz
  12852   Fri Feb 24 20:38:01 2017 johannesUpdateComputersc1psl boot-stall culprit identified

[Gautam, Johannes]

c1psl finally booted up again, PMC and IMC are locked.

Trying to identify the hickup from the source code was fruitless. However, since the PMCTRANSPD channel acqusition failure occured long before the actual slow machine crashed, and since the hickup in the boot seemed to indicate a problem with daughter module identification, we started removing the DIO and DAQ modules:

  1. Started with the ones whose fail LED stayed lit during the boot process: the DIN (XVME-212) and the three DACs (VMIVME4113). No change.
  2. Also removed the DOUT (XVME-220) and the two ADCs (VMIVME 3113A and VMIVME3123). It boots just fine and can be telnetted into!
  3. Pushed the DIN and the DACs back in. Still boots.
  4. Pushed only VMIVME3123 back in. Boot stalls again.
  5. Removed VMIVME3123, pushed VMIVME 3113A back in. Boots successfully.
  6. Left VMIVME3123 loose in the crate without electrical contact for now.
  7. Proceeded to lock PMC and IMC

The particle counter channel should be working again.

  • VMIVME3123 is a 16-Bit High-Throughput Analog Input Board, 16 Channels with Simultaneous Sample-and-Hold Inputs
  • VMIVME3113A is a Scanning 12-Bit Analog-to-Digital Converter Module with 64 channels

/cvs/cds/caltech/target/c1psl/psl.db lists the following channels for VMIVME3123:

Channels currently in use (and therefore not available in the medm screens):

  • C1:PSL-FSS_SLOW_MON
  • C1:PSL-PMC_PMCERR
  • C1:PSL-FSS_SLOWM
  • C1:PSL-FSS_MIXERM
  • C1:PSL-FSS_RMTEMP
  • C1:PSL-PMC_PMCTRANSPD

Channels not currently in use (?):

  • C1:PSL-FSS_MINCOMEAS
  • C1:PSL-FSS_RCTRANSPD
  • C1:PSL-126MOPA_126MON
  • C1:PSL-126MOPA_AMPMON
  • C1:PSL-FSS_TIDALINPUT
  • C1:PSL-FSS_TIDALSET
  • C1:PSL-FSS_RCTEMP
  • C1:PSL-PPKTP_TEMP

There are plenty of channels available on the asynchronous ADC, so we could wire the relevant ones there if we done care about the 16 bit synchronous sampling (required for proper functionality?)

Alternatively, we could prioritize the Acromag upgrade on c1psl (DAQ would still be asynchronous, though). The PCBs are coming in next Monday and the front panels on Tuesday.

 

 

Some more info that might come in handy to someone someday:

The (nameless?) Windows 7 laptop that lives near MC2 and is used for the USB microscope was used for interfacing with c1psl. No special drivers were necessary to use the USB to RS232 adapter, and the RJ45 end of the grey homemade DB9 to RJ45 cable was plugged into the top port which is labeled "console 1". I downloaded the program "CoolTerm" from http://freeware.the-meiers.org/#CoolTerm, which is a serial protocol emulator, and it worked out of the box with the adapter. The standard settings fine worked for communicating with c1psl, only a small modification was necessary: in Options>Terminal make sure that "Enter Key Emulation" is set from "CR+LF" to "CR", otherwise each time 'Enter' is pressed it is actually sent twice.

  12854   Tue Feb 28 01:28:52 2017 johannesUpdateComputersc1psl un-bootable

It turned out the 'ringing' was caused by the respective other ETM still being aligned. For these reflection measurements both test masses of the other arm need to be misaligned. For the ETM it's sufficient to use the Misalign button in the medm screens, while the ITM has to be manually misaligned to move the reflected beam off the PD.

I did another round of armloss measurements today. I encountered some problems along the way

  • Some time today (around 6pm) most of the front end models had crashed and needed to be restarted GV: actually it was only the models on c1lsc that had crashed. I noticed this on Friday too.
  • ETMX keeps getting kicked up seemingly randomly. However, it settles fast into it's original position.

General Stuff:

  • Oscilloscope should sample both MC power (from MC2 transmitted beam) and AS signal
  • Channel data can only be loaded from the scope one channel at a time, so 'stop' scope acquisition and then grab the relevant channels individually
  • Averaging needs to be restarted everytime the mirrors are moved triggering stop and run remotely via the http interface scripts does this.

Procedure:

  1.     Run LSC Offsets
  2.     With the PSL shutter closed measure scope channel dark offsets, then open shutter
  3.     Align all four test masses with dithering to make sure the IFO alignment is in a known state
  4.     Pick an arm to measure
  5.     Turn the other arm's dither alignment off
  6.     'Misalign' that arm's ETM using medm screen button
  7.     Misalign that arm's ITM manually after disabling its OpLev servos looking at the AS port camera and make sure it doesn't hit the PD anymore.
  8.     Disable dithering for primary arm
  9.     Record MC and AS time series from (paused) scope
  10.     Misalign primary ETM
  11.     Repeat scope data recording

Each pair of readings gives the reflected power at the AS port normalized to the IMC stored power:

\widehat{P}=\frac{P_{AS}-\overline{P}_{AS}^\mathrm{dark}}{P_{MC}-\overline{P}_{MC}^\mathrm{dark}}

which is then averaged. The loss is calculated from the ratio of reflected power in the locked (L) vs misaligned (M) state from

\mathcal{L}=\frac{T_1}{4\gamma}\left[1-\frac{\overline{\widehat{P}_L}}{\overline{\widehat{P}_M}} +T_1\right ]-T_2

Acquiring data this way yielded P_L/P_M=1.00507 +/- 0.00087 for the X arm and P_L/P_M=1.00753 +/- 0.00095 for the Y arm. With \gamma_x=0.832 and \gamma_x=0.875 (from m1=0.179, m2=0.226 and 91.2% and 86.7% mode matching in X and Y arm, respectively) this yields round trip losses of:

\mathcal{L}_X=21\pm4\,\mathrm{ppm}  and  \mathcal{L}_Y=13\pm4\,\mathrm{ppm}, which is assuming a generalized 1% error in test mass transmissivities and modulation indices. As we discussed, this seems a little too good to be true, but at least the numbers are not negative.

  12943   Thu Apr 13 21:01:20 2017 ranaConfigurationComputersLG UltraWide on Rossa

we installed a new curved 34" doublewide monitor on Rossa, but it seems like it has a defective dead pixel region in it. Unless it heals itself by morning, we should return it to Amazon. Please don't throw out he packing materials.

Steve 8am next morning: it is still bad The monitor is cracked. It got kicked while traveling. It's box is damaged the same place.

Shipped back 4-17-2017

Attachment 1: LG34c.jpg
LG34c.jpg
Attachment 2: crack.jpg
crack.jpg
  12965   Wed May 3 16:12:36 2017 johannesConfigurationComputerscatastrophic multiple monitor failures

It seems we lost three monitors basically overnight.

The main (landscape, left) displays of Pianosa, Rossa and Allegra are all broken with the same failure mode:

their backlights failed. Gautam and I confirmed that there is still an image displayed on all three, just incredibly faint. While Allegra hasn't been used much, we can narrow down that Pianosa's and Rossa's monitors must have failed within 5 or 6 hours of each other, last night.

One could say ... they turned to the dark side cool

Quick edit; There was a functioning Dell 24" monitor next to the iMac that we used as a replacement for Pianosa's primary display. Once the new curved display is paired with Rossa we can use its old display for Donatella or Allegra.

  12966   Wed May 3 16:46:18 2017 KojiConfigurationComputerscatastrophic multiple monitor failures

- Is there any machine that can handle 4K? I have one 4K LCD for no use.
- I also can donate one 24" Dell

  12971   Thu May 4 09:52:43 2017 ranaConfigurationComputerscatastrophic multiple monitor failures

That's a new failure mode. Probably we can't trust the power to be safe anymore.

Need Steve to order a couple of surge suppressing power strips for the monitors. The computers are already on the UPS, so they don't need it.

  12978   Tue May 9 15:23:12 2017 SteveConfigurationComputerscatastrophic multiple monitor failures

Gautam and Steve,

Surge protective power strip was install on Friday, May 5 in the Control Room

Computers not connected to the UPS are plugged into Isobar12ultra.

Quote:

That's a new failure mode. Probably we can't trust the power to be safe anymore.

Need Steve to order a couple of surge suppressing power strips for the monitors. The computers are already on the UPS, so they don't need it.

 

Attachment 1: Trip-Lite.jpg
Trip-Lite.jpg
  12993   Mon May 15 20:43:25 2017 ranaConfigurationComputerscatastrophic multiple monitor failures

this is not the right one; this Ethernet controlled strip we want in the racks for remote control.

Buy some of these for the MONITORS.

Quote:

Surge protective power strip was install on Friday, May 5 in the Control Room

Computers not connected to the UPS are plugged into Isobar12ultra.

Quote:

That's a new failure mode. Probably we can't trust the power to be safe anymore.

Need Steve to order a couple of surge suppressing power strips for the monitors. The computers are already on the UPS, so they don't need it.

 

  13037   Sun Jun 4 14:19:33 2017 ranaFrogsComputersNetwork slowdown: Martians are behind a waterwall

A few weeks ago we did some internet speed tests and found a dramatic difference between our general network and our internal Martian network in terms of access speed to the outside world.

As you can see, the speed from nodus is consistent with a Gigabit connection. But the speeds from any machine on the inside is ~100x slower. We need to take a look at our router / NAT setup to see if its an old hardware problem or just something in the software firewall. By comparison, my home internet download speed test is ~48 Mbit/s; ~6x faster than our CDS computers.


controls@megatron|~> speedtest
/usr/local/bin/speedtest:5: UserWarning: Module dap was already imported from None, but /usr/lib/python2.7/dist-packages is being added to sys.path
  from pkg_resources import load_entry_point
Retrieving speedtest.net configuration...
Testing from Caltech (131.215.115.189)...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by Race Communications (Los Angeles, CA) [29.63 km]: 6.52 ms
Testing download speed................................................................................
Download: 6.35 Mbit/s
Testing upload speed................................................................................................
Upload: 5.10 Mbit/s
controls@megatron|~> exit
logout
Connection to megatron closed.
controls@nodus|~ > speedtest
Retrieving speedtest.net configuration...
Testing from Caltech (131.215.115.52)...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by Phyber Communications (Los Angeles, CA) [29.63 km]: 2.196 ms
Testing download speed................................................................................
Download: 721.92 Mbit/s
Testing upload speed................................................................................................
Upload: 251.38 Mbit/s

Attachment 1: Screen_Shot_2017-06-04_at_1.47.47_PM.png
Screen_Shot_2017-06-04_at_1.47.47_PM.png
Attachment 2: Screen_Shot_2017-06-04_at_1.44.42_PM.png
Screen_Shot_2017-06-04_at_1.44.42_PM.png
  13044   Mon Jun 5 21:53:55 2017 ranaUpdateComputersrossa: ubuntu 16.04

With the network config, mounting, and symlinks setup, rossa is able to be used as a workstation for dataviewer and MEDM. For DTT, no luck since there is so far no lscsoft support past the Ubuntu14 stage.

  13050   Wed Jun 7 15:41:51 2017 SteveUpdateComputerswindow laptop scanned

Randy Trudeau scanned our Window laptop Dell 13" Vostro and Steve's memory stick for virus. Nothing was found. The search continues...

Rana thinks that I'm creating these virus beasts with taking pictures with Dino Capture and /or Data Ray on the window machine........

 

 

  13065   Thu Jun 15 14:24:48 2017 Kaustubh, JigyasaUpdateComputersOttavia Switched On

Today, I and Jigyasa connected the Ottavia to one of the unused monitor screens Donatella. The Ottavia CPU had a label saying 'SMOKED''. One of the past elogs, 11091, dated back in March 2015, by Jenne had an update regarding the Ottavia smelling 'burny'. It seems to be working fine for about 2 hours now. Once it is connected to the Martian Network we can test it further. The Donatella screen we used seems to have a graphic problem, a damage to the display screen. Its a minor issue and does not affect the display that much, but perhaps it'll be better to use another screen if we plan to use the Ottavia in the future. We will power it down if there is an issue with it.

  13067   Thu Jun 15 19:49:03 2017 Kaustubh, JigyasaUpdateComputersOttavia Switched On

It has been working fine the whole day(we didn't do much testing on it though). We are leaving it on for the night.

Quote:

Today, I and Jigyasa connected the Ottavia to one of the unused monitor screens Donatella. The Ottavia CPU had a label saying 'SMOKED''. One of the past elogs, 11091, dated back in March 2015, by Jenne had an update regarding the Ottavia smelling 'burny'. It seems to be working fine for about 2 hours now. Once it is connected to the Martian Network we can test it further. The Donatella screen we used seems to have a graphic problem, a damage to the display screen. Its a minor issue and does not affect the display that much, but perhaps it'll be better to use another screen if we plan to use the Ottavia in the future. We will power it down if there is an issue with it.

 

  13068   Fri Jun 16 12:37:47 2017 Kaustubh, JigyasaUpdateComputersOttavia Switched On

Ottavia had been left running overnight and it seems to work fine. There has been no smell or any noticeable problems in the working. This morning Gautam, Kaustubh and I connected Ottavia to the Matrian Network through the Netgear switch in the 40m lab area. We were able to SSH into Ottavia through Pianosa and access directories. On the ottavia itself we were able to run ipython, access the internet. Since it seems to work out fine, Kaustubh and I are going to enable the ethernet connection to Ottavia and secure the wiring now.  

Quote:

It has been working fine the whole day(we didn't do much testing on it though). We are leaving it on for the night.

Quote:

Today, I and Jigyasa connected the Ottavia to one of the unused monitor screens Donatella. The Ottavia CPU had a label saying 'SMOKED''. One of the past elogs, 11091, dated back in March 2015, by Jenne had an update regarding the Ottavia smelling 'burny'. It seems to be working fine for about 2 hours now. Once it is connected to the Martian Network we can test it further. The Donatella screen we used seems to have a graphic problem, a damage to the display screen. Its a minor issue and does not affect the display that much, but perhaps it'll be better to use another screen if we plan to use the Ottavia in the future. We will power it down if there is an issue with it.

 

 

  13071   Fri Jun 16 23:27:19 2017 Kaustubh, JigyasaUpdateComputersOttavia Connected to the Netgear Box

I just connected the Ottavia to the Netgear box and its working just fine. It'll remain switched on over the weekend.

Quote:

Kaustubh and I are going to enable the ethernet connection to Ottavia and secure the wiring now.  

 

  13154   Mon Jul 31 20:35:42 2017 KojiSummaryComputersChiara backup situation summary

Summary
- CDS Shared files system: backed up
- Chiara system itself: not backed up


controls@chiara|~> df -m
Filesystem     1M-blocks    Used Available Use% Mounted on
/dev/sda1         450420   11039    416501   3% /
udev               15543       1     15543   1% /dev
tmpfs               3111       1      3110   1% /run
none                   5       0         5   0% /run/lock
none               15554       1     15554   1% /run/shm
/dev/sdb1        2064245 1718929    240459  88% /home/cds
/dev/sdd1        1877792 1426378    356028  81% /media/fb9bba0d-7024-41a6-9d29-b14e631a2628
/dev/sdc1        1877764 1686420     95960  95% /media/40mBackup

/dev/sda1 : System boot disk
/dev/sdb1 : main cds disk file system 2TB partition of 3TB disk (1TB vacant)
/dev/sdc1 : Daily backup of /dev/sdb1 via a cron job (/opt/rtcds/caltech/c1/scripts/backup/localbackup)

/dev/sdd1 : 2014 snap shot of cds. Not actively used. USB

https://nodus.ligo.caltech.edu:8081/40m/11640

 

ELOG V3.1.3-