40m QIL Cryo_Lab CTN SUS_Lab TCS_Lab OMC_Lab CRIME_Lab FEA ENG_Labs OptContFac Mariner WBEEShop
  40m Log, Page 90 of 335  Not logged in ELOG logo
ID Date Author Type Categoryup Subject
  7749   Tue Nov 27 00:26:00 2012 jamieOmnistructureComputersUbuntu update seems to have broken html input to elog on firefox

 After some system updates this evening, firefox can no longer handle the html input encoding for the elog.  I'm not sure what happened.  You can still use the "ELCode" or "plain" input encodings, but "HTML" won't work.  The problem seems to be firefox 17.  ottavia and rosalba were upgraded, while rossa and pianosa have not yet been.

I've installed chromium-browser (debranded chrome) on all the machines as a backup.  Hopefully the problem will clear itself up with the next update.  In the mean time I'll try to figure out what happened.

To use chromium: Appliations -> Internet -> Chromium

  7757   Wed Nov 28 17:40:28 2012 jamieOmnistructureComputerselog working again on firefox 17

Koji and I figured out what the problem is.  Apparently firefox 17.0 (specifically it's user-agent string) breaks fckeditor, which is the javascript toolbox the elog uses for the wysiwyg text editor.  See https://support.mozilla.org/en-US/questions/942438.

The suspect line was in elog/fckeditor/editor/js/fckeditorcode_gecko.js.  I hacked it up so that it stopped whatever crappy conditional user agent crap it was doing.  It seems to be working now.

Edit by Koji: In order to make this change work, I needed to clear the cache of firefox from Tool/Clear Recent History menu.

  7786   Tue Dec 4 20:38:51 2012 jamieOmnistructureComputersnew (beta) version of nds2 installed on control room machines

I've installed the new nds2 packages on the control room machines.

These new packages include some new and improved interfaces for python, matlab, and octave that were not previously available. See the documentation in:

  /usr/share/doc/nds2-client-doc/html/index.html

for details on how to use them.  They all work something like:

  conn = nds2.connection('fb', 8088)
  chans = conn.findChannels()
  buffers = conn.fetch(t1, t2, {c1,...})
  data = buffers(1).getData()

NOTE: the new interface for python is distinct from the one provided by pynds.  The old pynds interface should continue to work, though.

To use the new matlab interface, you have to first issue the following command:

   javaaddpath('/usr/lib/java')

I'll try to figure out a way to have that included automatically.

The old Matlab mex functions (NDS*_GetData, NDS*_GetChannel, etc.) are now provided by a new and improved package.  Those should now work "out of the box".

  7788   Tue Dec 4 23:08:46 2012 DenOmnistructureComputersnew (beta) version of nds2 installed on control room machines

Quote:

I've installed the new nds2 packages on the control room machines.

 I've tried new nds2 Java interface in Matlab. Using findChannels method of the connection class I see only slow, DQ and trend channels. I could even download data online using iterate method. When it will be possible to do the same with fast non-DQ channels?

>> conn = nds2.connection('fb', 8088);
>> conn.iterate({'C1:LSC-XARM_OUT'})
??? Java exception occurred:
java.lang.RuntimeException: No such channel.
    at nds2.nds2JNI.connection_iterate__SWIG_0(Native Method)
    at nds2.connection.iterate(connection.java:91)

  7791   Wed Dec 5 09:42:46 2012 ranaOmnistructureComputersnew (beta) version of NDS2 installed on control room machines

NDS2 is not designed for non DQ channels - it gets data from the frames, not through NDS1.

For getting the non-DQ stuff, I would just continue using our NDS1 compatible NDS mex files (this is what is used in mDV).

  7793   Wed Dec 5 16:54:29 2012 jamieOmnistructureComputersnew (beta) version of NDS2 installed on control room machines

Quote:

NDS2 is not designed for non DQ channels - it gets data from the frames, not through NDS1.

For getting the non-DQ stuff, I would just continue using our NDS1 compatible NDS mex files (this is what is used in mDV).

The NDS2 protocol is not for non-DQ, but the NDS2 client is capable of talking both the NDS1 and NDS2 protocols.

fb:8088 is an NDS1 server, so the client is talking NDS1 to fb.  It should therefore be capable of getting online data.

It doesn't seem to be seeing the online channels, though, so I'll work with Leo to figure out what's going on there.

The old mex functions, which like I said are now available, aren't capable of getting online data.

  7805   Mon Dec 10 16:28:13 2012 jamieOmnistructureComputersprogressive retrieval of online data now possible with the new NDS2 client

Leo fixed an issue with the new nds2-client packages that was preventing it from retrieving online data.  It's working now from matlab, python, and octave.

Here's an example of a dataviewer-like script in python:

#!/usr/bin/python

import sys
import nds2
from pylab import *

# channels are command line arguments
channels = sys.argv[1:]

conn = nds2.connection('fb', 8088)

fig = figure()
fig.show()
for bufs in conn.iterate(channels):
    fig.clf()
    for buf in bufs:
        plot(buf.data)
    draw()

  7859   Wed Dec 19 20:18:51 2012 ranaUpdateComputersWe are Changing the Passwerdz next week----

Be Prepared

http://xkcd.com/936/

  7920   Sat Jan 19 15:05:37 2013 JenneUpdateComputersAll front ends but c1lsc are down

Message I get from dmesg of c1sus's IOP:

[   44.372986] c1x02: Triggered the ADC
[   68.200063] c1x02: Channel Hopping Detected on one or more ADC modules !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[   68.200064] c1x02: Check GDSTP screen ADC status bits to id affected ADC modules
[   68.200065] c1x02: Code is exiting ..............
[   68.200066] c1x02: exiting from fe_code()

Right now, c1x02's max cpu indicator reads 73,000 micro seconds.  c1x05 is 4,300usec, and c1x01 seems totally fine, except that it has the 02xbad.

c1x02 has 0xbad (not 0x2bad).  All other models on c1sus, c1ioo, c1iscex and c1iscey all have 0x2bad.

Also, no models on those computers have 'heartbeats'.

C1x02 has "NO SYNC", but all other IOPs are fine.

I've tried rebooting c1sus, restarting the daqd process on fb, all to no avail.  I can ssh / ping all of the computers, but not get the models running.  Restarting the models also doesn't help.

upside_down_cat-t2.jpg

c1iscex's IOP dmesg:

[   38.626001] c1x01: Triggered the ADC
[   39.626001] c1x01: timeout 0 1000000
[   39.626001] c1x01: exiting from fe_code()

c1ioo's IOP has the same ADC channel hopping error as c1sus'.

 

  7922   Sat Jan 19 18:23:31 2013 ranaUpdateComputersAll front ends but c1lsc are down

After sshing into several machines and doing 'sudo shutdown -r now', some of them came back and ran their processes.

After hitting the reset button on the RFM switch, their diagnostic lights came back. After restarting the Dolphin task on fb:

"sudo /etc/init.d/dis_networkmgr restart"

the Dolphin diagnostic lights came up green on the FE status screen.

iscex still wouldn't come up. The awgtpman tasks on there keep trying to start but then stop due to not finding ADCs.

 

Then power cycled the IO Chassis for EX and then awtpman log files changed, but still no green lights. Then tried a soft reboot on fb and now its not booting correctly.

Hardware lights are on, but I can't telnet into it. Tried power cycling it once or twice, but no luck.

Probably Jamie will have to hook up a keyboard and monitor to it, to find out why its not booting.

FE.png

P.S. The snapshot scripts in the yellow button don't work and the MEDM screen itself is missing the time/date string on the top.

  7963   Wed Jan 30 13:50:27 2013 JenneUpdateComputersc1iscex still down

[Koji, Jenne]

We noticed that the iscex computer is still down, but the IOP is (was) running.  When we sat down to look at it, c1x01 was 'breathing', had a non-zero CPU_METER time, and the error was 0x4000, which I've never seen before.  The fb connection was still red though.  Also, it is claiming that its sync source is 1pps, not TDS like it usually is. 

Since things were different, Koji restarted the 2 other models running on iscex, with no resulting change.  We then did a 'rtcds restart all', and the IOP is no longer breathing, and the error message has changed to 0xbad.  The sync source is still 1pps.

Moral of the story:  c1iscex is still down, but temporarily showed signs of life that we wanted to record.

  7970   Thu Jan 31 10:23:39 2013 JamieUpdateComputersc1iscex still down

Quote:

[Koji, Jenne]

We noticed that the iscex computer is still down, but the IOP is (was) running.  When we sat down to look at it, c1x01 was 'breathing', had a non-zero CPU_METER time, and the error was 0x4000, which I've never seen before.  The fb connection was still red though.  Also, it is claiming that its sync source is 1pps, not TDS like it usually is. 

Since things were different, Koji restarted the 2 other models running on iscex, with no resulting change.  We then did a 'rtcds restart all', and the IOP is no longer breathing, and the error message has changed to 0xbad.  The sync source is still 1pps.

Moral of the story:  c1iscex is still down, but temporarily showed signs of life that we wanted to record.

There's definitely a timing issue with this machine.  I looked at it a bit yesterday.  I'll try to get to it by the end of the week.

  8036   Fri Feb 8 12:43:26 2013 yutaUpdateComputersvideocapture.py now supports movie capturing

I updated /opt/rtcds/caltech/c1/scripts/general/videoscripts.py so that it supports movie capturing. It saves captured images (bmp) and movies (mp4) in /users/sensoray/SensorayCaptures/ directory.
I also updated /opt/rtcds/caltech/c1/scripts/pylibs/pyndslib.py because /usr/bin/lalapps_tconvert is not working and now /usr/bin/tconvert works.
However, tconvert doesn't run on ottavia, so I need Jamie to fix it.

videocapture.py -h:
Usage:
    videocapture.py [cameraname] [options]

Example usage:
    videocapture.py MC2F -s 320x240 -t off
       (Camptures image of MC2F with the size of 320x240, without timestamp on the image. MUST RUN ON PIANOSA!)
    videocapture.py AS -m 10
       (Camptures 10 sec movie of AS with the size of 720x480. MUST RUN ON PIANOSA!)


Options:
  -h, --help          show this help message and exit
  -s SIZE             specify image size [default: 720x480]
  -t TIMESTAMP_ONOFF  timestamp on or off [default: on]
  -m MOVLENGTH        specity movie length (in sec; takes movie if specified) [default: 0]

  8062   Mon Feb 11 18:44:34 2013 JamieUpdateComputerspasswerdz changed

Quote:

Be Prepared

http://xkcd.com/936/

Password for nodus and all control room workstations has been changed.  Look for new one in usual place.

We will try to change the password on all the RTS machines soon.  For the moment, though, they remain with the old passwerd.

  8088   Fri Feb 15 15:21:07 2013 JamieUpdateComputersc1iscex IO-chassis dead

I appears that the c1iscex IO-chassis is either dead or in a very bad state.  The PCIe interface card in the IO-chassis is showing four red lights, where it's supposed to be showing a dozen or so green lights.  Obviously this is going to prevent anything from running.

We've had power issues with this chassis before, so possibly that's what we're running into now.  I'll pull the chassis and diagnose asap.

 

  8140   Fri Feb 22 20:28:17 2013 JamieUpdateComputerslinux1 dead, then undead

At around 2:30pm today something brought down most of the martian network.  All control room workstations, nodus, etc. were unresponsive.  After poking around for a bit I finally figured it had to be linux1, which serves the NFS filesystem for all the important CDS stuff.  linux1 was indeed completely unresponsive.

Looking closer I noticed that the Fibrenetix FX-606-U4 SCSI hardware RAID device connected to linux1 (see #1901), which holds cds network filesystem, was showing "IDE Channel #4 Error Reading" on it's little LCD display.  I assumed this was the cause of the linux1 crash.

I hard shutdown linux1, and powered off the Fibrenetix device.  I pulled the disk from slot 4 and replaced it with one of the spares we had in the control room cabinets.  I powered the device back up and it beeped for a while.  Unfortunately the device was requiring a password to access it from the front panel, and I could find no manual for the device in the lab, nor does the manufacturer offer the manual on it's web site.

Eventually I was able to get linux1 fully rebooted (after some fscks) and it seemed to mount the hardware RAID (as /dev/sdc1) fine.  The brought the NFS back.  I had to reboot nodus to get it recovered, but all the control room and front-end linux machines seemed to recover on their own (although the front-ends did need an mxstream restart).

The remaining problem is that the linux1 hardware RAID device is still currently unaccessible, and it's not clear to me that it's actually synced the new disk that I put in it.  In other words I have very little confidence that we actually have an operational RAID for /opt/rtcds.  I've contacted the LDAS guys (ie. Dan Kozak) who are managing the 40m backup to confirm that the backup is legit.  In the mean time I'm going to spec out some replacement disks onto which to copy /opt/rtcds, and also so that we can get rid of this old SCSI RAID thing.

Attachment 1: FX-606-U4_1205.pdf
FX-606-U4_1205.pdf
  8141   Sat Feb 23 00:34:28 2013 yutaUpdateComputerscrontab in op340m deleted and restored (maybe)

I accidentally overwrote crontab in op340m with an empty file.
By checking /var/cron in op340m, I think I restored it.

But somehow, autolockMCmain40m does not work in cron job, so it is currently running by nohup.

What I did:
  1. I ssh-ed op340m to edit crontab to change MC autolocker to usual power mode. I used "crontab -e", but it did not show anything. I exited emacs and op340m.
  2. Rana found that the file size of crontab went 0 when I did "crontab -e".
  3. I found my elog #6899 and added one line to crontab

55 * * * *  /opt/rtcds/caltech/c1/scripts/general/scripto_cron /opt/rtcds/caltech/c1/scripts/MC/autolockMCmain40m >/cvs/cds/caltech/logs/scripts/mclock.cronlog 2>&1

  4. It didn't run correctly, so Rana used his hidden power "nohup" to run autolockMCmain40m in background.
  5. Koji's hidden magic "/var/cron/log" gave me inspiration about what was in crontab. So, I made a new crontab in op340m like this;

34 * * * *  /opt/rtcds/caltech/c1/scripts/general/scripto_cron /opt/rtcds/caltech/c1/scripts/MC/autolockMCmain40m >/cvs/cds/caltech/logs/scripts/mclock.cronlog 2>&1
55 * * * * /opt/rtcds/caltech/c1/scripts/general/scripto_cron /opt/rtcds/caltech/c1/scripts/PSL/FSS/RCthermalPID.pl >/cvs/cds/caltech/logs/scripts/RCthermalPID.cronlog 2>&1
07 * * * * /opt/rtcds/caltech/c1/scripts/general/scripto_cron /opt/rtcds/caltech/c1/scripts/PSL/FSS/FSSSlowServo >/cvs/cds/caltech/logs/scripts/FSSslow.cronlog 2>&1
00 * * * * /opt/rtcds/caltech/c1/burt/autoburt/burt.cron >> /opt/rtcds/caltech/c1/burt/burtcron.log
13 * * * * /cvs/cds/caltech/conlog/bin/check_conlogger_and_restart_if_dead
14,44 * * * * /opt/rtcds/caltech/c1/scripts/SUS/rampdown.pl > /dev/null 2>&1


  6. It looks like some of them started running, but I haven't checked if they are working or not. We need to look into them.

Moral of the story:
  crontab needs backup.

  8144   Sat Feb 23 14:04:07 2013 KojiUpdateComputersapache retarted (Re: linux1 dead, then undead)

apache has been restarted.
How to: search "apache" on the 40m wiki

Quote:

I had to reboot nodus to get it recovered

 

  8146   Sat Feb 23 15:26:26 2013 yutaUpdateComputerscrontab in op340m updated

I found some daily cron jobs for op340m I missed last night. Also, I edited timings of hourly jobs to maintain consistency with the past. Some of them looks old, but I will leave as it is for now.
At least, burt, FSSSlowServo and autolockMCmain40m seems like they are working now.
If you notice something is missing, please add it to crontab.

07 * * * * /opt/rtcds/caltech/c1/burt/autoburt/burt.cron >> /opt/rtcds/caltech/c1/burt/burtcron.log
13 * * * * /opt/rtcds/caltech/c1/scripts/general/scripto_cron /opt/rtcds/caltech/c1/scripts/PSL/FSS/FSSSlowServo >/cvs/cds/caltech/logs/scripts/FSSslow.cronlog 2>&1
14,44 * * * * /cvs/cds/caltech/conlog/bin/check_conlogger_and_restart_if_dead
15,45 * * * * /opt/rtcds/caltech/c1/scripts/SUS/rampdown.pl > /dev/null 2>&1
55 * * * *  /opt/rtcds/caltech/c1/scripts/general/scripto_cron /opt/rtcds/caltech/c1/scripts/MC/autolockMCmain40m >/cvs/cds/caltech/logs/scripts/mclock.cronlog 2>&1
59 * * * * /opt/rtcds/caltech/c1/scripts/general/scripto_cron /opt/rtcds/caltech/c1/scripts/PSL/FSS/RCthermalPID.pl >/cvs/cds/caltech/logs/scripts/RCthermalPID.cronlog 2>&1

00 0 * * * /var/scripts/ntp.sh > /dev/null 2>&1
00 4 * * * /opt/rtcds/caltech/c1/scripts/RGA/RGAlogger.cron >> /cvs/cds/caltech/users/rward/RGA/RGAcron.out 2>&1
00 6 * * * /cvs/cds/scripts/backupScripts.pl
00 7 * * * /opt/rtcds/caltech/c1/scripts/AutoUpdate/update_conlog.cron

  8147   Sat Feb 23 15:46:16 2013 ranaUpdateComputerscrontab in op340m updated

According to Google, you can add a line in the crontab to backup the crontab by having the cronback.py script be in the scripts/ directory. It needs to save multiple copies, or else when someone makes the file size zero it will just write a zero size file onto the old backup.

  8181   Wed Feb 27 11:22:54 2013 yutaUpdateComputersbackup crontab

I made a simple script to backup crontab (/opt/rtcds/caltech/c1/scripts/crontab/backupCrontab).

#!/bin/bash

crontab -l > /opt/rtcds/caltech/c1/scripts/crontab/crontab_$(hostname).$(date '+%Y%m%d%H%M%S')


I put this script into op340m crontab.

00 8 * * * /opt/rtcds/caltech/c1/scripts/crontab/backupCrontab

It took me 30 minutes to write and check this one line script. I hate shell scripts.

  8266   Mon Mar 11 10:20:36 2013 Max HortonSummaryComputersAttempted Smart UPS 2200 Battery Replacement

Attempted Battery Replacement on Backup Power Supply in the Control Room:

I tried to replace the batteries in the Smart UPS 2200 with new batteries purchased by Steve.  However, the power port wasn't compatible with the batteries.  The battery cable's plug was too tall to fit properly into the Smart UPS port.  New batteries must be acquired.  Steve has pictures of the original battery (gray) and the new battery (blue) plugs, which look quite different (even though the company said the battery would fit).

The Correct battery connector is GRAY : APC RBC55

Attachment 1: upsB.jpg
upsB.jpg
Attachment 2: upsBa.jpg
upsBa.jpg
  8274   Tue Mar 12 00:35:56 2013 JenneUpdateComputersFB's RAID is beeping

[Manasa, Jenne]

Manasa just went inside to recenter the AS beam on the camera after our Yarm spot centering exercises of the evening, and heard a loud beeping.  We determined that it is the RAID attached to the framebuilder, which holds all of our frame data that is beeping incessantly.  The top center power switch on the back (there are FOUR power switches, and 3 power cables, btw.  That's a lot) had a red light next to it, so I power cycled the box.  After the box came back up, it started beeping again, with the same front panel message:

H/W monitor power #1 failed.

Right now the fb is trying to stay connected to things, and we can kind of use dataviewer, but we lose our connection to the framebuilder every ~30 seconds or so.  This rough timing estimate comes from how often we see the fb-related lights on the frontend status screen cycle from green to white to red back to green (or, how long do the lights stay green before going white again).  We weren't having trouble before the RAID went down a few minutes ago, so I'm hopeful that once that's fixed, the fb will be fine. 

In other news, just to make Jamie's day a little bit better, Dataviewer does not open on Pianosa or Rosalba.  The window opens, but it stays a blank grey box.  This has been going on for Pianosa for a few days, but it's new (to me at least) on Rosalba.  This is different from the lack of ability to connect to the fb that Rossa and Ottavia are seeing.

  8278   Tue Mar 12 12:06:22 2013 JamieUpdateComputersFB recovered, RAID power supply #1 dead

The framebuilder RAID is back online.  The disk had been mounted read-only (see below) so daqd couldn't write frames, which was in turn causing it to segfault immediately, so it was constantly restarting.

The jetstor RAID unit itself has a dead power supply.  This is not fatal, since it has three.  It has three so it can continue to function if one fails.  I have removed the bad supply and gave it to Steve so he can get a suitable replacement.

Some recovery had to be done on fb to get everything back up and running again.  I ran into issues trying to do it on the fly, so I eventually just rebooted.  It seemed to come back ok, except for something going on with daqd.  It was reporting the following error upon restart:

[Tue Mar 12 11:43:54 2013] main profiler warning: 0 empty blocks in the buffer

It was spitting out this message about once a second, until eventually the daqd died.  When it restarted it seemed to come back up fine.  I'm not exactly clear what those messages were about, but I think it has something to do with not being able to dump it's data buffers to disk.  I'm guessing that this was a residual problem from the umounted /frames, which somehow cleared on it's own.  Everything seems to be ok now.

Quote:

Manasa just went inside to recenter the AS beam on the camera after our Yarm spot centering exercises of the evening, and heard a loud beeping.  We determined that it is the RAID attached to the framebuilder, which holds all of our frame data that is beeping incessantly.  The top center power switch on the back (there are FOUR power switches, and 3 power cables, btw.  That's a lot) had a red light next to it, so I power cycled the box.  After the box came back up, it started beeping again, with the same front panel message:

H/W monitor power #1 failed.

DO NOT DO THIS.  This is what caused all the problems.  The unit has three redundant power supplies, for just this reason.  It was probably continuing to function fine.  The beeping was just to tell you that there was something that needed attention.  Rebooting the device does nothing to solve the problem.  Rebooting in an attempt to silence beeping is not a solution.  Shutting of the RAID unit is basically the equivalent of ripping out a mounted external USB drive.  You can damage the filesystem that way.  The disk was still functioning properly.  As far as I understand it the only problem was the beeping, and there were no other issues.  After you hard rebooted the device, fb lost it's mounted disk and then went into emergency mode, which was to remount the disk read-only.  It didn't understand what was going on, only that the disk seemed to disappear and the reappear.  This was then what caused the problems.  It was not the beeping, it was the restarting the RAID that was mounted on fb.

Computers are not like regular pieces of hardware.  You can't just yank the power on them.  Worse yet is yanking the power on a device that is connected to a computer.  DON"T DO THIS UNLESS YOU KNOW WHAT YOU"RE DOING.  If the device is a disk drive, then doing this is a sure-fire way to damage data on disk.

 

  8280   Tue Mar 12 14:51:00 2013 SteveUpdateComputersbuy warranty or not ?

 Details of the warranties are posted on wiki power supply cost, warranty described, cost

.......I’ve also attached a warranty renewal quote.  A 1 year warranty renewal is usually $.... per year, but we gave you special pricing of $.... / year if you renew both units.  This pricing is also special due to the fact that both warranties expired awhile ago.  We usually require that the warranty renewal begin on the date of expiration, but we will waive this for you this time if both are renewed.

 

JetStor SATA 416S, SN: SB09040111A3 – expired 04/24/2012 (3 years old)

 

JetStor SATA 516F, SN: SB09080016P – expired on 08/21/2012........

 

. Are we keep it for an other 2 years? buy warranty or buy better storage.

 

  8324   Thu Mar 21 10:29:12 2013 ManasaUpdateComputersComputers down since last night

I'm trying to figure out what went wrong last night. But the morning status...the computers are down.

 

down.png

Attachment 1: down.png
down.png
  8325   Thu Mar 21 12:04:05 2013 ManasaUpdateComputersFixed

All FE computers are back.

Restart procedure:

0a. Restart frame builder: telnet fb 8087 & type shutdown

0b. Restart mx_stream from the FE overview screen

1. I ssh ed to the computer. (c1lsc, c1ioo, c1iscex, c1isey)

2. I used 'sudo shutdown -r (computername)'. They came back ON.

3. While rebooting c1ioo, c1sus shutdown (for reasons I don't know). I could not ping or ssh c1SUS after this.

4. I went in and switched c1SUS computer OFF and back ON after which I could ssh to it.

5. I did the same reboot procedure for c1SUS.

6. I had to restart some of the models individually.
    (i) ssh to the computer running the model
    (ii) rtcds restart 'model name'

7. All computers are back now.

up.png

  8326   Thu Mar 21 12:33:51 2013 ranaUpdateComputersFixed

Please stop power cycling computers. This is not an acceptable operation (as Jamie already wrote before). When you don't know what to do besides power cycling the computer, just stop and do something else or call someone who knows more. Every time you kill the power to a computer you are taking a chance on damaging it or corrupting some hard drive.

In this case, the right thing to do would be to hook up the external keyboard and monitor directly to c1sus to diagnose things.

NO MORE TOUCHING THE POWER BUTTON.

  8334   Mon Mar 25 09:52:22 2013 JenneUpdateComputersc1lsc mxstream won't restart

Most of the front ends' mx streams weren't running, so I did the old mxstreamrestart on all machines (see elog 6574....the dmesg on c1lsc right now, at the top, has similar messages).  Usually this mxstream restart works flawlessly, but today c1lsc isn't working.  Usually to the right side of the terminal window I get an [ok] when things work.  For the lsc machine today, I get [!!] instead. 

After having learned from recent lessons, I'm waiting to hear from Jamie.

  8335   Mon Mar 25 11:42:45 2013 JamieUpdateComputersc1lsc mx_stream ok

I'm not exactly sure what the problem was here, but I think it had to do with a stuck mx_stream process that wasn't being killed properly.  I manually killed the process and it seemed to come up fine after that.  The regular restart mechanisms should work now.

No idea what caused the process to hang in the first place, although I know the newer RCG (2.6) is supposed to address some of these mx_stream issues.

  8366   Thu Mar 28 10:44:30 2013 ManasaUpdateComputersc1lsc down

c1lsc was down this morning.

I restarted fb and c1lsc based on elog

Everything but c1oaf came back. I tried to restart c1oaf individually; but it didn't work.

Before:

cds_FE.png

After:

cds_fe1.png

  8367   Thu Mar 28 12:50:52 2013 JenneUpdateComputersc1lsc is fine

 Manasa told me that she did things in a different order than her old elog. 

She had

(1) ssh'ed to c1lsc and did a remote shutdown / restart,

(2) restarted fb,

(3) restarted the mxstream on c1lsc,

(4) restarted each model individually in some order that I forgot to ask.

However, with the situation as in her "before" screenshot, all that needed to be done was restart the mxstream process on c1lsc. 

Anyhow, when I looked at the OAF model, it was complaining of "no sync", so I restarted the model, and it came back up fine.  All is well again.

  8374   Fri Mar 29 17:24:43 2013 JamieUpdateComputersFB RAID power supply replaced

Steve ordered a replacement power supply for the FB JetStor power supply that failed a couple weeks ago.  I just installed it and it looks fine.

  8394   Tue Apr 2 20:52:35 2013 ranaUpdateComputersiMac bashed

 I changed the default shell on our control room iMac to bash. Since we're really, really using bash as the shell for LIGO, we might as well get used to it. As we do this for the workstations, some things will fail, but we can adopt Jamie's private .bashrc to get started and then fix it up later.

  8398   Wed Apr 3 01:32:04 2013 JenneUpdateComputersupdated EPICS database (channels selected for saving)

I modified /opt/rtcds/caltech/c1/chans/daq/C0EDCU.ini to include the C1:LSC-DegreeOfFreedom_TRIG_MON channels.  These are the same channel that cause the LSC screen trigger indicators to light up. 

I vaguely followed Koji's directions in elog 5991, although I didn't add new grecords, since these channels are already included in the .db file as a result of EpicsOut blocks in the simulink model.  So really, I only did Step 2.  I still need to restart the framebuilder, but locking (attempt at locking) is happening.

The idea here is that we should be able to search through this channel, and when we get a trigger, we can go back and plot useful signals (PDs, error signals, cotrol signals,....), and try to figure out why we're losing lock. 

Rana tells me that this is similar to an old LockAcq script that would run DTT and get data.

EDIT:  I restarted the daqd on the fb, and I now see the channel in dataviewer, but I can only get live data, no past data, even though it says that it is (16,float).  Here's what Dataviewer is telling me:

Connecting to NDS Server fb (TCP port 8088)
Connecting.... done
read(); errno=0
LONG: DataRead = -1
No data found

read(); errno=9
read(); errno=9
T0=13-03-29-08-59-43; Length=432010 (s)
No data output.

 

  8400   Wed Apr 3 14:45:34 2013 JamieUpdateComputersupdated EPICS database (channels selected for saving)

Quote:

I modified /opt/rtcds/caltech/c1/chans/daq/C0EDCU.ini to include the C1:LSC-DegreeOfFreedom_TRIG_MON channels.  These are the same channel that cause the LSC screen trigger indicators to light up. 

I vaguely followed Koji's directions in elog 5991, although I didn't add new grecords, since these channels are already included in the .db file as a result of EpicsOut blocks in the simulink model.  So really, I only did Step 2.  I still need to restart the framebuilder, but locking (attempt at locking) is happening.

The idea here is that we should be able to search through this channel, and when we get a trigger, we can go back and plot useful signals (PDs, error signals, cotrol signals,....), and try to figure out why we're losing lock. 

Rana tells me that this is similar to an old LockAcq script that would run DTT and get data.

EDIT:  I restarted the daqd on the fb, and I now see the channel in dataviewer, but I can only get live data, no past data, even though it says that it is (16,float).  Here's what Dataviewer is telling me:

Connecting to NDS Server fb (TCP port 8088)
Connecting.... done
read(); errno=0
LONG: DataRead = -1
No data found

read(); errno=9
read(); errno=9
T0=13-03-29-08-59-43; Length=432010 (s)
No data output.
 

I seem to be able to retrieve these channels ok from the past:

controls@pianosa:/opt/rtcds/caltech/c1/scripts 0$ tconvert 1049050000
Apr 03 2013 18:46:24 UTC
controls@pianosa:/opt/rtcds/caltech/c1/scripts 0$ ./general/getdata -s 1049050000 -d 10 --noplot C1:LSC-PRCL_TRIG_MON
Connecting to server fb:8088 ...
nds_logging_init: Entrynds_logging_init: Exit
fetching... 1049050000.0
Hit any key to exit: 
controls@pianosa:/opt/rtcds/caltech/c1/scripts 0$ 

Maybe DTT just needed to be reloaded/restarted?

  8444   Thu Apr 11 11:58:21 2013 JenneUpdateComputersLSC whitening c-code ready

The big hold-up with getting the LSC whitening triggering ready has been a problem with running the c-code on the front end models.  That problem has now been solved (Thanks Alex!), so I can move forward.

The background:

We want the RFPD whitening filters to be OFF while in acquisition mode, but after we lock, we want to turn the analog whitening (and the digital compensation) ON.  The difference between this and the other DoF and filter module triggers is that we must parse the input matrix to see which PD is being used for locking at that time.  It is the c-code that parses this matrix that has been causing trouble.  I have been testing this code on the c1tst.mdl, which runs on the Y-end computer.  Every time I tried to compile and run the c1tst model, the entire Y-end computer would crash.

The solution:

Alex came over to look at things with Jamie and me.  In the 2.5 version of the RCG (which we are still using), there is an optimization flag "-O3" in the make file.  This optimization, while it can make models run a little faster, has been known in the past to cause problems.  Here at the 40m, our make files had an if-statement, so that the c1pem model would compile using the "-O" optimization flag instead, so clearly we had seen the problem here before, probably when Masha was here and running the neural network code on the pem model.  In the RCG 2.6 release, all models are compiled using the "-O" flag.  We tried compiling the c1tst model with this "-O" optimization, and the model started and the computer is just fine.  This solved the problem.

Since we are going to upgrade to RCG 2.6 in the near-ish future anyway, Alex changed our make files so that all models will now compile with the "-O" flagWe should monitor other models when we recompile them, to make sure none of them start running long with the different optimization. 

The future:

Implement LSC whitening triggering!

  8479   Tue Apr 23 22:10:54 2013 ranaUpdateComputersNancy

controls@rosalba:/users/rana/docs 0$ svn resolve --accept working nancy
Resolved conflicted state of 'nancy'

  8529   Sat May 4 00:21:00 2013 ranaConfigurationComputersworkstation updates

 Koji and I went into "Update Manager" on several of the Ubuntu workstations and unselected the "check for updates" button. This is to prevent the machines from asking to be upgraded so frequently - I am concerned that someone might be tempted to upgrade the workstations to Ubuntu 12.

We didn't catch them all, so please take a moment to check that this is the case on all the laptops you are using and make it so. We can then apply the updates in a controlled manner once every few months.

  8540   Tue May 7 17:43:51 2013 JamieUpdateComputers40MARS wireless network problems

I'm not sure what's going on today but we're seeing ~80% packet loss on the 40MARS wireless network.  This is obviously causing big problems for all of our wirelessly connected machines.  The wired network seems to be fine.

I've tried power cycling the wireless router but it didn't seem to help.  Not sure what's going on, or how it got this way.  Investigating...

  8541   Tue May 7 18:16:37 2013 JamieUpdateComputers40MARS wireless network problems

Here's an example of the total horribleness of what's happening right now:

controls@rossa:~ 0$ ping 192.168.113.222
PING 192.168.113.222 (192.168.113.222) 56(84) bytes of data.
From 192.168.113.215 icmp_seq=2 Destination Host Unreachable
From 192.168.113.215 icmp_seq=3 Destination Host Unreachable
From 192.168.113.215 icmp_seq=4 Destination Host Unreachable
From 192.168.113.215 icmp_seq=5 Destination Host Unreachable
From 192.168.113.215 icmp_seq=6 Destination Host Unreachable
From 192.168.113.215 icmp_seq=7 Destination Host Unreachable
From 192.168.113.215 icmp_seq=9 Destination Host Unreachable
From 192.168.113.215 icmp_seq=10 Destination Host Unreachable
From 192.168.113.215 icmp_seq=11 Destination Host Unreachable
64 bytes from 192.168.113.222: icmp_seq=12 ttl=64 time=10341 ms
64 bytes from 192.168.113.222: icmp_seq=13 ttl=64 time=10335 ms
^C
--- 192.168.113.222 ping statistics ---
35 packets transmitted, 2 received, +9 errors, 94% packet loss, time 34021ms
rtt min/avg/max/mdev = 10335.309/10338.322/10341.336/4.406 ms, pipe 11
controls@rossa:~ 0$ 

Note that 10 SECOND round trip time and 94% packet loss.  That's just beyond stupid.  I have no idea what's going on.

  12721   Mon Jan 16 12:49:06 2017 ranaConfigurationComputersMegatron update

The "apt-get update" was failing on some machines because it couldn't find the 'Debian squeeze' repos, so I made some changes so that Megatron could be upgraded.

I think Jamie set this up for us a long time ago, but now the LSC has stopped supporting these versions of the software. We're running Ubuntu12 and 'squeeze' is meant to support Ubuntu10. Ubuntu12 (which is what LLO is running) corresponds to 'Debian-wheezy' and Ubuntu14 to 'Debian-Jessie' and Ubuntu16 to 'debian-stretch'.

We should consider upgrading a few of our workstations to Ubuntu 14 LTS to see how painful it is to run our scripts and DTT and DV. Better to upgrade a bit before we are forced to by circumstance.

I followed the instructions from software.ligo.org (https://wiki.ligo.org/DASWG/DebianWheezy) and put the recommended lines into the /etc/apt/sources.list.d/lsc-debian.list file.

but I still got 1 error (previously there were ~7 errors):

W: Failed to fetch http://software.ligo.org/lscsoft/debian/dists/wheezy/Release  Unable to find expected entry 'contrib/binary-i386/Packages' in Release file (Wrong sources.list entry or malformed file)

Restarting now to see if things work. If its OK, we ought to change our squeeze lines into wheezy for all workstations so that our LSC software can be upgraded.

  12724   Mon Jan 16 22:03:30 2017 jamieConfigurationComputersMegatron update
Quote:
 

We should consider upgrading a few of our workstations to Ubuntu 14 LTS to see how painful it is to run our scripts and DTT and DV. Better to upgrade a bit before we are forced to by circumstance.

I would recommend upgrading the workstations to one of the reference operating systems, either SL7 or Debian squeeze, since that's what the sites are moving towards.  If you do that you can just install all the control room software from the supported repos, and not worry about having to compile things from source anymore.

  12849   Thu Feb 23 15:48:43 2017 johannesUpdateComputersc1psl un-bootable

Using the PDA520 detector on the AS port I tried to get some better estimates for the round-trip loss in both arms. While setting up the measurement I noticed some strange output on the scope I'm using to measure the amount of reflected light.

The interferometer was aligned using the dither scripts for both arms. Then, ITMY was majorly misaligned in pitch AND yaw such that the PD reading did not change anymore. Thus, only light reflected from the XARM was incident of the AS PD. The scope was showing strange oscillations (Channel 2 is the AS PD signal):

For the measurement we compare the DC level of the reflection with the ETM aligned (and the arm locked) vs a misaligned ETM (only ITM reflection). This ringing could be observed in both states, and was qualitatively reproducible with the other arm. It did not show up in the MC or ARM transmission. I found that changing the pitch of the 'active' ITM (=of the arm under investigation) either way by just a couple of ticks made it go away and settle roughly at the lower bound of the oscillation:

In this configuration the PD output follows the mode cleaner transmission (Channel 3 in the screen caps) quite well, but we can't take the differential measurement like this, because it is impossible to align and lock the arm but them misalign the ITM. Moving the respective other ITM for potential secondary beams did not seem to have an obvious effect, although I do suspect a ghost/secondary beam to be the culprit for this. I moved the PDA520 on the optical table but didn't see a change in the ringing amplitude. I do need to check the PD reflection though.

Obviously it will be hard to determine the arm loss this way, but for now I used the averaging function of the scope to get rid of the ringing. What this gave me was:
(16 +/- 9) ppm losses in the x-arm and (-18+/-8) ppm losses in the y-arm

The negative loss obviously makes little sense, and even the x-arm number seems a little too low to be true. I strongly suspect the ringing is responsible and wanted to investigate this further today, but a problem with c1psl came up that shut down all work on this until it is fixed:

I found the PMC unlocked this morning and c1psl (amongst other slow machines) was unresponsive, so I power-cycled them. All except c1psl came back to normal operation. The PMC transmission, as recorded by c1psl,  shows that it has been down for several days:

Repeated attempts to reset and/or power-cycle it by Gautam and myself could not bring it back. The fail indicator LED of a single daughter card (the DOUT XVME-212) turns off after reboot, all others stay lit. The sysfail LED on the crate is also on, but according to elog 10015 this is 'normal'. I'm following up that post's elog tree to monitor the startup of c1psl through its system console via a serial connection to find out what is wrong.

  12850   Thu Feb 23 18:52:53 2017 ranaUpdateComputersc1psl un-bootable

The fringes seen on the oscope are mostly likely due to the interference from multiple light beams. If there are laser beams hitting mirrors which are moving, the resultant interference signal could be modulated at several Hertz, if, for example, one of the mirrors had its local damping disabled.

  12851   Thu Feb 23 19:44:48 2017 johannesUpdateComputersc1psl un-bootable

Yes, that was one of the things that I wanted to look into. One thing Gautam and I did that I didn't mention was to reconnect the SRM satellite box and move the optic around a bit, which didn't change anything. Once the c1psl problem is fixed we'll resume with that.

Quote:

The fringes seen on the oscope are mostly likely due to the interference from multiple light beams. If there are laser beams hitting mirrors which are moving, the resultant interference signal could be modulated at several Hertz, if, for example, one of the mirrors had its local damping disabled.

 

Speaking of which:

Using one of the grey RJ45 to D-Sub cables with an RS232 to USB adapter I was able to capture the startup log of c1psl (using the usb camera windows laptop). I also logged the startup of the "healthy" c1aux, both are attached. c1psl stalls at a point were c1aux starts testing for present vme modules and doesn't continue, however is not strictly hung up, as it still registers to the logger when external login attempts via telnet occur. The telnet client simply reports that the "shell is locked" and exits. It is possible that one of the daughter cards causes this. This seems to happen after iocInit is called by the startup script at /cvs/cds/caltech/target/c1psl/startup.cmd, as it never gets to the next item "coreRelease()". Gautam and I were trying to find out what happends inside iocInit, but it's not clear to us at this point from where it is even called. iocInit.c and compiled binaries exist in several places on the shared drive. However, all belong to R3.14.x epics releases, while the logfile states that the R3.12.2 epics core is used when iocInit is called.

Next we'll interrupt the autoboot procedure and try to work with the machine directly.

Attachment 1: slow_startup_logs.tar.gz
  12852   Fri Feb 24 20:38:01 2017 johannesUpdateComputersc1psl boot-stall culprit identified

[Gautam, Johannes]

c1psl finally booted up again, PMC and IMC are locked.

Trying to identify the hickup from the source code was fruitless. However, since the PMCTRANSPD channel acqusition failure occured long before the actual slow machine crashed, and since the hickup in the boot seemed to indicate a problem with daughter module identification, we started removing the DIO and DAQ modules:

  1. Started with the ones whose fail LED stayed lit during the boot process: the DIN (XVME-212) and the three DACs (VMIVME4113). No change.
  2. Also removed the DOUT (XVME-220) and the two ADCs (VMIVME 3113A and VMIVME3123). It boots just fine and can be telnetted into!
  3. Pushed the DIN and the DACs back in. Still boots.
  4. Pushed only VMIVME3123 back in. Boot stalls again.
  5. Removed VMIVME3123, pushed VMIVME 3113A back in. Boots successfully.
  6. Left VMIVME3123 loose in the crate without electrical contact for now.
  7. Proceeded to lock PMC and IMC

The particle counter channel should be working again.

  • VMIVME3123 is a 16-Bit High-Throughput Analog Input Board, 16 Channels with Simultaneous Sample-and-Hold Inputs
  • VMIVME3113A is a Scanning 12-Bit Analog-to-Digital Converter Module with 64 channels

/cvs/cds/caltech/target/c1psl/psl.db lists the following channels for VMIVME3123:

Channels currently in use (and therefore not available in the medm screens):

  • C1:PSL-FSS_SLOW_MON
  • C1:PSL-PMC_PMCERR
  • C1:PSL-FSS_SLOWM
  • C1:PSL-FSS_MIXERM
  • C1:PSL-FSS_RMTEMP
  • C1:PSL-PMC_PMCTRANSPD

Channels not currently in use (?):

  • C1:PSL-FSS_MINCOMEAS
  • C1:PSL-FSS_RCTRANSPD
  • C1:PSL-126MOPA_126MON
  • C1:PSL-126MOPA_AMPMON
  • C1:PSL-FSS_TIDALINPUT
  • C1:PSL-FSS_TIDALSET
  • C1:PSL-FSS_RCTEMP
  • C1:PSL-PPKTP_TEMP

There are plenty of channels available on the asynchronous ADC, so we could wire the relevant ones there if we done care about the 16 bit synchronous sampling (required for proper functionality?)

Alternatively, we could prioritize the Acromag upgrade on c1psl (DAQ would still be asynchronous, though). The PCBs are coming in next Monday and the front panels on Tuesday.

 

 

Some more info that might come in handy to someone someday:

The (nameless?) Windows 7 laptop that lives near MC2 and is used for the USB microscope was used for interfacing with c1psl. No special drivers were necessary to use the USB to RS232 adapter, and the RJ45 end of the grey homemade DB9 to RJ45 cable was plugged into the top port which is labeled "console 1". I downloaded the program "CoolTerm" from http://freeware.the-meiers.org/#CoolTerm, which is a serial protocol emulator, and it worked out of the box with the adapter. The standard settings fine worked for communicating with c1psl, only a small modification was necessary: in Options>Terminal make sure that "Enter Key Emulation" is set from "CR+LF" to "CR", otherwise each time 'Enter' is pressed it is actually sent twice.

  12854   Tue Feb 28 01:28:52 2017 johannesUpdateComputersc1psl un-bootable

It turned out the 'ringing' was caused by the respective other ETM still being aligned. For these reflection measurements both test masses of the other arm need to be misaligned. For the ETM it's sufficient to use the Misalign button in the medm screens, while the ITM has to be manually misaligned to move the reflected beam off the PD.

I did another round of armloss measurements today. I encountered some problems along the way

  • Some time today (around 6pm) most of the front end models had crashed and needed to be restarted GV: actually it was only the models on c1lsc that had crashed. I noticed this on Friday too.
  • ETMX keeps getting kicked up seemingly randomly. However, it settles fast into it's original position.

General Stuff:

  • Oscilloscope should sample both MC power (from MC2 transmitted beam) and AS signal
  • Channel data can only be loaded from the scope one channel at a time, so 'stop' scope acquisition and then grab the relevant channels individually
  • Averaging needs to be restarted everytime the mirrors are moved triggering stop and run remotely via the http interface scripts does this.

Procedure:

  1.     Run LSC Offsets
  2.     With the PSL shutter closed measure scope channel dark offsets, then open shutter
  3.     Align all four test masses with dithering to make sure the IFO alignment is in a known state
  4.     Pick an arm to measure
  5.     Turn the other arm's dither alignment off
  6.     'Misalign' that arm's ETM using medm screen button
  7.     Misalign that arm's ITM manually after disabling its OpLev servos looking at the AS port camera and make sure it doesn't hit the PD anymore.
  8.     Disable dithering for primary arm
  9.     Record MC and AS time series from (paused) scope
  10.     Misalign primary ETM
  11.     Repeat scope data recording

Each pair of readings gives the reflected power at the AS port normalized to the IMC stored power:

\widehat{P}=\frac{P_{AS}-\overline{P}_{AS}^\mathrm{dark}}{P_{MC}-\overline{P}_{MC}^\mathrm{dark}}

which is then averaged. The loss is calculated from the ratio of reflected power in the locked (L) vs misaligned (M) state from

\mathcal{L}=\frac{T_1}{4\gamma}\left[1-\frac{\overline{\widehat{P}_L}}{\overline{\widehat{P}_M}} +T_1\right ]-T_2

Acquiring data this way yielded P_L/P_M=1.00507 +/- 0.00087 for the X arm and P_L/P_M=1.00753 +/- 0.00095 for the Y arm. With \gamma_x=0.832 and \gamma_x=0.875 (from m1=0.179, m2=0.226 and 91.2% and 86.7% mode matching in X and Y arm, respectively) this yields round trip losses of:

\mathcal{L}_X=21\pm4\,\mathrm{ppm}  and  \mathcal{L}_Y=13\pm4\,\mathrm{ppm}, which is assuming a generalized 1% error in test mass transmissivities and modulation indices. As we discussed, this seems a little too good to be true, but at least the numbers are not negative.

  12943   Thu Apr 13 21:01:20 2017 ranaConfigurationComputersLG UltraWide on Rossa

we installed a new curved 34" doublewide monitor on Rossa, but it seems like it has a defective dead pixel region in it. Unless it heals itself by morning, we should return it to Amazon. Please don't throw out he packing materials.

Steve 8am next morning: it is still bad The monitor is cracked. It got kicked while traveling. It's box is damaged the same place.

Shipped back 4-17-2017

Attachment 1: LG34c.jpg
LG34c.jpg
Attachment 2: crack.jpg
crack.jpg
  12965   Wed May 3 16:12:36 2017 johannesConfigurationComputerscatastrophic multiple monitor failures

It seems we lost three monitors basically overnight.

The main (landscape, left) displays of Pianosa, Rossa and Allegra are all broken with the same failure mode:

their backlights failed. Gautam and I confirmed that there is still an image displayed on all three, just incredibly faint. While Allegra hasn't been used much, we can narrow down that Pianosa's and Rossa's monitors must have failed within 5 or 6 hours of each other, last night.

One could say ... they turned to the dark side cool

Quick edit; There was a functioning Dell 24" monitor next to the iMac that we used as a replacement for Pianosa's primary display. Once the new curved display is paired with Rossa we can use its old display for Donatella or Allegra.

ELOG V3.1.3-