40m Log
ID   Date   Author   Type   Category   Subject
  7259   Thu Aug 23 17:17:49 2012   Masha   Summary   Computers   Code Folder Status

I cleaned up my directory (/users/masha) today. A lot of the files are just code that I experimented with, but the important files for training the classification neural network are in "neural_network_classification". The "EarthquakeData" subdirectory contains my entire dataset. Files of the form "GenerateRNNInput" are used to create input vector sets for the network, while files of the form "*NeuralNetworkClassification*" actually run the code that generates the neural network vectors for the classification code block in the c1pem model.

Also, the folder "feed_c", which can also be found in Den's directory, contains the neural network controller code we played around with.

  7362   Fri Sep 7 15:31:52 2012   Mike J.   Update   Computers   Sensoray back up

Video Capture with the Sensoray works again. Pianosa just needed mplayer installed for it to play properly.

Attachment 1: output_5.mp4
  7364   Fri Sep 7 17:24:16 2012   Mike J.   Update   Computers   Sensoray Video Capture

To capture video with the Sensoray, open the GUI (python ./demo.py), simply press "Save," enter a filename, and hit "Stop" when you wish to stop recording. If you want to change the video format, there is a dropdown menu labelled "Format." I recommend MP4 for standard video, and nv12 for RAW video.

  7365   Fri Sep 7 17:34:53 2012   Jenne   Update   Computers   Sensoray Video Capture

Quote:

To capture video with the Sensoray, open the GUI (python ./demo.py), simply press "Save," enter a filename, and hit "Stop" when you wish to stop recording. If you want to change the video format, there is a dropdown menu labelled "Format." I recommend MP4 for standard video, and nv12 for RAW video.

 I also installed mplayer on rossa, so we can play the videos there.

Even though Mike won't admit it, the video stuff is all in /users/sensoray/ .  I opened the demo.py from there, and it also works.

  7495   Sun Oct 7 12:11:00 2012   Aidan   Update   Computers   Rebooted cymac0

I rebooted cymac0 a couple of times. When I first got here it was just frozen. I rebooted it and then ran a model (x1ios). The machine froze the second time I ran ./killx1ios. I've rebooted it again.

  7498   Mon Oct 8 09:45:28 2012   jamie   Update   Computers   Rebooted cymac0

Quote:

I rebooted cymac0 a couple of times. When I first got here it was just frozen. I rebooted it and then ran a model (x1ios). The machine froze the second time I ran ./killx1ios. I've rebooted it again.

For context, there's a stand-alone cymac test system running at the 40m.  It's not hooked up to anything, except for being on the martian network (it's not currently mounting any 40m CDS filesystems, for instance).  The machine is temporarily between the 1Y4 and 1Y5 racks.

  7552   Mon Oct 15 22:24:45 2012   Jenne   Update   Computers   Lots of new White :(

Evan and I are starting to lock, and there is lots of new, unfortunate white stuff on several different screens.

C1:TIM-PACIFIC_STRING is gone, C1:IFO-STATE (MC state) is gone, C1:LSC-PZT..._requests are gone (all 4 of them), C1:PSL-FSS_FASTSWEEPTEST from the FSS screen is gone (although I'm not sure that that one is newly gone), lots of the WF AA lights on the LSC screen are gone.

Those are the things I find in a few minutes of not really looking around.

EDIT:  IPPOS is also gone, so I can't see how my current alignment relates to old alignments.

  7570   Wed Oct 17 19:35:58 2012   Koji   Update   Computers   Re: Lots of new White :(

Solved. The power cord of c1iscaux was loose.
Has anyone worked around the back side of 1Y3?


I looked into the problem. I went through the channel lists for each of the slow machines and found that the missing variables are served by c1iscaux.

controls@pianosa:/cvs/cds/caltech/target/c1iscaux 0$ cd /cvs/cds/caltech/target/c1iscaux
controls@pianosa:/cvs/cds/caltech/target/c1iscaux 0$ grep C1:IF *
C1IFO_STATE.db:grecord(ai,"C1:IFO-STATE")

The machine was not responding to ping. I went to 1Y3 and found that the crate was off. More precisely, the key was on but the power was off: I looked at the back and found that the power cord was loose from its inlet.
Once the cord was pushed in and the crate was keyed on, the white boxes came back online.

Just in case, I burtrestored these slow channels from the snapshot taken at 6:07am on Sunday.
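
The lookup described above (grep for the channel name in each slow machine's target directory) could be sketched in Python roughly as follows. This is only an illustration: it assumes one subdirectory per slow machine under the target directory and .db files with grecord(...) lines like the one in the grep output, and the function name find_channel_hosts is made up here.

```python
import os
import re

def find_channel_hosts(target_root, channel):
    """Return sorted names of subdirectories of target_root whose .db
    files declare `channel` in a grecord(...) line."""
    # Matches lines such as: grecord(ai,"C1:IFO-STATE")
    pattern = re.compile(r'grecord\(\s*\w+\s*,\s*"%s"\s*\)' % re.escape(channel))
    hosts = set()
    for host in os.listdir(target_root):
        host_dir = os.path.join(target_root, host)
        if not os.path.isdir(host_dir):
            continue
        for name in os.listdir(host_dir):
            if not name.endswith('.db'):
                continue
            with open(os.path.join(host_dir, name), errors='replace') as f:
                if any(pattern.search(line) for line in f):
                    hosts.add(host)
                    break  # one hit is enough for this host
    return sorted(hosts)
```

Something like find_channel_hosts('/cvs/cds/caltech/target', 'C1:IFO-STATE') would then name the machine to ping next.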

  7574   Thu Oct 18 08:00:40 2012   jamie   Update   Computers   Re: Lots of new White :(

Quote:

Solved. The power cord of c1iscaux was loose.
Has anyone worked around the back side of 1Y3?


I looked into the problem. I went through the channel lists for each of the slow machines and found that the missing variables are served by c1iscaux.

controls@pianosa:/cvs/cds/caltech/target/c1iscaux 0$ cd /cvs/cds/caltech/target/c1iscaux
controls@pianosa:/cvs/cds/caltech/target/c1iscaux 0$ grep C1:IF *
C1IFO_STATE.db:grecord(ai,"C1:IFO-STATE")

The machine was not responding to ping. I went to 1Y3 and found that the crate was off. More precisely, the key was on but the power was off: I looked at the back and found that the power cord was loose from its inlet.
Once the cord was pushed in and the crate was keyed on, the white boxes came back online.

Just in case, I burtrestored these slow channels from the snapshot taken at 6:07am on Sunday.

I was working around 1Y2 and 1Y3 when I wired the DAC in the c1lsc IO chassis in 1Y3 to the tip-tilt electronics in 1Y2.  I had to mess around in the back of 1Y3 to get it connected.  I obviously did not intend to touch anything else, but it's certainly possible that I did.

  7577   Fri Oct 19 00:55:35 2012   Jenne   Update   Computers   c1lsc is down (at least all of the models)

When Evan and I were dithering the BS and ITMY (see his elog), I noticed that c1lsc was acting weird.  The IOP was the only one with the blinky heartbeat.  The IOP was all green lights, but all the other models had red for the fb connection, as well as the rightmost indicator (I don't know what that one is for).  I logged on to c1lsc and ran 'rtcds restart all'.  The script didn't get anywhere beyond saying it was beginning to stop the 1st model (sup, the bottom one on the lsc list).  Then all of the cpus went white.  I can still ping c1lsc, but I can't ssh to it.

I'm not sure what to do here Jamie.  Heelp. 

  7746   Mon Nov 26 18:56:34 2012   Jenne   HowTo   Computers   Data logging suggestions

We've been talking for a while about how we want to store data.  I'm not in love with keeping it on the elog, although I think we should always be able to reference and go back and forth between the elogs and the data.

I have made a new folder: /data    EDIT: nevermind.  I want it to be on the file system just like /users, but I don't know how to do that.  Right now the folder is just on Ottavia. Jamie will help me tomorrow.

In this folder, we will save all of the data which goes into the elog. 

I propose that we should have a common format for the names of the data files, so that we can easily find things.

My proposal is that one begins one's elog regarding the data to be saved, and submits it immediately after putting in the first ~sentence or so. One should then make a new folder inside the data folder with a title "elog#####_Anything_Else_You_Want". Then, data (which was originally saved in one's own users folder) should be copied into the /data/elog#####_AnythingElse/ folder. Also in that folder should be any Matlab scripts used to create the plots that you post in the elog.  One should then edit the elog to continue making a regular, very thorough elog, including the path to the data.  The elog should include all of the information about the measurement, state of the IFO (or whatever you were measuring), etc. 
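
As a rough sketch of the naming and copying steps proposed above, something like the following could work. The function names are hypothetical, the elog number is written without zero padding (the "#####" in the proposal is taken as a placeholder), and the data root is passed in rather than assumed to be /data, since where the folder will live is still undecided.

```python
import os
import re
import shutil

def elog_data_dir(data_root, elog_id, title):
    """Build the folder path <data_root>/elog<ID>_<Title> for an elog entry."""
    # Replace runs of characters that are awkward in paths with underscores.
    safe_title = re.sub(r'[^A-Za-z0-9]+', '_', title).strip('_')
    return os.path.join(data_root, 'elog%d_%s' % (elog_id, safe_title))

def archive_elog_data(data_root, elog_id, title, files):
    """Copy data files and plotting scripts into the entry's folder."""
    dest = elog_data_dir(data_root, elog_id, title)
    os.makedirs(dest, exist_ok=True)
    for path in files:
        shutil.copy2(path, dest)  # copy2 keeps the original timestamps
    return dest
```

The returned path is what would then be pasted into the elog entry.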

Riju will be alpha-testing this procedure tonight.  EDIT: nevermind...see previous edit.

  7749   Tue Nov 27 00:26:00 2012   jamie   Omnistructure   Computers   Ubuntu update seems to have broken html input to elog on firefox

 After some system updates this evening, firefox can no longer handle the html input encoding for the elog.  I'm not sure what happened.  You can still use the "ELCode" or "plain" input encodings, but "HTML" won't work.  The problem seems to be firefox 17.  ottavia and rosalba were upgraded, while rossa and pianosa have not yet been.

I've installed chromium-browser (debranded chrome) on all the machines as a backup.  Hopefully the problem will clear itself up with the next update.  In the mean time I'll try to figure out what happened.

To use chromium: Applications -> Internet -> Chromium

  7757   Wed Nov 28 17:40:28 2012   jamie   Omnistructure   Computers   elog working again on firefox 17

Koji and I figured out what the problem is.  Apparently firefox 17.0 (specifically its user-agent string) breaks fckeditor, which is the javascript toolbox the elog uses for the wysiwyg text editor.  See https://support.mozilla.org/en-US/questions/942438.

The suspect line was in elog/fckeditor/editor/js/fckeditorcode_gecko.js.  I hacked it up so that it stopped whatever crappy conditional user agent crap it was doing.  It seems to be working now.

Edit by Koji: In order to make this change work, I needed to clear the firefox cache via the Tools/Clear Recent History menu.

  7786   Tue Dec 4 20:38:51 2012   jamie   Omnistructure   Computers   new (beta) version of nds2 installed on control room machines

I've installed the new nds2 packages on the control room machines.

These new packages include some new and improved interfaces for python, matlab, and octave that were not previously available. See the documentation in:

  /usr/share/doc/nds2-client-doc/html/index.html

for details on how to use them.  They all work something like:

  conn = nds2.connection('fb', 8088)
  chans = conn.findChannels()
  buffers = conn.fetch(t1, t2, {c1,...})
  data = buffers(1).getData()

NOTE: the new interface for python is distinct from the one provided by pynds.  The old pynds interface should continue to work, though.

To use the new matlab interface, you have to first issue the following command:

   javaaddpath('/usr/lib/java')

I'll try to figure out a way to have that included automatically.

The old Matlab mex functions (NDS*_GetData, NDS*_GetChannel, etc.) are now provided by a new and improved package.  Those should now work "out of the box".

  7788   Tue Dec 4 23:08:46 2012   Den   Omnistructure   Computers   new (beta) version of nds2 installed on control room machines

Quote:

I've installed the new nds2 packages on the control room machines.

 I've tried the new nds2 Java interface in Matlab. Using the findChannels method of the connection class I see only slow, DQ and trend channels. I could even download data online using the iterate method. When will it be possible to do the same with fast non-DQ channels?

>> conn = nds2.connection('fb', 8088);
>> conn.iterate({'C1:LSC-XARM_OUT'})
??? Java exception occurred:
java.lang.RuntimeException: No such channel.
    at nds2.nds2JNI.connection_iterate__SWIG_0(Native Method)
    at nds2.connection.iterate(connection.java:91)

  7791   Wed Dec 5 09:42:46 2012   rana   Omnistructure   Computers   new (beta) version of NDS2 installed on control room machines

NDS2 is not designed for non DQ channels - it gets data from the frames, not through NDS1.

For getting the non-DQ stuff, I would just continue using our NDS1 compatible NDS mex files (this is what is used in mDV).

  7793   Wed Dec 5 16:54:29 2012   jamie   Omnistructure   Computers   new (beta) version of NDS2 installed on control room machines

Quote:

NDS2 is not designed for non DQ channels - it gets data from the frames, not through NDS1.

For getting the non-DQ stuff, I would just continue using our NDS1 compatible NDS mex files (this is what is used in mDV).

The NDS2 protocol is not for non-DQ, but the NDS2 client is capable of talking both the NDS1 and NDS2 protocols.

fb:8088 is an NDS1 server, so the client is talking NDS1 to fb.  It should therefore be capable of getting online data.

It doesn't seem to be seeing the online channels, though, so I'll work with Leo to figure out what's going on there.

The old mex functions, which like I said are now available, aren't capable of getting online data.

  7805   Mon Dec 10 16:28:13 2012   jamie   Omnistructure   Computers   progressive retrieval of online data now possible with the new NDS2 client

Leo fixed an issue with the new nds2-client packages that was preventing it from retrieving online data.  It's working now from matlab, python, and octave.

Here's an example of a dataviewer-like script in python:

#!/usr/bin/python

import sys
import nds2
from pylab import *

# channels are command line arguments
channels = sys.argv[1:]

conn = nds2.connection('fb', 8088)

fig = figure()
fig.show()
for bufs in conn.iterate(channels):
    fig.clf()
    for buf in bufs:
        plot(buf.data)
    draw()

  7859   Wed Dec 19 20:18:51 2012   rana   Update   Computers   We are Changing the Passwerdz next week----

Be Prepared

http://xkcd.com/936/

  7920   Sat Jan 19 15:05:37 2013   Jenne   Update   Computers   All front ends but c1lsc are down

Message I get from dmesg of c1sus's IOP:

[   44.372986] c1x02: Triggered the ADC
[   68.200063] c1x02: Channel Hopping Detected on one or more ADC modules !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[   68.200064] c1x02: Check GDSTP screen ADC status bits to id affected ADC modules
[   68.200065] c1x02: Code is exiting ..............
[   68.200066] c1x02: exiting from fe_code()

Right now, c1x02's max cpu indicator reads 73,000 microseconds.  c1x05 is 4,300 usec, and c1x01 seems totally fine, except that it has the 0x2bad.

c1x02 has 0xbad (not 0x2bad).  All other models on c1sus, c1ioo, c1iscex and c1iscey all have 0x2bad.

Also, no models on those computers have 'heartbeats'.

c1x02 has "NO SYNC", but all other IOPs are fine.

I've tried rebooting c1sus, restarting the daqd process on fb, all to no avail.  I can ssh / ping all of the computers, but not get the models running.  Restarting the models also doesn't help.

upside_down_cat-t2.jpg

c1iscex's IOP dmesg:

[   38.626001] c1x01: Triggered the ADC
[   39.626001] c1x01: timeout 0 1000000
[   39.626001] c1x01: exiting from fe_code()

c1ioo's IOP has the same ADC channel hopping error as c1sus'.
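
When comparing dmesg output across several front ends like this, a small helper can condense it. The following is only a sketch: the two failure substrings are taken from the excerpts above, while the labels and the function name are made up for illustration and say nothing about the root cause.

```python
import re

# Substrings seen in the dmesg excerpts above, mapped to short labels.
KNOWN_FAILURES = [
    ('Channel Hopping Detected', 'channel_hopping'),
    ('timeout', 'timeout'),
]

# Matches lines such as: [   68.200063] c1x02: some message
LINE_RE = re.compile(r'^\[\s*([\d.]+)\]\s+(\w+):\s+(.*)$')

def summarize_iop_dmesg(text):
    """Return {model: label} for models that logged a known failure.
    Only the first known failure per model is kept."""
    failures = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        _, model, message = match.groups()
        for needle, label in KNOWN_FAILURES:
            if needle in message and model not in failures:
                failures[model] = label
    return failures
```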

 

  7922   Sat Jan 19 18:23:31 2013   rana   Update   Computers   All front ends but c1lsc are down

After sshing into several machines and doing 'sudo shutdown -r now', some of them came back and ran their processes.

After hitting the reset button on the RFM switch, their diagnostic lights came back. After restarting the Dolphin task on fb:

"sudo /etc/init.d/dis_networkmgr restart"

the Dolphin diagnostic lights came up green on the FE status screen.

iscex still wouldn't come up. The awgtpman tasks on there keep trying to start but then stop due to not finding ADCs.

 

Then I power cycled the IO Chassis for EX, and the awgtpman log files changed, but still no green lights. Then I tried a soft reboot on fb, and now it's not booting correctly.

Hardware lights are on, but I can't telnet into it. Tried power cycling it once or twice, but no luck.

Probably Jamie will have to hook up a keyboard and monitor to it, to find out why it's not booting.

FE.png

P.S. The snapshot scripts in the yellow button don't work and the MEDM screen itself is missing the time/date string on the top.

  7963   Wed Jan 30 13:50:27 2013   Jenne   Update   Computers   c1iscex still down

[Koji, Jenne]

We noticed that the iscex computer is still down, but the IOP is (was) running.  When we sat down to look at it, c1x01 was 'breathing', had a non-zero CPU_METER time, and the error was 0x4000, which I've never seen before.  The fb connection was still red though.  Also, it is claiming that its sync source is 1pps, not TDS like it usually is. 

Since things were different, Koji restarted the 2 other models running on iscex, with no resulting change.  We then did a 'rtcds restart all', and the IOP is no longer breathing, and the error message has changed to 0xbad.  The sync source is still 1pps.

Moral of the story:  c1iscex is still down, but temporarily showed signs of life that we wanted to record.

  7970   Thu Jan 31 10:23:39 2013   Jamie   Update   Computers   c1iscex still down

Quote:

[Koji, Jenne]

We noticed that the iscex computer is still down, but the IOP is (was) running.  When we sat down to look at it, c1x01 was 'breathing', had a non-zero CPU_METER time, and the error was 0x4000, which I've never seen before.  The fb connection was still red though.  Also, it is claiming that its sync source is 1pps, not TDS like it usually is. 

Since things were different, Koji restarted the 2 other models running on iscex, with no resulting change.  We then did a 'rtcds restart all', and the IOP is no longer breathing, and the error message has changed to 0xbad.  The sync source is still 1pps.

Moral of the story:  c1iscex is still down, but temporarily showed signs of life that we wanted to record.

There's definitely a timing issue with this machine.  I looked at it a bit yesterday.  I'll try to get to it by the end of the week.

  8036   Fri Feb 8 12:43:26 2013   yuta   Update   Computers   videocapture.py now supports movie capturing

I updated /opt/rtcds/caltech/c1/scripts/general/videoscripts.py so that it supports movie capturing. It saves captured images (bmp) and movies (mp4) in /users/sensoray/SensorayCaptures/ directory.
I also updated /opt/rtcds/caltech/c1/scripts/pylibs/pyndslib.py because /usr/bin/lalapps_tconvert is not working and now /usr/bin/tconvert works.
However, tconvert doesn't run on ottavia, so I need Jamie to fix it.

videocapture.py -h:
Usage:
    videocapture.py [cameraname] [options]

Example usage:
    videocapture.py MC2F -s 320x240 -t off
       (Camptures image of MC2F with the size of 320x240, without timestamp on the image. MUST RUN ON PIANOSA!)
    videocapture.py AS -m 10
       (Camptures 10 sec movie of AS with the size of 720x480. MUST RUN ON PIANOSA!)


Options:
  -h, --help          show this help message and exit
  -s SIZE             specify image size [default: 720x480]
  -t TIMESTAMP_ONOFF  timestamp on or off [default: on]
  -m MOVLENGTH        specity movie length (in sec; takes movie if specified) [default: 0]
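
For anyone wrapping or extending the script, the option set listed above could be mirrored with argparse as below. This is a reconstruction from the help text only, not the actual contents of videocapture.py; the dest names and the parse_size helper are assumptions.

```python
import argparse

def build_parser():
    """Parser mirroring the options shown in the help text above."""
    parser = argparse.ArgumentParser(
        prog='videocapture.py',
        description='Capture an image or a movie from a named camera.')
    # The usage line shows [cameraname], so it is treated as optional here.
    parser.add_argument('cameraname', nargs='?', default=None,
                        help='camera to capture, e.g. MC2F or AS')
    parser.add_argument('-s', dest='size', default='720x480',
                        help='image size as WIDTHxHEIGHT')
    parser.add_argument('-t', dest='timestamp', choices=['on', 'off'],
                        default='on', help='timestamp on or off')
    parser.add_argument('-m', dest='movlength', type=int, default=0,
                        help='movie length in seconds; 0 means still image')
    return parser

def parse_size(size):
    """Split 'WIDTHxHEIGHT' into a pair of ints."""
    width, height = size.lower().split('x')
    return int(width), int(height)
```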

  8062   Mon Feb 11 18:44:34 2013   Jamie   Update   Computers   passwerdz changed

Quote:

Be Prepared

http://xkcd.com/936/

Password for nodus and all control room workstations has been changed.  Look for new one in usual place.

We will try to change the password on all the RTS machines soon.  For the moment, though, they remain with the old passwerd.

  8088   Fri Feb 15 15:21:07 2013   Jamie   Update   Computers   c1iscex IO-chassis dead

It appears that the c1iscex IO-chassis is either dead or in a very bad state.  The PCIe interface card in the IO-chassis is showing four red lights, where it's supposed to be showing a dozen or so green lights.  Obviously this is going to prevent anything from running.

We've had power issues with this chassis before, so possibly that's what we're running into now.  I'll pull the chassis and diagnose asap.

 

  8140   Fri Feb 22 20:28:17 2013   Jamie   Update   Computers   linux1 dead, then undead

At around 2:30pm today something brought down most of the martian network.  All control room workstations, nodus, etc. were unresponsive.  After poking around for a bit I finally figured it had to be linux1, which serves the NFS filesystem for all the important CDS stuff.  linux1 was indeed completely unresponsive.

Looking closer I noticed that the Fibrenetix FX-606-U4 SCSI hardware RAID device connected to linux1 (see #1901), which holds the CDS network filesystem, was showing "IDE Channel #4 Error Reading" on its little LCD display.  I assumed this was the cause of the linux1 crash.

I hard shut down linux1, and powered off the Fibrenetix device.  I pulled the disk from slot 4 and replaced it with one of the spares we had in the control room cabinets.  I powered the device back up and it beeped for a while.  Unfortunately the device was requiring a password to access it from the front panel, and I could find no manual for the device in the lab, nor does the manufacturer offer the manual on its web site.

Eventually I was able to get linux1 fully rebooted (after some fscks) and it seemed to mount the hardware RAID (as /dev/sdc1) fine.  That brought the NFS back.  I had to reboot nodus to get it recovered, but all the control room and front-end linux machines seemed to recover on their own (although the front-ends did need an mxstream restart).

The remaining problem is that the linux1 hardware RAID device is still inaccessible from its front panel, and it's not clear to me that it has actually synced the new disk that I put in it.  In other words I have very little confidence that we actually have an operational RAID for /opt/rtcds.  I've contacted the LDAS guys (i.e. Dan Kozak) who are managing the 40m backup to confirm that the backup is legit.  In the meantime I'm going to spec out some replacement disks onto which to copy /opt/rtcds, and also so that we can get rid of this old SCSI RAID thing.

Attachment 1: FX-606-U4_1205.pdf
  8141   Sat Feb 23 00:34:28 2013   yuta   Update   Computers   crontab in op340m deleted and restored (maybe)

I accidentally overwrote crontab in op340m with an empty file.
By checking /var/cron in op340m, I think I restored it.

But somehow, autolockMCmain40m does not work as a cron job, so it is currently running under nohup.

What I did:
  1. I ssh-ed into op340m to edit the crontab and change the MC autolocker to the usual power mode. I used "crontab -e", but it did not show anything. I exited emacs and op340m.
  2. Rana found that the file size of crontab went 0 when I did "crontab -e".
  3. I found my elog #6899 and added one line to crontab

55 * * * *  /opt/rtcds/caltech/c1/scripts/general/scripto_cron /opt/rtcds/caltech/c1/scripts/MC/autolockMCmain40m >/cvs/cds/caltech/logs/scripts/mclock.cronlog 2>&1

  4. It didn't run correctly, so Rana used his hidden power "nohup" to run autolockMCmain40m in background.
  5. Koji's hidden magic "/var/cron/log" gave me inspiration about what was in crontab. So, I made a new crontab in op340m like this;

34 * * * *  /opt/rtcds/caltech/c1/scripts/general/scripto_cron /opt/rtcds/caltech/c1/scripts/MC/autolockMCmain40m >/cvs/cds/caltech/logs/scripts/mclock.cronlog 2>&1
55 * * * * /opt/rtcds/caltech/c1/scripts/general/scripto_cron /opt/rtcds/caltech/c1/scripts/PSL/FSS/RCthermalPID.pl >/cvs/cds/caltech/logs/scripts/RCthermalPID.cronlog 2>&1
07 * * * * /opt/rtcds/caltech/c1/scripts/general/scripto_cron /opt/rtcds/caltech/c1/scripts/PSL/FSS/FSSSlowServo >/cvs/cds/caltech/logs/scripts/FSSslow.cronlog 2>&1
00 * * * * /opt/rtcds/caltech/c1/burt/autoburt/burt.cron >> /opt/rtcds/caltech/c1/burt/burtcron.log
13 * * * * /cvs/cds/caltech/conlog/bin/check_conlogger_and_restart_if_dead
14,44 * * * * /opt/rtcds/caltech/c1/scripts/SUS/rampdown.pl > /dev/null 2>&1


  6. It looks like some of them started running, but I haven't checked if they are working or not. We need to look into them.

Moral of the story:
  crontab needs backup.

  8144   Sat Feb 23 14:04:07 2013   Koji   Update   Computers   apache restarted (Re: linux1 dead, then undead)

apache has been restarted.
How to: search "apache" on the 40m wiki

Quote:

I had to reboot nodus to get it recovered

 

  8146   Sat Feb 23 15:26:26 2013   yuta   Update   Computers   crontab in op340m updated

I found some daily cron jobs for op340m that I missed last night. Also, I edited the timings of the hourly jobs to maintain consistency with the past. Some of them look old, but I will leave them as they are for now.
At least burt, FSSSlowServo and autolockMCmain40m seem to be working now.
If you notice something is missing, please add it to crontab.

07 * * * * /opt/rtcds/caltech/c1/burt/autoburt/burt.cron >> /opt/rtcds/caltech/c1/burt/burtcron.log
13 * * * * /opt/rtcds/caltech/c1/scripts/general/scripto_cron /opt/rtcds/caltech/c1/scripts/PSL/FSS/FSSSlowServo >/cvs/cds/caltech/logs/scripts/FSSslow.cronlog 2>&1
14,44 * * * * /cvs/cds/caltech/conlog/bin/check_conlogger_and_restart_if_dead
15,45 * * * * /opt/rtcds/caltech/c1/scripts/SUS/rampdown.pl > /dev/null 2>&1
55 * * * *  /opt/rtcds/caltech/c1/scripts/general/scripto_cron /opt/rtcds/caltech/c1/scripts/MC/autolockMCmain40m >/cvs/cds/caltech/logs/scripts/mclock.cronlog 2>&1
59 * * * * /opt/rtcds/caltech/c1/scripts/general/scripto_cron /opt/rtcds/caltech/c1/scripts/PSL/FSS/RCthermalPID.pl >/cvs/cds/caltech/logs/scripts/RCthermalPID.cronlog 2>&1

00 0 * * * /var/scripts/ntp.sh > /dev/null 2>&1
00 4 * * * /opt/rtcds/caltech/c1/scripts/RGA/RGAlogger.cron >> /cvs/cds/caltech/users/rward/RGA/RGAcron.out 2>&1
00 6 * * * /cvs/cds/scripts/backupScripts.pl
00 7 * * * /opt/rtcds/caltech/c1/scripts/AutoUpdate/update_conlog.cron

  8147   Sat Feb 23 15:46:16 2013   rana   Update   Computers   crontab in op340m updated

According to Google, you can add a line in the crontab to backup the crontab by having the cronback.py script be in the scripts/ directory. It needs to save multiple copies, or else when someone makes the file size zero it will just write a zero size file onto the old backup.
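
A minimal sketch of such a backup, keeping multiple timestamped copies and refusing to save an empty crontab, might look like the following. It is not the cronback.py script mentioned above; the function names, the file naming scheme and the default of 30 retained copies are all assumptions for illustration.

```python
import os
import subprocess
import time

def save_crontab_backup(crontab_text, backup_dir, hostname, keep=30):
    """Write crontab_text to a new timestamped file in backup_dir.

    Returns the path written, or None if the crontab is empty, so that an
    accidentally emptied crontab never crowds out the good backups.
    """
    if not crontab_text.strip():
        return None
    os.makedirs(backup_dir, exist_ok=True)
    prefix = 'crontab_%s.' % hostname
    stamp = time.strftime('%Y%m%d%H%M%S')
    path = os.path.join(backup_dir, prefix + stamp)
    with open(path, 'w') as f:
        f.write(crontab_text)
    # Keep only the newest `keep` copies; the timestamps sort lexicographically.
    backups = sorted(n for n in os.listdir(backup_dir) if n.startswith(prefix))
    for old in backups[:-keep]:
        os.remove(os.path.join(backup_dir, old))
    return path

def read_crontab():
    """Return the current user's crontab as text (runs `crontab -l`)."""
    return subprocess.check_output(['crontab', '-l']).decode()
```

A daily cron entry could then call save_crontab_backup(read_crontab(), ...) with a directory under scripts/.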

  8181   Wed Feb 27 11:22:54 2013   yuta   Update   Computers   backup crontab

I made a simple script to backup crontab (/opt/rtcds/caltech/c1/scripts/crontab/backupCrontab).

#!/bin/bash

crontab -l > /opt/rtcds/caltech/c1/scripts/crontab/crontab_$(hostname).$(date '+%Y%m%d%H%M%S')


I put this script into op340m crontab.

00 8 * * * /opt/rtcds/caltech/c1/scripts/crontab/backupCrontab

It took me 30 minutes to write and check this one line script. I hate shell scripts.

  8266   Mon Mar 11 10:20:36 2013   Max Horton   Summary   Computers   Attempted Smart UPS 2200 Battery Replacement

Attempted Battery Replacement on Backup Power Supply in the Control Room:

I tried to replace the batteries in the Smart UPS 2200 with new batteries purchased by Steve.  However, the power port wasn't compatible with the batteries.  The battery cable's plug was too tall to fit properly into the Smart UPS port.  New batteries must be acquired.  Steve has pictures of the original battery (gray) and the new battery (blue) plugs, which look quite different (even though the company said the battery would fit).

The correct battery connector is GRAY: APC RBC55

Attachment 1: upsB.jpg
Attachment 2: upsBa.jpg
  8274   Tue Mar 12 00:35:56 2013   Jenne   Update   Computers   FB's RAID is beeping

[Manasa, Jenne]

Manasa just went inside to recenter the AS beam on the camera after our Yarm spot centering exercises of the evening, and heard a loud beeping.  We determined that it is the RAID attached to the framebuilder, which holds all of our frame data that is beeping incessantly.  The top center power switch on the back (there are FOUR power switches, and 3 power cables, btw.  That's a lot) had a red light next to it, so I power cycled the box.  After the box came back up, it started beeping again, with the same front panel message:

H/W monitor power #1 failed.

Right now the fb is trying to stay connected to things, and we can kind of use dataviewer, but we lose our connection to the framebuilder every ~30 seconds or so.  This rough timing estimate comes from how often we see the fb-related lights on the frontend status screen cycle from green to white to red back to green (or, how long do the lights stay green before going white again).  We weren't having trouble before the RAID went down a few minutes ago, so I'm hopeful that once that's fixed, the fb will be fine. 

In other news, just to make Jamie's day a little bit better, Dataviewer does not open on Pianosa or Rosalba.  The window opens, but it stays a blank grey box.  This has been going on for Pianosa for a few days, but it's new (to me at least) on Rosalba.  This is different from the lack of ability to connect to the fb that Rossa and Ottavia are seeing.

  8278   Tue Mar 12 12:06:22 2013   Jamie   Update   Computers   FB recovered, RAID power supply #1 dead

The framebuilder RAID is back online.  The disk had been mounted read-only (see below) so daqd couldn't write frames, which was in turn causing it to segfault immediately, so it was constantly restarting.

The jetstor RAID unit itself has a dead power supply.  This is not fatal, since it has three.  It has three so it can continue to function if one fails.  I have removed the bad supply and gave it to Steve so he can get a suitable replacement.

Some recovery had to be done on fb to get everything back up and running again.  I ran into issues trying to do it on the fly, so I eventually just rebooted.  It seemed to come back ok, except for something going on with daqd.  It was reporting the following error upon restart:

[Tue Mar 12 11:43:54 2013] main profiler warning: 0 empty blocks in the buffer

It was spitting out this message about once a second, until eventually the daqd died.  When it restarted it seemed to come back up fine.  I'm not exactly clear what those messages were about, but I think it has something to do with not being able to dump its data buffers to disk.  I'm guessing that this was a residual problem from the unmounted /frames, which somehow cleared on its own.  Everything seems to be ok now.
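
A quick way to spot the read-only remount described above is to look at the mount options. The sketch below parses /proc/mounts-style text; the function names are made up for illustration, and it assumes the frame disk is mounted at /frames as mentioned above.

```python
def mount_options(mounts_text, mount_point):
    """Return the option list for mount_point from /proc/mounts-style text,
    or None if it is not mounted. The last matching line wins, since later
    mounts shadow earlier ones."""
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, filesystem type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            options = fields[3].split(',')
    return options

def is_read_only(mounts_text, mount_point):
    """True if mount_point is mounted with the 'ro' option."""
    options = mount_options(mounts_text, mount_point)
    return options is not None and 'ro' in options
```

On fb this could be called as is_read_only(open('/proc/mounts').read(), '/frames').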

Quote:

Manasa just went inside to recenter the AS beam on the camera after our Yarm spot centering exercises of the evening, and heard a loud beeping.  We determined that it is the RAID attached to the framebuilder, which holds all of our frame data that is beeping incessantly.  The top center power switch on the back (there are FOUR power switches, and 3 power cables, btw.  That's a lot) had a red light next to it, so I power cycled the box.  After the box came back up, it started beeping again, with the same front panel message:

H/W monitor power #1 failed.

DO NOT DO THIS.  This is what caused all the problems.  The unit has three redundant power supplies, for just this reason.  It was probably continuing to function fine.  The beeping was just to tell you that there was something that needed attention.  Rebooting the device does nothing to solve the problem.  Rebooting in an attempt to silence beeping is not a solution.  Shutting off the RAID unit is basically the equivalent of ripping out a mounted external USB drive.  You can damage the filesystem that way.  The disk was still functioning properly.  As far as I understand it the only problem was the beeping, and there were no other issues.  After you hard rebooted the device, fb lost its mounted disk and then went into emergency mode, which was to remount the disk read-only.  It didn't understand what was going on, only that the disk seemed to disappear and then reappear.  This was then what caused the problems.  It was not the beeping, it was the restarting of the RAID that was mounted on fb.

Computers are not like regular pieces of hardware.  You can't just yank the power on them.  Worse yet is yanking the power on a device that is connected to a computer.  DON'T DO THIS UNLESS YOU KNOW WHAT YOU'RE DOING.  If the device is a disk drive, then doing this is a sure-fire way to damage data on disk.

 

  8280   Tue Mar 12 14:51:00 2013   Steve   Update   Computers   buy warranty or not ?

 Details of the warranties are posted on the wiki (power supply cost, warranty description, cost).

.......I’ve also attached a warranty renewal quote.  A 1 year warranty renewal is usually $.... per year, but we gave you special pricing of $.... / year if you renew both units.  This pricing is also special due to the fact that both warranties expired awhile ago.  We usually require that the warranty renewal begin on the date of expiration, but we will waive this for you this time if both are renewed.

 

JetStor SATA 416S, SN: SB09040111A3 – expired 04/24/2012 (3 years old)

 

JetStor SATA 516F, SN: SB09080016P – expired on 08/21/2012........

 

Are we keeping it for another 2 years? Buy the warranty or buy better storage?

 

  8324   Thu Mar 21 10:29:12 2013   Manasa   Update   Computers   Computers down since last night

I'm trying to figure out what went wrong last night. But the morning status...the computers are down.

 

Attachment 1: down.png
  8325   Thu Mar 21 12:04:05 2013 ManasaUpdateComputersFixed

All FE computers are back.

Restart procedure:

0a. Restart frame builder: telnet fb 8087, then type shutdown

0b. Restart mx_stream from the FE overview screen

1. I ssh'ed to each computer (c1lsc, c1ioo, c1iscex, c1iscey).

2. I used 'sudo shutdown -r (computername)'. They came back ON.

3. While rebooting c1ioo, c1sus shut down (for reasons I don't know). I could not ping or ssh c1SUS after this.

4. I went in and switched c1SUS computer OFF and back ON after which I could ssh to it.

5. I did the same reboot procedure for c1SUS.

6. I had to restart some of the models individually.
    (i) ssh to the computer running the model
    (ii) rtcds restart 'model name'

7. All computers are back now.
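The soft-restart part of the procedure above (steps 1, 2 and 6) can be written down as a dry-run plan. This is a sketch only: it prints the commands and does not run them, the "now" argument to shutdown is my assumption (the entry does not give the exact argument used), and the host-to-model mapping passed in is illustrative.

```python
def restart_plan(hosts_to_models):
    """Build an ordered list of shell commands for a soft restart.

    hosts_to_models maps a front-end host name to the models that did
    not come back by themselves and need an individual restart.
    Nothing is executed here; the list is for a human to review.
    """
    plan = []
    for host, models in hosts_to_models.items():
        # Step 2: clean reboot over ssh (never the power button).
        plan.append("ssh %s sudo shutdown -r now" % host)
        # Step 6: once the host is back, restart stragglers one by one.
        for model in models:
            plan.append("ssh %s rtcds restart %s" % (host, model))
    return plan


# Illustrative mapping only.
for cmd in restart_plan({"c1lsc": ["c1oaf"]}):
    print(cmd)
```

The commands for a host have to be run in order, waiting for the machine to answer ping and ssh again before any rtcds restart.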

up.png

  8326   Thu Mar 21 12:33:51 2013 ranaUpdateComputersFixed

Please stop power cycling computers. This is not an acceptable operation (as Jamie already wrote before). When you don't know what to do besides power cycling the computer, just stop and do something else or call someone who knows more. Every time you kill the power to a computer you are taking a chance on damaging it or corrupting some hard drive.

In this case, the right thing to do would be to hook up the external keyboard and monitor directly to c1sus to diagnose things.

NO MORE TOUCHING THE POWER BUTTON.

  8334   Mon Mar 25 09:52:22 2013 JenneUpdateComputersc1lsc mxstream won't restart

Most of the front ends' mx streams weren't running, so I did the old mxstreamrestart on all machines (see elog 6574....the dmesg on c1lsc right now, at the top, has similar messages).  Usually this mxstream restart works flawlessly, but today c1lsc isn't working.  Usually to the right side of the terminal window I get an [ok] when things work.  For the lsc machine today, I get [!!] instead. 

After having learned from recent lessons, I'm waiting to hear from Jamie.

  8335   Mon Mar 25 11:42:45 2013 JamieUpdateComputersc1lsc mx_stream ok

I'm not exactly sure what the problem was here, but I think it had to do with a stuck mx_stream process that wasn't being killed properly.  I manually killed the process and it seemed to come up fine after that.  The regular restart mechanisms should work now.
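The entry above does not say how the stuck process was found. As a hedged sketch of one way to do it, the function below picks the PIDs of a named process out of text in the format printed by "ps -eo pid,comm" (a header line, then one "PID COMMAND" pair per line). The function name and the sample listing are hypothetical.

```python
def pids_for(ps_text, name):
    """Return the PIDs whose command name equals `name`.

    ps_text is assumed to be the output of `ps -eo pid,comm`:
    the first line is a header, each later line is "<pid> <command>".
    """
    pids = []
    for line in ps_text.splitlines()[1:]:  # skip the header line
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[1].strip() == name:
            pids.append(int(parts[0]))
    return pids


# Made-up listing for illustration.
sample_ps = "  PID COMMAND\n    1 init\n 4242 mx_stream\n 4250 sshd\n"
print(pids_for(sample_ps, "mx_stream"))
```

The PIDs found this way could then be inspected and, if the process really is hung, killed by hand before the regular restart script is run.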

No idea what caused the process to hang in the first place, although I know the newer RCG (2.6) is supposed to address some of these mx_stream issues.

  8366   Thu Mar 28 10:44:30 2013 ManasaUpdateComputersc1lsc down

c1lsc was down this morning.

I restarted fb and c1lsc based on elog

Everything but c1oaf came back. I tried to restart c1oaf individually; but it didn't work.

Before:

cds_FE.png

After:

cds_fe1.png

  8367   Thu Mar 28 12:50:52 2013 JenneUpdateComputersc1lsc is fine

 Manasa told me that she did things in a different order than her old elog. 

She had

(1) ssh'ed to c1lsc and did a remote shutdown / restart,

(2) restarted fb,

(3) restarted the mxstream on c1lsc,

(4) restarted each model individually in some order that I forgot to ask.

However, with the situation as in her "before" screenshot, all that needed to be done was restart the mxstream process on c1lsc. 

Anyhow, when I looked at the OAF model, it was complaining of "no sync", so I restarted the model, and it came back up fine.  All is well again.

  8374   Fri Mar 29 17:24:43 2013 JamieUpdateComputersFB RAID power supply replaced

Steve ordered a replacement power supply for the FB JetStor power supply that failed a couple weeks ago.  I just installed it and it looks fine.

  8394   Tue Apr 2 20:52:35 2013 ranaUpdateComputersiMac bashed

 I changed the default shell on our control room iMac to bash. Since we're really, really using bash as the shell for LIGO, we might as well get used to it. As we do this for the workstations, some things will fail, but we can adopt Jamie's private .bashrc to get started and then fix it up later.

  8398   Wed Apr 3 01:32:04 2013 JenneUpdateComputersupdated EPICS database (channels selected for saving)

I modified /opt/rtcds/caltech/c1/chans/daq/C0EDCU.ini to include the C1:LSC-DegreeOfFreedom_TRIG_MON channels.  These are the same channels that cause the LSC screen trigger indicators to light up. 

I vaguely followed Koji's directions in elog 5991, although I didn't add new grecords, since these channels are already included in the .db file as a result of EpicsOut blocks in the simulink model.  So really, I only did Step 2.  I still need to restart the framebuilder, but locking (attempt at locking) is happening.

The idea here is that we should be able to search through this channel, and when we get a trigger, we can go back and plot useful signals (PDs, error signals, control signals,....), and try to figure out why we're losing lock. 

Rana tells me that this is similar to an old LockAcq script that would run DTT and get data.
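The search described above could look something like the sketch below. It assumes the TRIG_MON channel reads 1 while the degree of freedom is triggered and 0 otherwise, and that the samples have already been fetched into plain lists; neither the encoding nor the function name comes from the model.

```python
def lock_loss_times(times, trig, threshold=0.5):
    """Return the times at which the trigger drops from high to low.

    times and trig are equal-length sequences of sample times and
    trigger monitor values. A lock loss is taken to be a sample at or
    below threshold that directly follows a sample above it.
    """
    losses = []
    for i in range(1, len(trig)):
        if trig[i - 1] > threshold and trig[i] <= threshold:
            losses.append(times[i])
    return losses


# Toy data: locked at t=1..2, lost at t=3, locked at t=4, lost at t=5.
t = [0, 1, 2, 3, 4, 5]
trig = [0, 1, 1, 0, 1, 0]
print(lock_loss_times(t, trig))
```

Each returned time is a starting point for pulling a few seconds of PD, error and control signals from just before the drop.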

EDIT:  I restarted the daqd on the fb, and I now see the channel in dataviewer, but I can only get live data, no past data, even though it says that it is (16,float).  Here's what Dataviewer is telling me:

Connecting to NDS Server fb (TCP port 8088)
Connecting.... done
read(); errno=0
LONG: DataRead = -1
No data found

read(); errno=9
read(); errno=9
T0=13-03-29-08-59-43; Length=432010 (s)
No data output.

 

  8400   Wed Apr 3 14:45:34 2013 JamieUpdateComputersupdated EPICS database (channels selected for saving)

Quote:

I modified /opt/rtcds/caltech/c1/chans/daq/C0EDCU.ini to include the C1:LSC-DegreeOfFreedom_TRIG_MON channels.  These are the same channels that cause the LSC screen trigger indicators to light up. 

I vaguely followed Koji's directions in elog 5991, although I didn't add new grecords, since these channels are already included in the .db file as a result of EpicsOut blocks in the simulink model.  So really, I only did Step 2.  I still need to restart the framebuilder, but locking (attempt at locking) is happening.

The idea here is that we should be able to search through this channel, and when we get a trigger, we can go back and plot useful signals (PDs, error signals, control signals,....), and try to figure out why we're losing lock. 

Rana tells me that this is similar to an old LockAcq script that would run DTT and get data.

EDIT:  I restarted the daqd on the fb, and I now see the channel in dataviewer, but I can only get live data, no past data, even though it says that it is (16,float).  Here's what Dataviewer is telling me:

Connecting to NDS Server fb (TCP port 8088)
Connecting.... done
read(); errno=0
LONG: DataRead = -1
No data found

read(); errno=9
read(); errno=9
T0=13-03-29-08-59-43; Length=432010 (s)
No data output.
 

I seem to be able to retrieve these channels ok from the past:

controls@pianosa:/opt/rtcds/caltech/c1/scripts 0$ tconvert 1049050000
Apr 03 2013 18:46:24 UTC
controls@pianosa:/opt/rtcds/caltech/c1/scripts 0$ ./general/getdata -s 1049050000 -d 10 --noplot C1:LSC-PRCL_TRIG_MON
Connecting to server fb:8088 ...
nds_logging_init: Entrynds_logging_init: Exit
fetching... 1049050000.0
Hit any key to exit: 
controls@pianosa:/opt/rtcds/caltech/c1/scripts 0$ 

Maybe DTT just needed to be reloaded/restarted?
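For reference, the GPS-to-UTC conversion that tconvert performs in the session above can be checked by hand. The sketch below is not tconvert's implementation: it hard-codes the GPS-UTC offset of 16 leap seconds, which holds only for times between 2012-07-01 and 2015-06-30, and the function name is made up.

```python
from datetime import datetime, timedelta, timezone

# GPS time counts seconds from 1980-01-06 00:00:00 UTC and ignores leap seconds.
GPS_EPOCH = datetime(1980, 1, 6, tzinfo=timezone.utc)
# Assumption: fixed offset, valid from 2012-07-01 through 2015-06-30 only.
LEAP_SECONDS = 16


def gps_to_utc(gps_seconds):
    """Convert GPS seconds to a UTC datetime (2012-07 to 2015-06 only)."""
    return GPS_EPOCH + timedelta(seconds=gps_seconds - LEAP_SECONDS)


print(gps_to_utc(1049050000).strftime("%b %d %Y %H:%M:%S UTC"))
```

For 1049050000 this gives Apr 03 2013 18:46:24 UTC, the same time tconvert reported.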

  8444   Thu Apr 11 11:58:21 2013 JenneUpdateComputersLSC whitening c-code ready

The big hold-up with getting the LSC whitening triggering ready has been a problem with running the c-code on the front end models.  That problem has now been solved (Thanks Alex!), so I can move forward.

The background:

We want the RFPD whitening filters to be OFF while in acquisition mode, but after we lock, we want to turn the analog whitening (and the digital compensation) ON.  The difference between this and the other DoF and filter module triggers is that we must parse the input matrix to see which PD is being used for locking at that time.  It is the c-code that parses this matrix that has been causing trouble.  I have been testing this code on the c1tst.mdl, which runs on the Y-end computer.  Every time I tried to compile and run the c1tst model, the entire Y-end computer would crash.
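The matrix-parsing step can be illustrated as follows. This is a Python sketch of the logic only, not the c-code that runs on the front end; the matrix layout (one row per degree of freedom, one column per PD) and the PD names in the example are assumptions made for illustration.

```python
def pds_in_use(input_matrix, pd_names, dof_triggered):
    """Return the PDs that feed at least one triggered degree of freedom.

    input_matrix[d][p] is the weight of PD p in degree of freedom d.
    dof_triggered[d] is True when degree of freedom d is locked.
    A PD counts as in use if its matrix element is nonzero for a
    triggered degree of freedom; those are the PDs whose whitening
    (and digital compensation) would be switched on.
    """
    active = set()
    for d, row in enumerate(input_matrix):
        if not dof_triggered[d]:
            continue  # still acquiring: leave whitening off
        for p, weight in enumerate(row):
            if weight != 0:
                active.add(pd_names[p])
    return sorted(active)


# Illustrative 2-DoF, 3-PD example; names and values are made up.
matrix = [[1.0, 0.0, 0.0],
          [0.0, 0.0, 0.5]]
names = ["AS55_Q", "REFL11_I", "REFL33_I"]
print(pds_in_use(matrix, names, [True, False]))
```

With only the first degree of freedom triggered, just the PD in its row is selected; the PD used by the untriggered second row stays unwhitened.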

The solution:

Alex came over to look at things with Jamie and me.  In the 2.5 version of the RCG (which we are still using), there is an optimization flag "-O3" in the make file.  This optimization, while it can make models run a little faster, has been known in the past to cause problems.  Here at the 40m, our make files had an if-statement, so that the c1pem model would compile using the "-O" optimization flag instead, so clearly we had seen the problem here before, probably when Masha was here and running the neural network code on the pem model.  In the RCG 2.6 release, all models are compiled using the "-O" flag.  We tried compiling the c1tst model with this "-O" optimization, and the model started and the computer is just fine.  This solved the problem.

Since we are going to upgrade to RCG 2.6 in the near-ish future anyway, Alex changed our make files so that all models will now compile with the "-O" flag.  We should monitor other models when we recompile them, to make sure none of them start running long with the different optimization. 

The future:

Implement LSC whitening triggering!

  8479   Tue Apr 23 22:10:54 2013 ranaUpdateComputersNancy

controls@rosalba:/users/rana/docs 0$ svn resolve --accept working nancy
Resolved conflicted state of 'nancy'

  8529   Sat May 4 00:21:00 2013 ranaConfigurationComputersworkstation updates

 Koji and I went into "Update Manager" on several of the Ubuntu workstations and unselected the "check for updates" button. This is to prevent the machines from asking to be upgraded so frequently - I am concerned that someone might be tempted to upgrade the workstations to Ubuntu 12.

We didn't catch them all, so please take a moment to check that this is the case on all the laptops you are using and make it so. We can then apply the updates in a controlled manner once every few months.
