40m QIL Cryo_Lab CTN SUS_Lab TCS_Lab OMC_Lab CRIME_Lab FEA ENG_Labs OptContFac Mariner WBEEShop
  40m Log, Page 92 of 341  Not logged in ELOG logo
ID Date Author Type Categoryup Subject
  8278   Tue Mar 12 12:06:22 2013 JamieUpdateComputersFB recovered, RAID power supply #1 dead

The framebuilder RAID is back online.  The disk had been mounted read-only (see below) so daqd couldn't write frames, which was in turn causing it to segfault immediately, so it was constantly restarting.

The jetstor RAID unit itself has a dead power supply.  This is not fatal, since it has three.  It has three so it can continue to function if one fails.  I have removed the bad supply and gave it to Steve so he can get a suitable replacement.

Some recovery had to be done on fb to get everything back up and running again.  I ran into issues trying to do it on the fly, so I eventually just rebooted.  It seemed to come back ok, except for something going on with daqd.  It was reporting the following error upon restart:

[Tue Mar 12 11:43:54 2013] main profiler warning: 0 empty blocks in the buffer

It was spitting out this message about once a second, until eventually the daqd died.  When it restarted it seemed to come back up fine.  I'm not exactly clear what those messages were about, but I think it has something to do with not being able to dump it's data buffers to disk.  I'm guessing that this was a residual problem from the umounted /frames, which somehow cleared on it's own.  Everything seems to be ok now.

Quote:

Manasa just went inside to recenter the AS beam on the camera after our Yarm spot centering exercises of the evening, and heard a loud beeping.  We determined that it is the RAID attached to the framebuilder, which holds all of our frame data that is beeping incessantly.  The top center power switch on the back (there are FOUR power switches, and 3 power cables, btw.  That's a lot) had a red light next to it, so I power cycled the box.  After the box came back up, it started beeping again, with the same front panel message:

H/W monitor power #1 failed.

DO NOT DO THIS.  This is what caused all the problems.  The unit has three redundant power supplies, for just this reason.  It was probably continuing to function fine.  The beeping was just to tell you that there was something that needed attention.  Rebooting the device does nothing to solve the problem.  Rebooting in an attempt to silence beeping is not a solution.  Shutting of the RAID unit is basically the equivalent of ripping out a mounted external USB drive.  You can damage the filesystem that way.  The disk was still functioning properly.  As far as I understand it the only problem was the beeping, and there were no other issues.  After you hard rebooted the device, fb lost it's mounted disk and then went into emergency mode, which was to remount the disk read-only.  It didn't understand what was going on, only that the disk seemed to disappear and the reappear.  This was then what caused the problems.  It was not the beeping, it was the restarting the RAID that was mounted on fb.

Computers are not like regular pieces of hardware.  You can't just yank the power on them.  Worse yet is yanking the power on a device that is connected to a computer.  DON"T DO THIS UNLESS YOU KNOW WHAT YOU"RE DOING.  If the device is a disk drive, then doing this is a sure-fire way to damage data on disk.

 

  8280   Tue Mar 12 14:51:00 2013 SteveUpdateComputersbuy warranty or not ?

 Details of the warranties are posted on wiki power supply cost, warranty described, cost

.......I’ve also attached a warranty renewal quote.  A 1 year warranty renewal is usually $.... per year, but we gave you special pricing of $.... / year if you renew both units.  This pricing is also special due to the fact that both warranties expired awhile ago.  We usually require that the warranty renewal begin on the date of expiration, but we will waive this for you this time if both are renewed.

 

JetStor SATA 416S, SN: SB09040111A3 – expired 04/24/2012 (3 years old)

 

JetStor SATA 516F, SN: SB09080016P – expired on 08/21/2012........

 

. Are we keep it for an other 2 years? buy warranty or buy better storage.

 

  8324   Thu Mar 21 10:29:12 2013 ManasaUpdateComputersComputers down since last night

I'm trying to figure out what went wrong last night. But the morning status...the computers are down.

 

down.png

Attachment 1: down.png
down.png
  8325   Thu Mar 21 12:04:05 2013 ManasaUpdateComputersFixed

All FE computers are back.

Restart procedure:

0a. Restart frame builder: telnet fb 8087 & type shutdown

0b. Restart mx_stream from the FE overview screen

1. I ssh ed to the computer. (c1lsc, c1ioo, c1iscex, c1isey)

2. I used 'sudo shutdown -r (computername)'. They came back ON.

3. While rebooting c1ioo, c1sus shutdown (for reasons I don't know). I could not ping or ssh c1SUS after this.

4. I went in and switched c1SUS computer OFF and back ON after which I could ssh to it.

5. I did the same reboot procedure for c1SUS.

6. I had to restart some of the models individually.
    (i) ssh to the computer running the model
    (ii) rtcds restart 'model name'

7. All computers are back now.

up.png

  8326   Thu Mar 21 12:33:51 2013 ranaUpdateComputersFixed

Please stop power cycling computers. This is not an acceptable operation (as Jamie already wrote before). When you don't know what to do besides power cycling the computer, just stop and do something else or call someone who knows more. Every time you kill the power to a computer you are taking a chance on damaging it or corrupting some hard drive.

In this case, the right thing to do would be to hook up the external keyboard and monitor directly to c1sus to diagnose things.

NO MORE TOUCHING THE POWER BUTTON.

  8334   Mon Mar 25 09:52:22 2013 JenneUpdateComputersc1lsc mxstream won't restart

Most of the front ends' mx streams weren't running, so I did the old mxstreamrestart on all machines (see elog 6574....the dmesg on c1lsc right now, at the top, has similar messages).  Usually this mxstream restart works flawlessly, but today c1lsc isn't working.  Usually to the right side of the terminal window I get an [ok] when things work.  For the lsc machine today, I get [!!] instead. 

After having learned from recent lessons, I'm waiting to hear from Jamie.

  8335   Mon Mar 25 11:42:45 2013 JamieUpdateComputersc1lsc mx_stream ok

I'm not exactly sure what the problem was here, but I think it had to do with a stuck mx_stream process that wasn't being killed properly.  I manually killed the process and it seemed to come up fine after that.  The regular restart mechanisms should work now.

No idea what caused the process to hang in the first place, although I know the newer RCG (2.6) is supposed to address some of these mx_stream issues.

  8366   Thu Mar 28 10:44:30 2013 ManasaUpdateComputersc1lsc down

c1lsc was down this morning.

I restarted fb and c1lsc based on elog

Everything but c1oaf came back. I tried to restart c1oaf individually; but it didn't work.

Before:

cds_FE.png

After:

cds_fe1.png

  8367   Thu Mar 28 12:50:52 2013 JenneUpdateComputersc1lsc is fine

 Manasa told me that she did things in a different order than her old elog. 

She had

(1) ssh'ed to c1lsc and did a remote shutdown / restart,

(2) restarted fb,

(3) restarted the mxstream on c1lsc,

(4) restarted each model individually in some order that I forgot to ask.

However, with the situation as in her "before" screenshot, all that needed to be done was restart the mxstream process on c1lsc. 

Anyhow, when I looked at the OAF model, it was complaining of "no sync", so I restarted the model, and it came back up fine.  All is well again.

  8374   Fri Mar 29 17:24:43 2013 JamieUpdateComputersFB RAID power supply replaced

Steve ordered a replacement power supply for the FB JetStor power supply that failed a couple weeks ago.  I just installed it and it looks fine.

  8394   Tue Apr 2 20:52:35 2013 ranaUpdateComputersiMac bashed

 I changed the default shell on our control room iMac to bash. Since we're really, really using bash as the shell for LIGO, we might as well get used to it. As we do this for the workstations, some things will fail, but we can adopt Jamie's private .bashrc to get started and then fix it up later.

  8398   Wed Apr 3 01:32:04 2013 JenneUpdateComputersupdated EPICS database (channels selected for saving)

I modified /opt/rtcds/caltech/c1/chans/daq/C0EDCU.ini to include the C1:LSC-DegreeOfFreedom_TRIG_MON channels.  These are the same channel that cause the LSC screen trigger indicators to light up. 

I vaguely followed Koji's directions in elog 5991, although I didn't add new grecords, since these channels are already included in the .db file as a result of EpicsOut blocks in the simulink model.  So really, I only did Step 2.  I still need to restart the framebuilder, but locking (attempt at locking) is happening.

The idea here is that we should be able to search through this channel, and when we get a trigger, we can go back and plot useful signals (PDs, error signals, cotrol signals,....), and try to figure out why we're losing lock. 

Rana tells me that this is similar to an old LockAcq script that would run DTT and get data.

EDIT:  I restarted the daqd on the fb, and I now see the channel in dataviewer, but I can only get live data, no past data, even though it says that it is (16,float).  Here's what Dataviewer is telling me:

Connecting to NDS Server fb (TCP port 8088)
Connecting.... done
read(); errno=0
LONG: DataRead = -1
No data found

read(); errno=9
read(); errno=9
T0=13-03-29-08-59-43; Length=432010 (s)
No data output.

 

  8400   Wed Apr 3 14:45:34 2013 JamieUpdateComputersupdated EPICS database (channels selected for saving)

Quote:

I modified /opt/rtcds/caltech/c1/chans/daq/C0EDCU.ini to include the C1:LSC-DegreeOfFreedom_TRIG_MON channels.  These are the same channel that cause the LSC screen trigger indicators to light up. 

I vaguely followed Koji's directions in elog 5991, although I didn't add new grecords, since these channels are already included in the .db file as a result of EpicsOut blocks in the simulink model.  So really, I only did Step 2.  I still need to restart the framebuilder, but locking (attempt at locking) is happening.

The idea here is that we should be able to search through this channel, and when we get a trigger, we can go back and plot useful signals (PDs, error signals, cotrol signals,....), and try to figure out why we're losing lock. 

Rana tells me that this is similar to an old LockAcq script that would run DTT and get data.

EDIT:  I restarted the daqd on the fb, and I now see the channel in dataviewer, but I can only get live data, no past data, even though it says that it is (16,float).  Here's what Dataviewer is telling me:

Connecting to NDS Server fb (TCP port 8088)
Connecting.... done
read(); errno=0
LONG: DataRead = -1
No data found

read(); errno=9
read(); errno=9
T0=13-03-29-08-59-43; Length=432010 (s)
No data output.
 

I seem to be able to retrieve these channels ok from the past:

controls@pianosa:/opt/rtcds/caltech/c1/scripts 0$ tconvert 1049050000
Apr 03 2013 18:46:24 UTC
controls@pianosa:/opt/rtcds/caltech/c1/scripts 0$ ./general/getdata -s 1049050000 -d 10 --noplot C1:LSC-PRCL_TRIG_MON
Connecting to server fb:8088 ...
nds_logging_init: Entrynds_logging_init: Exit
fetching... 1049050000.0
Hit any key to exit: 
controls@pianosa:/opt/rtcds/caltech/c1/scripts 0$ 

Maybe DTT just needed to be reloaded/restarted?

  8444   Thu Apr 11 11:58:21 2013 JenneUpdateComputersLSC whitening c-code ready

The big hold-up with getting the LSC whitening triggering ready has been a problem with running the c-code on the front end models.  That problem has now been solved (Thanks Alex!), so I can move forward.

The background:

We want the RFPD whitening filters to be OFF while in acquisition mode, but after we lock, we want to turn the analog whitening (and the digital compensation) ON.  The difference between this and the other DoF and filter module triggers is that we must parse the input matrix to see which PD is being used for locking at that time.  It is the c-code that parses this matrix that has been causing trouble.  I have been testing this code on the c1tst.mdl, which runs on the Y-end computer.  Every time I tried to compile and run the c1tst model, the entire Y-end computer would crash.

The solution:

Alex came over to look at things with Jamie and me.  In the 2.5 version of the RCG (which we are still using), there is an optimization flag "-O3" in the make file.  This optimization, while it can make models run a little faster, has been known in the past to cause problems.  Here at the 40m, our make files had an if-statement, so that the c1pem model would compile using the "-O" optimization flag instead, so clearly we had seen the problem here before, probably when Masha was here and running the neural network code on the pem model.  In the RCG 2.6 release, all models are compiled using the "-O" flag.  We tried compiling the c1tst model with this "-O" optimization, and the model started and the computer is just fine.  This solved the problem.

Since we are going to upgrade to RCG 2.6 in the near-ish future anyway, Alex changed our make files so that all models will now compile with the "-O" flagWe should monitor other models when we recompile them, to make sure none of them start running long with the different optimization. 

The future:

Implement LSC whitening triggering!

  8479   Tue Apr 23 22:10:54 2013 ranaUpdateComputersNancy

controls@rosalba:/users/rana/docs 0$ svn resolve --accept working nancy
Resolved conflicted state of 'nancy'

  8529   Sat May 4 00:21:00 2013 ranaConfigurationComputersworkstation updates

 Koji and I went into "Update Manager" on several of the Ubuntu workstations and unselected the "check for updates" button. This is to prevent the machines from asking to be upgraded so frequently - I am concerned that someone might be tempted to upgrade the workstations to Ubuntu 12.

We didn't catch them all, so please take a moment to check that this is the case on all the laptops you are using and make it so. We can then apply the updates in a controlled manner once every few months.

  8540   Tue May 7 17:43:51 2013 JamieUpdateComputers40MARS wireless network problems

I'm not sure what's going on today but we're seeing ~80% packet loss on the 40MARS wireless network.  This is obviously causing big problems for all of our wirelessly connected machines.  The wired network seems to be fine.

I've tried power cycling the wireless router but it didn't seem to help.  Not sure what's going on, or how it got this way.  Investigating...

  8541   Tue May 7 18:16:37 2013 JamieUpdateComputers40MARS wireless network problems

Here's an example of the total horribleness of what's happening right now:

controls@rossa:~ 0$ ping 192.168.113.222
PING 192.168.113.222 (192.168.113.222) 56(84) bytes of data.
From 192.168.113.215 icmp_seq=2 Destination Host Unreachable
From 192.168.113.215 icmp_seq=3 Destination Host Unreachable
From 192.168.113.215 icmp_seq=4 Destination Host Unreachable
From 192.168.113.215 icmp_seq=5 Destination Host Unreachable
From 192.168.113.215 icmp_seq=6 Destination Host Unreachable
From 192.168.113.215 icmp_seq=7 Destination Host Unreachable
From 192.168.113.215 icmp_seq=9 Destination Host Unreachable
From 192.168.113.215 icmp_seq=10 Destination Host Unreachable
From 192.168.113.215 icmp_seq=11 Destination Host Unreachable
64 bytes from 192.168.113.222: icmp_seq=12 ttl=64 time=10341 ms
64 bytes from 192.168.113.222: icmp_seq=13 ttl=64 time=10335 ms
^C
--- 192.168.113.222 ping statistics ---
35 packets transmitted, 2 received, +9 errors, 94% packet loss, time 34021ms
rtt min/avg/max/mdev = 10335.309/10338.322/10341.336/4.406 ms, pipe 11
controls@rossa:~ 0$ 

Note that 10 SECOND round trip time and 94% packet loss.  That's just beyond stupid.  I have no idea what's going on.

  12721   Mon Jan 16 12:49:06 2017 ranaConfigurationComputersMegatron update

The "apt-get update" was failing on some machines because it couldn't find the 'Debian squeeze' repos, so I made some changes so that Megatron could be upgraded.

I think Jamie set this up for us a long time ago, but now the LSC has stopped supporting these versions of the software. We're running Ubuntu12 and 'squeeze' is meant to support Ubuntu10. Ubuntu12 (which is what LLO is running) corresponds to 'Debian-wheezy' and Ubuntu14 to 'Debian-Jessie' and Ubuntu16 to 'debian-stretch'.

We should consider upgrading a few of our workstations to Ubuntu 14 LTS to see how painful it is to run our scripts and DTT and DV. Better to upgrade a bit before we are forced to by circumstance.

I followed the instructions from software.ligo.org (https://wiki.ligo.org/DASWG/DebianWheezy) and put the recommended lines into the /etc/apt/sources.list.d/lsc-debian.list file.

but I still got 1 error (previously there were ~7 errors):

W: Failed to fetch http://software.ligo.org/lscsoft/debian/dists/wheezy/Release  Unable to find expected entry 'contrib/binary-i386/Packages' in Release file (Wrong sources.list entry or malformed file)

Restarting now to see if things work. If its OK, we ought to change our squeeze lines into wheezy for all workstations so that our LSC software can be upgraded.

  12724   Mon Jan 16 22:03:30 2017 jamieConfigurationComputersMegatron update
Quote:
 

We should consider upgrading a few of our workstations to Ubuntu 14 LTS to see how painful it is to run our scripts and DTT and DV. Better to upgrade a bit before we are forced to by circumstance.

I would recommend upgrading the workstations to one of the reference operating systems, either SL7 or Debian squeeze, since that's what the sites are moving towards.  If you do that you can just install all the control room software from the supported repos, and not worry about having to compile things from source anymore.

  12849   Thu Feb 23 15:48:43 2017 johannesUpdateComputersc1psl un-bootable

Using the PDA520 detector on the AS port I tried to get some better estimates for the round-trip loss in both arms. While setting up the measurement I noticed some strange output on the scope I'm using to measure the amount of reflected light.

The interferometer was aligned using the dither scripts for both arms. Then, ITMY was majorly misaligned in pitch AND yaw such that the PD reading did not change anymore. Thus, only light reflected from the XARM was incident of the AS PD. The scope was showing strange oscillations (Channel 2 is the AS PD signal):

For the measurement we compare the DC level of the reflection with the ETM aligned (and the arm locked) vs a misaligned ETM (only ITM reflection). This ringing could be observed in both states, and was qualitatively reproducible with the other arm. It did not show up in the MC or ARM transmission. I found that changing the pitch of the 'active' ITM (=of the arm under investigation) either way by just a couple of ticks made it go away and settle roughly at the lower bound of the oscillation:

In this configuration the PD output follows the mode cleaner transmission (Channel 3 in the screen caps) quite well, but we can't take the differential measurement like this, because it is impossible to align and lock the arm but them misalign the ITM. Moving the respective other ITM for potential secondary beams did not seem to have an obvious effect, although I do suspect a ghost/secondary beam to be the culprit for this. I moved the PDA520 on the optical table but didn't see a change in the ringing amplitude. I do need to check the PD reflection though.

Obviously it will be hard to determine the arm loss this way, but for now I used the averaging function of the scope to get rid of the ringing. What this gave me was:
(16 +/- 9) ppm losses in the x-arm and (-18+/-8) ppm losses in the y-arm

The negative loss obviously makes little sense, and even the x-arm number seems a little too low to be true. I strongly suspect the ringing is responsible and wanted to investigate this further today, but a problem with c1psl came up that shut down all work on this until it is fixed:

I found the PMC unlocked this morning and c1psl (amongst other slow machines) was unresponsive, so I power-cycled them. All except c1psl came back to normal operation. The PMC transmission, as recorded by c1psl,  shows that it has been down for several days:

Repeated attempts to reset and/or power-cycle it by Gautam and myself could not bring it back. The fail indicator LED of a single daughter card (the DOUT XVME-212) turns off after reboot, all others stay lit. The sysfail LED on the crate is also on, but according to elog 10015 this is 'normal'. I'm following up that post's elog tree to monitor the startup of c1psl through its system console via a serial connection to find out what is wrong.

  12850   Thu Feb 23 18:52:53 2017 ranaUpdateComputersc1psl un-bootable

The fringes seen on the oscope are mostly likely due to the interference from multiple light beams. If there are laser beams hitting mirrors which are moving, the resultant interference signal could be modulated at several Hertz, if, for example, one of the mirrors had its local damping disabled.

  12851   Thu Feb 23 19:44:48 2017 johannesUpdateComputersc1psl un-bootable

Yes, that was one of the things that I wanted to look into. One thing Gautam and I did that I didn't mention was to reconnect the SRM satellite box and move the optic around a bit, which didn't change anything. Once the c1psl problem is fixed we'll resume with that.

Quote:

The fringes seen on the oscope are mostly likely due to the interference from multiple light beams. If there are laser beams hitting mirrors which are moving, the resultant interference signal could be modulated at several Hertz, if, for example, one of the mirrors had its local damping disabled.

 

Speaking of which:

Using one of the grey RJ45 to D-Sub cables with an RS232 to USB adapter I was able to capture the startup log of c1psl (using the usb camera windows laptop). I also logged the startup of the "healthy" c1aux, both are attached. c1psl stalls at a point were c1aux starts testing for present vme modules and doesn't continue, however is not strictly hung up, as it still registers to the logger when external login attempts via telnet occur. The telnet client simply reports that the "shell is locked" and exits. It is possible that one of the daughter cards causes this. This seems to happen after iocInit is called by the startup script at /cvs/cds/caltech/target/c1psl/startup.cmd, as it never gets to the next item "coreRelease()". Gautam and I were trying to find out what happends inside iocInit, but it's not clear to us at this point from where it is even called. iocInit.c and compiled binaries exist in several places on the shared drive. However, all belong to R3.14.x epics releases, while the logfile states that the R3.12.2 epics core is used when iocInit is called.

Next we'll interrupt the autoboot procedure and try to work with the machine directly.

Attachment 1: slow_startup_logs.tar.gz
  12852   Fri Feb 24 20:38:01 2017 johannesUpdateComputersc1psl boot-stall culprit identified

[Gautam, Johannes]

c1psl finally booted up again, PMC and IMC are locked.

Trying to identify the hickup from the source code was fruitless. However, since the PMCTRANSPD channel acqusition failure occured long before the actual slow machine crashed, and since the hickup in the boot seemed to indicate a problem with daughter module identification, we started removing the DIO and DAQ modules:

  1. Started with the ones whose fail LED stayed lit during the boot process: the DIN (XVME-212) and the three DACs (VMIVME4113). No change.
  2. Also removed the DOUT (XVME-220) and the two ADCs (VMIVME 3113A and VMIVME3123). It boots just fine and can be telnetted into!
  3. Pushed the DIN and the DACs back in. Still boots.
  4. Pushed only VMIVME3123 back in. Boot stalls again.
  5. Removed VMIVME3123, pushed VMIVME 3113A back in. Boots successfully.
  6. Left VMIVME3123 loose in the crate without electrical contact for now.
  7. Proceeded to lock PMC and IMC

The particle counter channel should be working again.

  • VMIVME3123 is a 16-Bit High-Throughput Analog Input Board, 16 Channels with Simultaneous Sample-and-Hold Inputs
  • VMIVME3113A is a Scanning 12-Bit Analog-to-Digital Converter Module with 64 channels

/cvs/cds/caltech/target/c1psl/psl.db lists the following channels for VMIVME3123:

Channels currently in use (and therefore not available in the medm screens):

  • C1:PSL-FSS_SLOW_MON
  • C1:PSL-PMC_PMCERR
  • C1:PSL-FSS_SLOWM
  • C1:PSL-FSS_MIXERM
  • C1:PSL-FSS_RMTEMP
  • C1:PSL-PMC_PMCTRANSPD

Channels not currently in use (?):

  • C1:PSL-FSS_MINCOMEAS
  • C1:PSL-FSS_RCTRANSPD
  • C1:PSL-126MOPA_126MON
  • C1:PSL-126MOPA_AMPMON
  • C1:PSL-FSS_TIDALINPUT
  • C1:PSL-FSS_TIDALSET
  • C1:PSL-FSS_RCTEMP
  • C1:PSL-PPKTP_TEMP

There are plenty of channels available on the asynchronous ADC, so we could wire the relevant ones there if we done care about the 16 bit synchronous sampling (required for proper functionality?)

Alternatively, we could prioritize the Acromag upgrade on c1psl (DAQ would still be asynchronous, though). The PCBs are coming in next Monday and the front panels on Tuesday.

 

 

Some more info that might come in handy to someone someday:

The (nameless?) Windows 7 laptop that lives near MC2 and is used for the USB microscope was used for interfacing with c1psl. No special drivers were necessary to use the USB to RS232 adapter, and the RJ45 end of the grey homemade DB9 to RJ45 cable was plugged into the top port which is labeled "console 1". I downloaded the program "CoolTerm" from http://freeware.the-meiers.org/#CoolTerm, which is a serial protocol emulator, and it worked out of the box with the adapter. The standard settings fine worked for communicating with c1psl, only a small modification was necessary: in Options>Terminal make sure that "Enter Key Emulation" is set from "CR+LF" to "CR", otherwise each time 'Enter' is pressed it is actually sent twice.

  12854   Tue Feb 28 01:28:52 2017 johannesUpdateComputersc1psl un-bootable

It turned out the 'ringing' was caused by the respective other ETM still being aligned. For these reflection measurements both test masses of the other arm need to be misaligned. For the ETM it's sufficient to use the Misalign button in the medm screens, while the ITM has to be manually misaligned to move the reflected beam off the PD.

I did another round of armloss measurements today. I encountered some problems along the way

  • Some time today (around 6pm) most of the front end models had crashed and needed to be restarted GV: actually it was only the models on c1lsc that had crashed. I noticed this on Friday too.
  • ETMX keeps getting kicked up seemingly randomly. However, it settles fast into it's original position.

General Stuff:

  • Oscilloscope should sample both MC power (from MC2 transmitted beam) and AS signal
  • Channel data can only be loaded from the scope one channel at a time, so 'stop' scope acquisition and then grab the relevant channels individually
  • Averaging needs to be restarted everytime the mirrors are moved triggering stop and run remotely via the http interface scripts does this.

Procedure:

  1.     Run LSC Offsets
  2.     With the PSL shutter closed measure scope channel dark offsets, then open shutter
  3.     Align all four test masses with dithering to make sure the IFO alignment is in a known state
  4.     Pick an arm to measure
  5.     Turn the other arm's dither alignment off
  6.     'Misalign' that arm's ETM using medm screen button
  7.     Misalign that arm's ITM manually after disabling its OpLev servos looking at the AS port camera and make sure it doesn't hit the PD anymore.
  8.     Disable dithering for primary arm
  9.     Record MC and AS time series from (paused) scope
  10.     Misalign primary ETM
  11.     Repeat scope data recording

Each pair of readings gives the reflected power at the AS port normalized to the IMC stored power:

\widehat{P}=\frac{P_{AS}-\overline{P}_{AS}^\mathrm{dark}}{P_{MC}-\overline{P}_{MC}^\mathrm{dark}}

which is then averaged. The loss is calculated from the ratio of reflected power in the locked (L) vs misaligned (M) state from

\mathcal{L}=\frac{T_1}{4\gamma}\left[1-\frac{\overline{\widehat{P}_L}}{\overline{\widehat{P}_M}} +T_1\right ]-T_2

Acquiring data this way yielded P_L/P_M=1.00507 +/- 0.00087 for the X arm and P_L/P_M=1.00753 +/- 0.00095 for the Y arm. With \gamma_x=0.832 and \gamma_x=0.875 (from m1=0.179, m2=0.226 and 91.2% and 86.7% mode matching in X and Y arm, respectively) this yields round trip losses of:

\mathcal{L}_X=21\pm4\,\mathrm{ppm}  and  \mathcal{L}_Y=13\pm4\,\mathrm{ppm}, which is assuming a generalized 1% error in test mass transmissivities and modulation indices. As we discussed, this seems a little too good to be true, but at least the numbers are not negative.

  12943   Thu Apr 13 21:01:20 2017 ranaConfigurationComputersLG UltraWide on Rossa

we installed a new curved 34" doublewide monitor on Rossa, but it seems like it has a defective dead pixel region in it. Unless it heals itself by morning, we should return it to Amazon. Please don't throw out he packing materials.

Steve 8am next morning: it is still bad The monitor is cracked. It got kicked while traveling. It's box is damaged the same place.

Shipped back 4-17-2017

Attachment 1: LG34c.jpg
LG34c.jpg
Attachment 2: crack.jpg
crack.jpg
  12965   Wed May 3 16:12:36 2017 johannesConfigurationComputerscatastrophic multiple monitor failures

It seems we lost three monitors basically overnight.

The main (landscape, left) displays of Pianosa, Rossa and Allegra are all broken with the same failure mode:

their backlights failed. Gautam and I confirmed that there is still an image displayed on all three, just incredibly faint. While Allegra hasn't been used much, we can narrow down that Pianosa's and Rossa's monitors must have failed within 5 or 6 hours of each other, last night.

One could say ... they turned to the dark side cool

Quick edit; There was a functioning Dell 24" monitor next to the iMac that we used as a replacement for Pianosa's primary display. Once the new curved display is paired with Rossa we can use its old display for Donatella or Allegra.

  12966   Wed May 3 16:46:18 2017 KojiConfigurationComputerscatastrophic multiple monitor failures

- Is there any machine that can handle 4K? I have one 4K LCD for no use.
- I also can donate one 24" Dell

  12971   Thu May 4 09:52:43 2017 ranaConfigurationComputerscatastrophic multiple monitor failures

That's a new failure mode. Probably we can't trust the power to be safe anymore.

Need Steve to order a couple of surge suppressing power strips for the monitors. The computers are already on the UPS, so they don't need it.

  12978   Tue May 9 15:23:12 2017 SteveConfigurationComputerscatastrophic multiple monitor failures

Gautam and Steve,

Surge protective power strip was install on Friday, May 5 in the Control Room

Computers not connected to the UPS are plugged into Isobar12ultra.

Quote:

That's a new failure mode. Probably we can't trust the power to be safe anymore.

Need Steve to order a couple of surge suppressing power strips for the monitors. The computers are already on the UPS, so they don't need it.

 

Attachment 1: Trip-Lite.jpg
Trip-Lite.jpg
  12993   Mon May 15 20:43:25 2017 ranaConfigurationComputerscatastrophic multiple monitor failures

this is not the right one; this Ethernet controlled strip we want in the racks for remote control.

Buy some of these for the MONITORS.

Quote:

Surge protective power strip was install on Friday, May 5 in the Control Room

Computers not connected to the UPS are plugged into Isobar12ultra.

Quote:

That's a new failure mode. Probably we can't trust the power to be safe anymore.

Need Steve to order a couple of surge suppressing power strips for the monitors. The computers are already on the UPS, so they don't need it.

 

  13037   Sun Jun 4 14:19:33 2017 ranaFrogsComputersNetwork slowdown: Martians are behind a waterwall

A few weeks ago we did some internet speed tests and found a dramatic difference between our general network and our internal Martian network in terms of access speed to the outside world.

As you can see, the speed from nodus is consistent with a Gigabit connection. But the speeds from any machine on the inside is ~100x slower. We need to take a look at our router / NAT setup to see if its an old hardware problem or just something in the software firewall. By comparison, my home internet download speed test is ~48 Mbit/s; ~6x faster than our CDS computers.


controls@megatron|~> speedtest
/usr/local/bin/speedtest:5: UserWarning: Module dap was already imported from None, but /usr/lib/python2.7/dist-packages is being added to sys.path
  from pkg_resources import load_entry_point
Retrieving speedtest.net configuration...
Testing from Caltech (131.215.115.189)...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by Race Communications (Los Angeles, CA) [29.63 km]: 6.52 ms
Testing download speed................................................................................
Download: 6.35 Mbit/s
Testing upload speed................................................................................................
Upload: 5.10 Mbit/s
controls@megatron|~> exit
logout
Connection to megatron closed.
controls@nodus|~ > speedtest
Retrieving speedtest.net configuration...
Testing from Caltech (131.215.115.52)...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by Phyber Communications (Los Angeles, CA) [29.63 km]: 2.196 ms
Testing download speed................................................................................
Download: 721.92 Mbit/s
Testing upload speed................................................................................................
Upload: 251.38 Mbit/s

Attachment 1: Screen_Shot_2017-06-04_at_1.47.47_PM.png
Screen_Shot_2017-06-04_at_1.47.47_PM.png
Attachment 2: Screen_Shot_2017-06-04_at_1.44.42_PM.png
Screen_Shot_2017-06-04_at_1.44.42_PM.png
  13044   Mon Jun 5 21:53:55 2017 ranaUpdateComputersrossa: ubuntu 16.04

With the network config, mounting, and symlinks setup, rossa is able to be used as a workstation for dataviewer and MEDM. For DTT, no luck since there is so far no lscsoft support past the Ubuntu14 stage.

  13050   Wed Jun 7 15:41:51 2017 SteveUpdateComputerswindow laptop scanned

Randy Trudeau scanned our Window laptop Dell 13" Vostro and Steve's memory stick for virus. Nothing was found. The search continues...

Rana thinks that I'm creating these virus beasts with taking pictures with Dino Capture and /or Data Ray on the window machine........

 

 

  13065   Thu Jun 15 14:24:48 2017 Kaustubh, JigyasaUpdateComputersOttavia Switched On

Today, I and Jigyasa connected the Ottavia to one of the unused monitor screens Donatella. The Ottavia CPU had a label saying 'SMOKED''. One of the past elogs, 11091, dated back in March 2015, by Jenne had an update regarding the Ottavia smelling 'burny'. It seems to be working fine for about 2 hours now. Once it is connected to the Martian Network we can test it further. The Donatella screen we used seems to have a graphic problem, a damage to the display screen. Its a minor issue and does not affect the display that much, but perhaps it'll be better to use another screen if we plan to use the Ottavia in the future. We will power it down if there is an issue with it.

  13067   Thu Jun 15 19:49:03 2017 Kaustubh, JigyasaUpdateComputersOttavia Switched On

It has been working fine the whole day(we didn't do much testing on it though). We are leaving it on for the night.

Quote:

Today, I and Jigyasa connected the Ottavia to one of the unused monitor screens Donatella. The Ottavia CPU had a label saying 'SMOKED''. One of the past elogs, 11091, dated back in March 2015, by Jenne had an update regarding the Ottavia smelling 'burny'. It seems to be working fine for about 2 hours now. Once it is connected to the Martian Network we can test it further. The Donatella screen we used seems to have a graphic problem, a damage to the display screen. Its a minor issue and does not affect the display that much, but perhaps it'll be better to use another screen if we plan to use the Ottavia in the future. We will power it down if there is an issue with it.

 

  13068   Fri Jun 16 12:37:47 2017 Kaustubh, JigyasaUpdateComputersOttavia Switched On

Ottavia had been left running overnight and it seems to work fine. There has been no smell or any noticeable problems in the working. This morning Gautam, Kaustubh and I connected Ottavia to the Matrian Network through the Netgear switch in the 40m lab area. We were able to SSH into Ottavia through Pianosa and access directories. On the ottavia itself we were able to run ipython, access the internet. Since it seems to work out fine, Kaustubh and I are going to enable the ethernet connection to Ottavia and secure the wiring now.  

Quote:

It has been working fine the whole day(we didn't do much testing on it though). We are leaving it on for the night.

Quote:

Today, I and Jigyasa connected the Ottavia to one of the unused monitor screens Donatella. The Ottavia CPU had a label saying 'SMOKED''. One of the past elogs, 11091, dated back in March 2015, by Jenne had an update regarding the Ottavia smelling 'burny'. It seems to be working fine for about 2 hours now. Once it is connected to the Martian Network we can test it further. The Donatella screen we used seems to have a graphic problem, a damage to the display screen. Its a minor issue and does not affect the display that much, but perhaps it'll be better to use another screen if we plan to use the Ottavia in the future. We will power it down if there is an issue with it.

 

 

  13071   Fri Jun 16 23:27:19 2017 Kaustubh, JigyasaUpdateComputersOttavia Connected to the Netgear Box

I just connected the Ottavia to the Netgear box and its working just fine. It'll remain switched on over the weekend.

Quote:

Kaustubh and I are going to enable the ethernet connection to Ottavia and secure the wiring now.  

 

  13154   Mon Jul 31 20:35:42 2017 KojiSummaryComputersChiara backup situation summary

Summary
- CDS Shared files system: backed up
- Chiara system itself: not backed up


controls@chiara|~> df -m
Filesystem     1M-blocks    Used Available Use% Mounted on
/dev/sda1         450420   11039    416501   3% /
udev               15543       1     15543   1% /dev
tmpfs               3111       1      3110   1% /run
none                   5       0         5   0% /run/lock
none               15554       1     15554   1% /run/shm
/dev/sdb1        2064245 1718929    240459  88% /home/cds
/dev/sdd1        1877792 1426378    356028  81% /media/fb9bba0d-7024-41a6-9d29-b14e631a2628
/dev/sdc1        1877764 1686420     95960  95% /media/40mBackup

/dev/sda1 : System boot disk
/dev/sdb1 : main cds disk file system 2TB partition of 3TB disk (1TB vacant)
/dev/sdc1 : Daily backup of /dev/sdb1 via a cron job (/opt/rtcds/caltech/c1/scripts/backup/localbackup)

/dev/sdd1 : 2014 snap shot of cds. Not actively used. USB

https://nodus.ligo.caltech.edu:8081/40m/11640

 

  13159   Wed Aug 2 14:47:20 2017 KojiSummaryComputersChiara backup situation summary

I further made the burt snapshot directories compressed along with ELOG 11640. This freed up additional ~130GB. This will eventually help to give more space to the local backup (/dev/sdc1)

controls@chiara|~> df -m
Filesystem     1M-blocks    Used Available Use% Mounted on
/dev/sda1         450420   11039    416501   3% /
udev               15543       1     15543   1% /dev
tmpfs               3111       1      3110   1% /run
none                   5       0         5   0% /run/lock
none               15554       1     15554   1% /run/shm
/dev/sdb1        2064245 1581871    377517  81% /home/cds
/dev/sdd1        1877792 1426378    356028  81% /media/fb9bba0d-7024-41a6-9d29-b14e631a2628
/dev/sdc1        1877764 1698489     83891  96% /media/40mBackup

 

 

  13160   Wed Aug 2 15:04:15 2017 gautamConfigurationComputerscontrol room workstation power distribution

The 4 control room workstation CPUs (Rossa, Pianosa, Donatella and Allegra) are now connected to the UPS.

The 5 monitors are connected to the recently acquired surge-protecting power strips.

Rack-mountable power strip + spare APC Surge Arrest power strip have been stored in the electronics cabinet.

Quote:

this is not the right one; this Ethernet controlled strip we want in the racks for remote control.

Buy some of these for the MONITORS.

 

  13227   Thu Aug 17 22:54:49 2017 ericqUpdateComputersTrying to access JetStor RAID files

The JetStor RAID unit that we had been using for frame writing before the fb meltdown has some archived frames from DRFPMI locks that I want to get at. I spent some time today trying to mount it on optimus with no success crying

The unit was connected to fb via a SCSI cable to a SCSI-to-PCI card inside of fb. I moved the card to optimus, and attached the cable. However, no mountable device corresponding to the RAID seems to show up anywhere.

The RAID unit can tell that it's hooked up to a computer, because when optimus restarts, the RAID event log says "Host Channel 0 - SCSI Bus Reset."

The computer is able to get some sort of signals from the RAID unit, because when I change the SCSI ID, the syslog will say 'detected non-optimal RAID status'.

The PCI card is ID'd fine in lspci as "06:01.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev c1)"

'lsssci' does not list anything related to the unit

Using 'mpt-status -p', which is somehow associated with this kind of thing returns the disheartening output:

Checking for SCSI ID:0
Checking for SCSI ID:1
Checking for SCSI ID:2
Checking for SCSI ID:3
Checking for SCSI ID:4
Checking for SCSI ID:5
Checking for SCSI ID:6
Checking for SCSI ID:7
Checking for SCSI ID:8
Checking for SCSI ID:9
Checking for SCSI ID:10
Checking for SCSI ID:11
Checking for SCSI ID:12
Checking for SCSI ID:13
Checking for SCSI ID:14
Checking for SCSI ID:15
Nothing found, contact the author
 
I don't know what to try at this point.
  13239   Tue Aug 22 15:17:19 2017 ericqUpdateComputersOld frames accessible again

It turns out the problem was just a bent pin on the SCSI cable, likely from having to stretch things a bit to reach optimus from the RAID unit.frown

I hooked it up to megatron, and it was automatically recognized and mounted. yes

I had to turn off the new FB machine and remove it from the rack to be able to access megatron though, since it was just sitting on top. FB needs a rail to sit on!

At a cursory glance, the filesystem appears intact. I have copied over the achived DRFPMI frame files to my user directory for now, and Gautam is going to look into getting those permanently stored on the LDAS copy of 40m frames, so that we can have some redundancy.

Also, during this time, one of the HDDs in the RAID unit failed its SMART tests, so the RAID unit wanted it replaced. There were some spare drives in a little box directly under the unit, so I've installed one and am currently incorporating it back into the RAID.

There are two more backup drives in the box. We're running a RAID 5 configuration, so we can only lose one drive at a time before data is lost.

  13240   Tue Aug 22 15:40:06 2017 gautamUpdateComputersOld frames accessible again

[jamie, gautam]

I had some trouble getting the daqd processes up and running again using Jamie's instructions.

With Jamie's help however, they are back up and running now. The problem was that the mx infrastructure didn't come back up on its own. So prior to running sudo systemctl restart daqd_*, Jamie ran sudo systemctl start mx. This seems to have done the trick.

c1iscey was still showing red fields on the CDS overview screen so Jamie did a soft reboot. The machine came back up cleanly, so I restarted all the models. But the indicator lights were still red. Apparently the mx processes weren't running on c1iscey. The way to fix this is to run sudo systemctl start mx_stream. Now everything is green.

Now we are going to work on trying the fix Rolf suggested on c1iscex.

Quote:

It turns out the problem was just a bent pin on the SCSI cable, likely from having to stretch things a bit to reach optimus from the RAID unit.frown

I hooked it up to megatron, and it was automatically recognized and mounted. yes

I had to turn off the new FB machine and remove it from the rack to be able to access megatron though, since it was just sitting on top. FB needs a rail to sit on!

At a cursory glance, the filesystem appears intact. I have copied over the achived DRFPMI frame files to my user directory for now, and Gautam is going to look into getting those permanently stored on the LDAS copy of 40m frames, so that we can have some redundancy.

Also, during this time, one of the HDDs in the RAID unit failed its SMART tests, so the RAID unit wanted it replaced. There were some spare drives in a little box directly under the unit, so I've installed one and am currently incorporating it back into the RAID.

There are two more backup drives in the box. We're running a RAID 5 configuration, so we can only lose one drive at a time before data is lost.

 

  13242   Tue Aug 22 17:11:15 2017 gautamUpdateComputersc1iscex model restarts

[jamie, gautam]

We tried to implement the fix that Rolf suggested in order to solve (perhaps among other things) the inability of some utilities like dataviewer to open testpoints. The problem isn't wholly solved yet - we can access actual testpoint data (not just zeros, as was the case) using DTT, and if DTT is used to open a testpoint first, then dataviewer, but DV itself can't seem to open testpoints.

Here is what was done (Jamie will correct me if I am mistaken).

  1. Jamie checked out branch 3.4 of the RCG from the SVN.
  2. Jamie recompiled all the models on c1iscex against this version of RCG.
  3. I shutdown ETMX watchdog, then ran rtcds stop all on c1iscex to stop all the models, and then restarted them using rtcds start <model> in the order c1x01, c1scx and c1asx. 
  4. Models came back up cleanly. I then restarted the daqd_dc process on FB1. At this point all indicators on the CDS overview screen were green.
  5. Tried getting testpoint data with DTT and DV for ETMX Oplev Pitch and Yaw IN1 testpoints. Conclusion as above.

So while we are in a better state now, the problem isn't fully solved. 

Comment: seems like there is an in-built timeout for testpoints opened with DTT - if the measurement is inactive for some time (unsure how much exactly but something like 5mins), the testpoint is automatically closed.

  13243   Tue Aug 22 18:36:46 2017 gautamUpdateComputersAll FE models compiled against RCG3.4

After getting the go ahead from Jamie, I recompiled all the FE models against the same version of RCG that we tested on the c1iscex models.

To do so:

  • I did rtcds make and rtcds install for all the models.
  • Then I ssh-ed into the FEs and did rtcds stop all, followed by rtcds start <model> in the order they are listed on the CDS overview MEDM screen (top to bottom).
  • During the compilation process (i.e. rtcds make), for some of the models, I got some compilation warnings. I believe these are related to models that have custom C code blocks in them. Jamie tells me that it is okay to ignore these warnings at that they will be fixed at some point.
  • c1lsc FE crashed when I ran rtcds stop all - had to go and do a manual reboot.
  • Doing so took down the models on c1sus and c1ioo that were running - but these FEs themselves did not have to be robooted.
  • Once c1lsc came back up, I restarted all the models on the vertex FEs. They all came back online fine.
  • Then I ssh-ed into FB1, and restarted the daqd processes - but c1lsc and c1ioo CDS indicators were still red.
  • Looks like the mx_stream processes weren't started automatically on these two machines. Reasons unknown. Earlier today, the same was observed for c1iscey.
  • I manually restarted the mx_stream processes, at which point all CDS indicator lights became green (see Attachment #1).

IFO alignment needs to be redone, but at least we now have a (admittedly rounabout way) of getting testpoints. Did a quick check for "nan-s" on the ASC screen, saw none. So I am re-enabling watchdogs for all optics.

GV 23 August 9am: Last night, I re-aligned the TMs for single arm locks. Before the model restarts, I had saved the good alignment on the EPICs sliders, but the gain of x3 on the coil driver filter banks have to be manually turned on at the moment (i.e. the safe.snap file has them off). ALS noise looked good for both arms, so just for fun, I tried transitioning control of both arms to ALS (in the CARM/DARM basis as we do when we lock DRFPMI, using the Transition_IR_ALS.py script), and was successful.

Quote:

[jamie, gautam]

We tried to implement the fix that Rolf suggested in order to solve (perhaps among other things) the inability of some utilities like dataviewer to open testpoints. The problem isn't wholly solved yet - we can access actual testpoint data (not just zeros, as was the case) using DTT, and if DTT is used to open a testpoint first, then dataviewer, but DV itself can't seem to open testpoints.

Here is what was done (Jamie will correct me if I am mistaken).

  1. Jamie checked out branch 3.4 of the RCG from the SVN.
  2. Jamie recompiled all the models on c1iscex against this version of RCG.
  3. I shutdown ETMX watchdog, then ran rtcds stop all on c1iscex to stop all the models, and then restarted them using rtcds start <model> in the order c1x01, c1scx and c1asx. 
  4. Models came back up cleanly. I then restarted the daqd_dc process on FB1. At this point all indicators on the CDS overview screen were green.
  5. Tried getting testpoint data with DTT and DV for ETMX Oplev Pitch and Yaw IN1 testpoints. Conclusion as above.

So while we are in a better state now, the problem isn't fully solved. 

Comment: seems like there is an in-built timeout for testpoints opened with DTT - if the measurement is inactive for some time (unsure how much exactly but something like 5mins), the testpoint is automatically closed.

 

Attachment 1: CDS_Aug22.png
CDS_Aug22.png
  13277   Wed Aug 30 22:15:47 2017 ranaOmnistructureComputersUSB flash drives moved

I have moved the USB flash drives from the electronics bench back into the middle drawer of the cabinet next to the AC which is west of the fridge. Drawer re-enlabeled.

  13287   Fri Sep 1 16:55:27 2017 gautamUpdateComputersTestpoints now accessible again

Thanks to Jonathan Hanks, it appears we can now access test-points again using dataviewer.

I haven't done an exhaustive check just yet, but I have loaded a few testpoints in dataviewer, and ran a script that use testpoint channels (specifically the ALS phase tracker UGF setting script), all seems good.

So if I remember correctly, the major CDS fix now required is to solve the model unloading issue.

Thanks to Jamie/Jonathan Hanks/KT for getting us back to this point! Here are the details:

After reading logs and code, it was a simple daqdrc config change.

The daqdrc should read something like this:

...
set master_config=".../master";
configure channels begin end;
tpconfig ".../testpoint.par";
...


What had happened was tpconfig was put before the configure channels
begin end.  So when daqd_rcv went to configure its test points it did
not have the channel list configured and could not match test points to
the right model & machine.  Dave and I suspect that this is so that it
can do an request directly to the correct front end instead of a general
broadcast to all awgtpman instances.

Simply reordering the config fixes it.

I tested by opening a test point in dataviewer and verifiying that
testpoints had opened/closed by using diag -l.  Xmgr/grace didn't seem
to be able to keep up with the test point data over a remote connection.

You can find this in the logs by looking for entries like the following
while the daqd is starting up.  When we looked we saw that there was an
entry for every model.

Unable to find GDS node 35 system c1daf in INI fiels
  13323   Wed Sep 20 15:49:26 2017 ranaOmnistructureComputersnew internet

Larry Wallace hooked up a new switch (Brocade FWS 648G) today which is our 40m lab interface to the outside world internet. Its faster.

He then, just now, switched over the cables which were going to the old one into the new one, including NODUS and the NAT Router. CDS machines can still connect to the outside world.

In the next week or two, he'll install a new NAT for us so that we can have high speed comm from CDS to the world.

  13405   Sun Oct 29 16:40:17 2017 ranaSummaryComputersdisk cleanup

Backed up all the wikis. Theyr'e in wiki_backups/*.tar.xz (because xz -9e gives better compression than gzip or bzip2)

Moved old user directories in the /users/OLD/

ELOG V3.1.3-