40m Log
ID   Date   Author   Type   Category   Subject
  10906   Thu Jan 15 18:10:19 2015   jamie   Update   Computer Scripts / Programs   Installed kerberos on Rossa
Quote:

I have installed kerberos on Rossa, so that I don't have to type my name and password every time I do an svn checkin, since I'm making some modifications and want to be sure that everything is checked in before and afterwards. 

I ran sudo apt-get install krb5-user.  I didn't put in a default_realm when it prompted me to during install, so I went into the /etc/krb5.conf file and changed the default_realm line to read default_realm = LIGO.ORG

Now we can use kinit, but we must (as usual) remember to kdestroy our credentials when we're done.

As a reminder, to use:

> kinit albert.einstein

Password for albert.einstein@LIGO.ORG: (type your pw here)

When you're finished, run

> kdestroy

The end.

WARNING: since the workstations all share a single user account, if you forget to kdestroy, the next user can commit under your user ID.  It might be good to set the ticket timeout to something much shorter than 24 hours, such as 1 or 2 hours.

  11077   Thu Feb 26 13:55:59 2015   jamie   Update   Computer Scripts / Programs   FB IO load
We should use "ionice" to throttle the rsync. Use something like "ionice -c 3 rsync ..." to set the priority such that the rsync process will only work when there is no other IO contention. See "man ionice" for other options.
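
As a minimal sketch of the suggestion above, a backup script could build the throttled command like this.  The function names, the paths, and the "-a" rsync option are hypothetical placeholders; only the "ionice -c 3" prefix (the idle IO class) comes from the note above.

```python
import subprocess


def throttled_rsync_cmd(src, dest, rsync_opts=("-a",)):
    """Build an rsync command wrapped in ionice's idle class (-c 3)."""
    return ["ionice", "-c", "3", "rsync", *rsync_opts, src, dest]


def run_throttled_rsync(src, dest):
    """Run the throttled rsync; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(throttled_rsync_cmd(src, dest), check=True)


if __name__ == "__main__":
    # Hypothetical paths, for illustration only; this only prints the command.
    print(" ".join(throttled_rsync_cmd("/frames/", "backuphost:/backup/frames/")))
```
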
  11409   Tue Jul 14 11:57:27 2015   jamie   Summary   CDS   CDS upgrade: left running in semi-stable configuration
Quote:

There remains a pattern to some of the restarts, the following times are all reported as restart times. (There are others in between, however.)

daqd: Tue Jul 14 00:02:48 PDT 2015
daqd: Tue Jul 14 01:02:32 PDT 2015
daqd: Tue Jul 14 03:02:33 PDT 2015
daqd: Tue Jul 14 05:02:46 PDT 2015
daqd: Tue Jul 14 06:01:57 PDT 2015
daqd: Tue Jul 14 07:02:19 PDT 2015
daqd: Tue Jul 14 08:02:44 PDT 2015
daqd: Tue Jul 14 09:02:24 PDT 2015
daqd: Tue Jul 14 10:02:03 PDT 2015

Before the upgrade, we suffered from hourly crashes too:

daqd_start Sun Jun 21 00:01:06 PDT 2015
daqd_start Sun Jun 21 01:03:47 PDT 2015
daqd_start Sun Jun 21 02:04:04 PDT 2015
daqd_start Sun Jun 21 03:04:35 PDT 2015
daqd_start Sun Jun 21 04:04:04 PDT 2015
daqd_start Sun Jun 21 05:03:45 PDT 2015
daqd_start Sun Jun 21 06:02:43 PDT 2015
daqd_start Sun Jun 21 07:04:42 PDT 2015
daqd_start Sun Jun 21 08:04:34 PDT 2015
daqd_start Sun Jun 21 09:03:30 PDT 2015
daqd_start Sun Jun 21 10:04:11 PDT 2015

So, this isn't necessarily new behavior, just something that remains unfixed. 

That's interesting, that we're still seeing those hourly crashes.

We're not writing out the full set of channels, though, and we're getting more failures than just those at the hour, so we're still suffering.
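
One way to quantify the hourly pattern is to compute how far past the top of the hour each reported restart falls.  The sketch below does that for the Jul 14 times quoted above; the helper name is hypothetical, and the timezone token is simply dropped rather than interpreted.

```python
from datetime import datetime

RESTART_LOG = """\
daqd: Tue Jul 14 00:02:48 PDT 2015
daqd: Tue Jul 14 01:02:32 PDT 2015
daqd: Tue Jul 14 03:02:33 PDT 2015
daqd: Tue Jul 14 05:02:46 PDT 2015
daqd: Tue Jul 14 06:01:57 PDT 2015
daqd: Tue Jul 14 07:02:19 PDT 2015
daqd: Tue Jul 14 08:02:44 PDT 2015
daqd: Tue Jul 14 09:02:24 PDT 2015
daqd: Tue Jul 14 10:02:03 PDT 2015
"""


def seconds_past_hour(line):
    """Return seconds elapsed since the top of the hour for one restart line."""
    parts = line.replace("daqd:", "").split()
    # Drop the timezone token (index 4, e.g. "PDT") before parsing.
    stamp = " ".join(parts[:4] + parts[5:])
    t = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
    return t.minute * 60 + t.second


offsets = [seconds_past_hour(l) for l in RESTART_LOG.splitlines() if l.strip()]
print(offsets)
print("min/max offset [s]:", min(offsets), max(offsets))
```

For these nine entries every restart lands within the first three minutes after the hour.
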

  11410   Tue Jul 14 13:55:28 2015   jamie   Update   CDS   running test on daqd, please leave undisturbed

I'm running a test with daqd right now, so please do not disturb for the moment.

I'm temporarily writing frames into a tmpfs, which is a filesystem that exists purely in memory.  There should be ZERO IO contention for this filesystem, so if the daqd failures are due to IO then all problems should disappear.  If they don't, then we're dealing with some other problem.

There will be no data saved during this period.

  11413   Tue Jul 14 17:06:00 2015   jamie   Update   CDS   running test on daqd, please leave undisturbed

I have reverted daqd to the previous configuration, so that it's writing frames to disk.  It's still showing instability.

  11420   Thu Jul 16 11:18:37 2015   jamie   Update   General   Starting IFO recovery, DAC troubles
Quote:

I've been trying to start recovering IFO functionality, but quickly hit a frustrating roadblock. 

Upon opening the PSL shutter, and deactivating the MC mirror watchdogs, I saw the MC reflected beam moving way more than normal. 

A series of investigations revealed no signals coming out of c1sus's DAC.

The IOP (c1x02) shows two of its DAC-related statewords (DAC and DK) in a fault mode, which means (quoting T1100625):

"As of RCG V2.7, if an error is detected in one or more DAC modules, the IOP will continue to run but only write zero values to the DAC modules as a protective measure. This can only be cleared by restarting the IOP and all applications running on the affected computer."

The offending card may be DAC1, which has its fourth bit red even with only the IOP running, which corresponds to a "FIFO error". /proc/c1x02/status states, in part:

DAC #0 16-bit fifo_status=2 (OK)
DAC #1 16-bit fifo_status=3 (empty)
DAC #2 16-bit fifo_status=2 (OK)

Squishing cables and restarting the frontend have not helped anything. 

c1lsc, c1isce[x/y] are not suffering from this problem, and appear to be happily using their DACs. c1ioo does not use any DAC channels. 

We need to update the indicators on the CDS_FE_STATUS screen to expose the new indicators, so that we have better visibility for these issues.

I'm not sure why this DAC is failing. It may indicate an actual problem with the DAC itself.

Quote:

As a further headache, any time I restart any of the models on the c1sus frontend, the BURT restore is totally bunk. Moreover, using burtgooey to restore a good snapshot to the c1sus model triggers a timing overflow and model crash, maybe not so surprising since the model seems to be averaging ~56usec or so. 

This is related to changes to how the front ends load their safe.snaps. I think they're now explicitly expecting the file:

target/<model>/<model>epics/burt/safe.snap

I'll come over this afternoon and we can get acquainted with the new SDF system that now handles management of the safe.snap files.

  11426   Sat Jul 18 14:55:33 2015   jamie   Update   General   all front ends back up and running

After some surgery yesterday the front ends are all back up and running:

  • Eric found that one of the DAC cards in the c1sus front end was not being properly initialized (with the new RCG code).  Turned out that it was an older version DAC, with a daughter board on top of a PCIe board.  We suspected that there was some compatibility issue with that version of the card, so Eric pulled an unused card from c1ioo to replace the one in c1sus.  That worked and now c1sus is running happily.
  • Eric put the old DAC card into c1ioo, but it didn't like it and was having trouble booting.  I removed the card and c1ioo came up fine on its own.
  • After all front ends were back up and running, all RFM connections were dead.  I tracked this down to the RFM switch being off, because the power cable was not fully seated.  This probably happened when Steve was cleaning up the 1X4/5 racks.  I re-powered the RFM switch and all the RFM connections came back on-line.
  • All receivers of Dolphin (DIS) "PCIE" IPC signals from c1ioo were throwing errors.  I tracked this down to the Dolphin cable going to c1ioo being plugged in to the wrong port on the c1ioo dolphin card.  I unplugged it and plugged it into the correct port, which of course caused all front end models using dolphin to crash.  Once I restarted all those models, everything came back.

  11428   Sat Jul 18 16:03:00 2015   jamie   Update   CDS   EPICS freezes persist

I notice that the periodic EPICS freezes persist.  They last for 5-10 seconds.  MEDM completely freezes up, but then it comes back.

The sites have been noticing similar issues on a less dramatic scale.  Maybe we can learn from whatever they figure out.

  11429   Sat Jul 18 16:59:01 2015   jamie   Update   CDS   unloaded, turned off loading of, symmetricom kernel module on fb

fb has been loading a 'symmetricom' kernel module, presumably because it was once being used to help with timing.  It's no longer needed, so I unloaded it and commented out the lines that loaded it in /etc/conf.d/local.start.

  11627   Mon Sep 21 15:22:19 2015   jamie   Update   DAQ   working on new fb replacement

I've been putting together a new machine that Rolf got for us as a replacement for fb.

I've installed and configured the OS, and compiled daqd and the necessary supporting software.  I want to try acquiring data with it.  This will require removing the current/old fb from the DAQ network, and adding the new machine.  It should be able to be done relatively non-invasively, such that none of the front end configuration needs to be adjusted, and the old fb can be put back in place easily.

If the test is successful, then I'll push ahead with the rest of the replacement (such as either moving or copying the /frames RAID to the new machine).

I will do this work in the early AM tomorrow, September 22, 2015.

  11636   Tue Sep 22 17:30:55 2015   jamie   Update   DAQ   attempts at getting new fb working

Today I've been trying to get the new frame builder, tentatively 'fb1', to work.  It's not fully working yet, so I'm about to revert the system back to using 'fb'.  The switch-over process is annoying, since our one myrinet card has to be moved between the hosts.

A brief update on the process so far:

I'm being a little bold with this system by trying to build daqd against more system libraries, instead of the manually installed versions that are nominally required.  Here's some of the relevant info about the fb1 system:

  • Debian 7 (wheezy)
  • lscsoft ldas-tools-framecpp-dev 2.4.1-1+deb7u0
  • lscsoft gds-dev 2.17.2-2+deb7u0
  • lscsoft libmetaio-dev 8.4.0-1+deb7u0
  • lscsoft libframe-dev 8.20-1+deb7u0
  • /opt/rtapps/epics-1.4.12.2_long
  • /opt/mx-1.2.16
  • advLigoRTS trunk

I finally managed to get daqd to build against the advLigoRTS trunk (post 2.9 branch).  I'll post a detailed build log once I work out all the kinks.  It runs ok, including writing out full frames, as well as second and minute trends and raw minute trends, but there are a couple of show-stopper problems:

  • daqd segfaults if the C1EDCU.ini is specified.  If I comment out that one file from the 'master' channel ini file list then it runs without segfaulting.
  • Something is going on with the mx_streams from the front ends:
    • They appear to look ok from the daqd side, but the FEC-<ID>_FB_NET_STATUS indicators remain red.  The "DAQ" bit in the STATE_WORD is also red.  Again, this is even though data seems to be flowing.
    • The mx_stream processes on the front ends are dying (and restarting via monit) about every 2 minutes.  It's unclear what exactly is happening, but they all die around the same time, so it is possibly initiated by a daqd problem.  Around the time of the mx_stream failures, we see this in the daqd log:
[Tue Sep 22 17:24:07 2015] GPS MISS dcu 91 (TST); dcu_gps=1127003062 gps=1127003063

Aborted 1 send requests due to remote peer Aborted 1 send requests due to remote peer 00:25:90:0d:75:bb (c1sus:0) disconnected
mx_wait failed in rcvr eid=004, reqn=11; wait did not complete; status code is Remote endpoint is closed
00:30:48:d6:11:17 (c1iscey:0) disconnected
mx_wait failed in rcvr eid=002, reqn=235; wait did not complete; status code is Remote endpoint is closed
disconnected from the sender on endpoint 002
mx_wait failed in rcvr eid=005, reqn=253; wait did not complete; status code is Bad session (missing mx_connect?)
disconnected from the sender on endpoint 005
disconnected from the sender on endpoint 004
[Tue Sep 22 17:24:13 2015] GPS MISS dcu 39 (PEM); dcu_gps=1127003062 gps=1127003069
  • Occasionally the daqd process dies when the front end mx_stream processes die.

I'll keep investigating, hopefully with some feedback from Keith and Rolf tomorrow.

  11645   Fri Sep 25 17:51:11 2015   jamie   Update   DAQ   fb replacement work update

Brief update about the fb replacement status.

The new hardware for fb is in the rack, temporarily sitting on top of megatron, and on the CDS network with the name 'fb1'.  I've installed an OS on it and have re-built daqd.

Earlier this week I swapped it into the network and tried to get it to acquire data from the front ends.  I was ultimately unsuccessful.  The problem seemed to be the mx_stream communication from the front ends to the new host.

The swap is sort of a pain because we only have one Myrinet fiber network adapter card that has to be moved between machines, which of course requires shutting down both machines and opening up their chassis.  I instructed Steve to order us a new Myrinet card for the new machine, which will allow us to swap daqd machines by just moving the fiber connection.  Once that's in place (early next week) I'll go back to trying to figure out what the issue is with the mx_streams.

If all else fails I'll take the repulsive last resort of either swapping or cloning the disk from the old fb.

  11653   Wed Sep 30 13:59:49 2015   jamie   Update   DAQ   attempts at getting new fb working

I got Steve to get us a new Myrinet fiber network adapter card for fb1:

  • Myrinet 10G-PCIE-8B-S

I just finished installing the card in fb1, and it came up fine.  We happened to have a spare fiber, and a spare fiber jack in the DAQ switch, so I went ahead and plugged it in in parallel to the old fb:

controls@fb1:~/rtbuild/trunk 130$ /opt/mx/bin/mx_info
MX Version: 1.2.16
MX Build: controls@fb1:/opt/src/mx-1.2.16 Fri Sep 18 18:32:59 PDT 2015
1 Myrinet board installed.
The MX driver is configured to support a maximum of:
    8 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
===================================================================
Instance #0:  364.4 MHz LANai, PCI-E x8, 2 MB SRAM, on NUMA node 0
    Status:         Running, P0: Link Up
    Network:        Ethernet 10G

    MAC Address:    00:60:dd:43:74:62
    Product code:   10G-PCIE-8B-S
    Part number:    09-04228
    Serial number:  485052
    Mapper:         00:60:dd:46:ea:ec, version = 0x00000000, configured
    Mapped hosts:   7

                                                        ROUTE COUNT
INDEX    MAC ADDRESS     HOST NAME                        P0
-----    -----------     ---------                        ---
   0) 00:60:dd:43:74:62 fb1:0                             1,0
   1) 00:25:90:0d:75:bb c1sus:0                           1,0
   2) 00:30:48:be:11:5d c1iscex:0                         1,0
   3) 00:30:48:d6:11:17 c1iscey:0                         1,0
   4) 00:30:48:bf:69:4f c1lsc:0                           1,0
   5) 00:14:4f:40:64:25 c1ioo:0                           1,0
   6) 00:60:dd:46:ea:ec fb:0                              1,0

We can now work on fb1 while fb continues to run and collect data from the front ends.

I'm still not getting the mx_stream connections to the new fb1 daq to work.  I'm leaving everything running as is on fb for the moment.

  11655   Thu Oct 1 19:49:52 2015   jamie   Update   DAQ   more failed attempts at getting new fb working

Summary

I've not really been able to make additional progress with the new 'fb1' DAQ.  It's still flaky as hell.  Therefore we're still using old 'fb'.

Issues

mx_stream

The mx_stream processes on the front ends initially run fine, connecting to the daqd and transferring data, with both DAQ-..._STATUS and FE-..._FB_NET_STATUS indicators green.  Then after about two minutes all the mx_stream processes on all the front ends die.  Monit eventually restarts them all, at which point they come up green for a while until they crash again ~2 minutes later.  This is essentially the same situation as reported previously.

In the daqd logs when the mx_streams die:

Aborted 2 send requests due to remote peer 00:30:48:be:11:5d (c1iscex:0) disconnected
Aborted 2 send requests due to remote peer 00:14:4f:40:64:25 (c1ioo:0) disconnected
Aborted 2 send requests due to remote peer 00:30:48:d6:11:17 (c1iscey:0) disconnected
Aborted 2 send requests due to remote peer 00:25:90:0d:75:bb (c1sus:0) disconnected
Aborted 1 send requests due to remote peer 00:30:48:bf:69:4f (c1lsc:0) disconnected
mx_wait failed in rcvr eid=000, reqn=176; wait did not complete; status code is Remote endpoint is closed
disconnected from the sender on endpoint 000
mx_wait failed in rcvr eid=000, reqn=177; wait did not complete; status code is Connectivity is broken between the source and the destination
disconnected from the sender on endpoint 000
mx_wait failed in rcvr eid=000, reqn=178; wait did not complete; status code is Connectivity is broken between the source and the destination
disconnected from the sender on endpoint 000
mx_wait failed in rcvr eid=000, reqn=179; wait did not complete; status code is Connectivity is broken between the source and the destination
disconnected from the sender on endpoint 000
mx_wait failed in rcvr eid=000, reqn=180; wait did not complete; status code is Connectivity is broken between the source and the destination
disconnected from the sender on endpoint 000
[Thu Oct  1 19:00:09 2015] GPS MISS dcu 39 (PEM); dcu_gps=1127786407 gps=1127786425

[Thu Oct  1 19:00:09 2015] GPS MISS dcu 39 (PEM); dcu_gps=1127786408 gps=1127786426

[Thu Oct  1 19:00:09 2015] GPS MISS dcu 39 (PEM); dcu_gps=1127786408 gps=1127786426
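
To track how far behind a DCU has fallen, the "GPS MISS" lines can be parsed and the two timestamps differenced.  This is only a sketch: the regular expression is written against the lines shown above, and it assumes (from the field names alone) that dcu_gps is the last timestamp seen from the DCU and gps is the time daqd expected.

```python
import re

# Pattern written against the daqd log lines quoted above.
GPS_MISS = re.compile(r"GPS MISS dcu (\d+) \((\w+)\); dcu_gps=(\d+) gps=(\d+)")


def gps_lag(line):
    """Return (dcu_id, name, lag_seconds) for a GPS MISS line, else None."""
    m = GPS_MISS.search(line)
    if m is None:
        return None
    dcu, name, dcu_gps, gps = m.groups()
    return int(dcu), name, int(gps) - int(dcu_gps)


log = [
    "[Thu Oct  1 19:00:09 2015] GPS MISS dcu 39 (PEM); dcu_gps=1127786407 gps=1127786425",
    "[Thu Oct  1 19:00:09 2015] GPS MISS dcu 39 (PEM); dcu_gps=1127786408 gps=1127786426",
    "disconnected from the sender on endpoint 000",
]
for line in log:
    print(gps_lag(line))
```

For the two GPS MISS lines here the difference is 18 seconds, compared with 1 second in the first miss reported on Sep 22.
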

In the mx_stream logs:

controls@c1iscey ~ 0$ /opt/rtcds/caltech/c1/target/fb/mx_stream -r 0 -W 0 -w 0 -s 'c1x05 c1scy c1tst' -d fb1:0
mmapped address is 0x7f0df23a6000
mmapped address is 0x7f0dee3a6000
mmapped address is 0x7f0dea3a6000
send len = 263596
Connection Made
isendxxx failed with status Remote Endpoint Unreachable
disconnected from the sender

daqd

While the mx_stream processes are running daqd seems to write out data just fine.  At least for the full frames.  I manually verified that there is indeed data in the frames that are written.

Eventually, though, daqd itself crashes with the same error that we've been seeing:

main profiler warning: 0 empty blocks in the buffer

I'm not exactly sure what the crashes are coincident with, but it looks like they are also coincident with the writing out of the minute and/or second trend files.  It's unclear how it's related to the mx_stream crashes, if at all.  The mx_stream crashes happen every couple of minutes, whereas the daqd itself crashes much less frequently.

The new daqd can't handle EDCU files.  If an EDCU file is specified (e.g. C0EDCU.ini in our case), the daqd will segfault very soon after startup.  This was an issue with the current daqd on fb, but was "fixed" by moving where the EDCU file was specified in the master file.

Conclusion

There are a number of differences between the fb1 and fb configurations:

  • newer OS (Debian 7 vs. ancient gentoo)
  • newer advLigoRTS (trunk vs. 2.9.4)
  • newer framecpp library installed from LSCSoft Debian repo (2.4.1-1+deb7u0 vs. 1.19.32-p1)

It's possible those differences could account for the problems (/opt/rtapps/epics incompatible with this Debian install, for instance).  Somehow I doubt it.  I wonder if all the weird network issues we've been seeing are somehow involved.  If the NFS mount of chiara is problematic for some reason that would affect everything that mounts it, which includes all the front ends and fb/fb1.

There are two things to try:

  • Fix the weird network problem.  Try removing EVERYTHING from the network except for chiara, fb/fb1, and the front ends and see if that helps.
  • Rebuild fb1 with Ubuntu and daqd as prescribed by Keith Thorne.
  11656   Thu Oct 1 20:24:02 2015   jamie   Update   DAQ   more failed attempts at getting new fb working

I just realized that when running fb1, if a single mx_stream dies they all die.

  11657   Thu Oct 1 20:26:21 2015   jamie   Update   DAQ   Swapping between fb and fb1

Swapping between fb and fb1 as DAQ is very straightforward, now that they are both on the DAQ network:

  • stop daqd on fb
  • on fb sudoedit /diskless/root/etc/init.d/mx_stream and set: endpoint=fb1:0
  • start daqd on fb1.  The "new" daqd binary on fb1 is at: ~controls/rtbuild/trunk/build/mx-localtime/daqd

Once daqd starts, the front end mx_stream processes will be restarted by their monits, and be pointing to the new location.

Moving back is just reversing those steps.

  11661   Sun Oct 4 12:07:11 2015   jamie   Configuration   CDS   CDS network tests in progress

I'm about to start conducting some tests on the CDS network.  Things will probably be offline for a bit.  Will post when things are back to normal.

  11662   Sun Oct 4 13:53:30 2015   jamie   Update   LSC   SENSMAT oscillator used for EPICS tests

I've taken over one of the SENSMAT oscillators for a test of the EPICS system.

These are the channels I've modified, with their original and current settings:

controls@donatella|~ > caget C1:LSC-OUTPUT_MTRX_7_13 C1:CAL-SENSMAT_CARM_OSC_FREQ C1:CAL-SENSMAT_CARM_OSC_CLKGAIN
C1:LSC-OUTPUT_MTRX_7_13          -1
C1:CAL-SENSMAT_CARM_OSC_FREQ    309.21
C1:CAL-SENSMAT_CARM_OSC_CLKGAIN   0
controls@donatella|~ > caget C1:LSC-OUTPUT_MTRX_7_13 C1:CAL-SENSMAT_CARM_OSC_FREQ C1:CAL-SENSMAT_CARM_OSC_CLKGAIN
C1:LSC-OUTPUT_MTRX_7_13           0
C1:CAL-SENSMAT_CARM_OSC_FREQ      0.1
C1:CAL-SENSMAT_CARM_OSC_CLKGAIN   3
controls@donatella|~ >

 

 

  11663   Sun Oct 4 14:23:42 2015   jamie   Configuration   CDS   CDS network test complete

I've finished, for now, the CDS network tests that I was conducting.  Everything should be back to normal.

What I did:

I wanted to see if I could make the EPICS glitches we've been seeing go away if I unplugged everything from the CDS martian switch in 1X6 except for:

  • fb
  • fb1
  • chiara
  • all the front end machines

What I unplugged were things like megatron, nodus, the slow computers, etc.  The control room workstations were still connected, so that I could monitor.

I then used StripTool to plot the output of a front end oscillator that I had set up to generate a 0.1 Hz sine wave (see elog 11662).  The slow sine wave makes it easy to see the glitches, which show up as flatlines in the trace.

More tests are needed, but there was evidence that unplugging all the extra stuff from the switch did make the EPICS glitches go away.  For the duration of the test I did not see any EPICS glitches.  Once I plugged everything back in, I started to see them again.  However, I'm currently not seeing many glitches (with everything plugged back in) so I'm not sure what that means.  If unplugging everything did help, we still need to figure out which machine is the culprit.
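
Rather than watching StripTool by eye, the flatlines could be picked out of a recorded trace automatically.  The sketch below simulates a 0.1 Hz sine with an artificial freeze and then finds runs of identical consecutive samples.  The 10 Hz sample rate, the 6-second freeze, and the function names are assumptions for illustration, not measured properties of the EPICS readback.

```python
import math


def sine_samples(n, rate_hz=10.0, freq_hz=0.1, freeze=None):
    """Sample a sine; for indices in freeze=(start, stop), hold the last value."""
    out = []
    for i in range(n):
        if freeze and freeze[0] <= i < freeze[1] and out:
            out.append(out[-1])  # frozen readback: the value stops updating
        else:
            out.append(math.sin(2 * math.pi * freq_hz * i / rate_hz))
    return out


def find_flatlines(samples, min_len=5):
    """Return (start, length) for runs of >= min_len identical samples."""
    runs, start = [], 0
    for i in range(1, len(samples) + 1):
        if i == len(samples) or samples[i] != samples[start]:
            if i - start >= min_len:
                runs.append((start, i - start))
            start = i
    return runs


trace = sine_samples(300, freeze=(120, 180))
print(find_flatlines(trace))            # simulated freeze
print(find_flatlines(sine_samples(300)))  # clean sine, no flatlines
```

The exact-equality comparison works here because a frozen readback repeats the same value; real data with noise on top would need a tolerance instead.
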

  11664   Sun Oct 4 14:28:03 2015   jamie   Update   DAQ   more failed attempts at getting new fb working

I tried to look at fb1 again today, but still haven't made any progress.

The one thing I did notice, though, is that every hour on the hour the fb1 daqd process dies in an identical manner to how the fb daqd dies, with these:

[Sun Oct  4 12:02:56 2015] main profiler warning: 0 empty blocks in the buffer

errors right as/after it tries to write out the minute trend frames.

This makes me think that this new hardware isn't actually going to fix the problem we've been seeing with the fb daqd, even if we do get daqd "working" on fb1 as well as it's currently working on fb.

  11665   Sun Oct 4 14:32:49 2015   jamie   Configuration   CDS   CDS network test complete

Here's an example of the glitches we've been seeing, as seen in the StripTool trace of the front end oscillator:

You can clearly see the glitch at around T = -18.  Obviously during non-glitch times the sine wave is nice and cleanish (there is still the very small discretisation from the EPICS sample times).

  11859   Mon Dec 7 11:25:10 2015   jamie   Update   CDS   daqd is mad
Quote:

A question to Jamie: although the new framebuilder prototype still had the same problem with trend writing, can it handle this higher testpoint/DQ channel load?

The new fb1 daqd was also crashing even without the trend writing enabled.  I'm not sure how much that's affected by the load, though, e.g. it might be able to handle the extra load fine but then die because of some other issue not related to the number of channels being acquired.

We should schedule some time this week to work on fb1 some more.

  11868   Wed Dec 9 19:01:45 2015   jamie   Update   CDS   back to fb1

I spent this afternoon trying to debug fb1, with very little to show for it.  We're back to running from fb.

The first thing I did was to recompile EPICS from source, so that all the libraries needed by daqd were compiled for the system at hand.  I compiled epics-3.14-12-2_long from source, and installed it at /opt/rtapps/epics on local disk, not on the /opt/rtapps network mount.  I then recompiled daqd against that, and the framecpp, gds, etc from the LSCSoft packages.  So everything has been compiled for this version of the OS.  The compilation goes smoothly.

There are two things that I see while running this new daqd on fb1:

instability with mx_streams

The mx stream connection between the front ends and the daqd is flaky.  Everything will run fine for a while, then spontaneously one or all of the mx_stream processes on the front ends will die.  It appears more likely that all mx_stream processes will die at the same time.  It's unclear if this is some sort of chain reaction thing, or if something in daqd or in the network itself is causing them all to die at the same time.  It is independent of whether or not we're using multiple mx "end points" (i.e. a different one for each front end and separate receiver threads in the daqd) or just a single one (all front ends connecting to a single mx receiver thread in daqd).

Frequently daqd will recover from this.  The monit processes on the front ends restart the mx_stream processes and all will be recovered.  However, occasionally, possibly if the mx_streams do not recover fast enough (which seems to be related to how frequently the receiver threads in daqd can clear themselves), daqd will start to choke and will start spitting out the "empty blocks" messages that are the harbinger of doom:

Aborted 2 send requests due to remote peer 00:30:48:be:11:5d (c1iscex:0) disconnected
00:30:48:d6:11:17 (c1iscey:0) disconnected
mx_wait failed in rcvr eid=005, reqn=182; wait did not complete; status code is Remote endpoint is closed
disconnected from the sender on endpoint 005
mx_wait failed in rcvr eid=001, reqn=24; wait did not complete; status code is Remote endpoint is closed
disconnected from the sender on endpoint 001
[Wed Dec  9 18:40:14 2015] main profiler warning: 1 empty blocks in the buffer
[Wed Dec  9 18:40:15 2015] main profiler warning: 0 empty blocks in the buffer
[Wed Dec  9 18:40:16 2015] main profiler warning: 0 empty blocks in the buffer

My suspicion is that this type of failure is tied to the mx stream failures, so we should be looking at the mx connections and network to solve this problem.

frame writing troubles

There's possibly a separate issue associated with writing the second or minute trend files to disk.  With fair regularity daqd will die soon after it starts to write out the trend frames, producing the similar "empty blocks" messages.

  12152   Tue Jun 7 11:12:47 2016   jamie   Update   CDS   DAQD UPGRADE WORK UNDERWAY

I am restarting work on the daqd upgrade now.  Expect the daqd to be offline for most of the day.  I will report progress.

  12155   Tue Jun 7 20:49:50 2016   jamie   Update   CDS   DAQD work ongoing

Summary: new daqd code running overnight test on fb1.  Stability issues persist.

The code is from Keith's "tests/advLigoRTS-40m" branch, which is a branch of the current trunk.  It's supposed to include patches to fix the crashes when multiple frame types are written to disk at the same time.  However, the issue is not fixed:

2016-06-07_20:38:55 about to write frame @ 1149392336
2016-06-07_20:38:55 Begin Full WriteFrame()
2016-06-07_20:38:57 full frame write done in 2seconds
2016-06-07_20:39:11 about to write frame @ 1149392352
2016-06-07_20:39:11 Begin Full WriteFrame()
2016-06-07_20:39:13 full frame write done in 2seconds
2016-06-07_20:39:27 about to write frame @ 1149392368
2016-06-07_20:39:27 Begin Full WriteFrame()
2016-06-07_20:39:29 full frame write done in 2seconds
2016-06-07_20:39:43 about to write second trend frame @ 1149391800
2016-06-07_20:39:43 Begin second trend WriteFrame()
2016-06-07_20:39:43 about to write frame @ 1149392384
2016-06-07_20:39:43 Begin Full WriteFrame()
2016-06-07_20:39:44 full frame write done in 1seconds
2016-06-07_20:39:59 about to write frame @ 1149392400
2016-06-07_20:40:04 Begin Full WriteFrame()
2016-06-07_20:40:04 Second trend frame write done in 21 seconds
2016-06-07_20:40:14 [Tue Jun  7 20:40:14 2016] main profiler warning: 1 empty blocks in the buffer
2016-06-07_20:40:15 [Tue Jun  7 20:40:15 2016] main profiler warning: 0 empty blocks in the buffer
2016-06-07_20:40:16 [Tue Jun  7 20:40:16 2016] main profiler warning: 0 empty blocks in the buffer
2016-06-07_20:40:17 [Tue Jun  7 20:40:17 2016] main profiler warning: 0 empty blocks in the buffer
2016-06-07_20:40:18 [Tue Jun  7 20:40:18 2016] main profiler warning: 0 empty blocks in the buffer
2016-06-07_20:40:19 [Tue Jun  7 20:40:19 2016] main profiler warning: 0 empty blocks in the buffer
2016-06-07_20:40:20 [Tue Jun  7 20:40:20 2016] main profiler warning: 0 empty blocks in the buffer
2016-06-07_20:40:21 [Tue Jun  7 20:40:21 2016] main profiler warning: 0 empty blocks in the buffer
2016-06-07_20:40:22 [Tue Jun  7 20:40:22 2016] main profiler warning: 0 empty blocks in the buffer
2016-06-07_20:40:23 [Tue Jun  7 20:40:23 2016] main profiler warning: 0 empty blocks in the buffer

This failure comes when a full frame (1149392384+16) is written to disk at the same time as a second trend (1149391800+600).  It seems like every time this happens daqd crashes.

I have seen other stability issues as well, maybe caused by mx flakiness, or some sort of GPS time synchronization issue caused by our lack of IRIG-B cards.  I'm going to look to see if I can get the GPS issue taken care of so we take that out of the picture.

For the last couple of hours I've only seen issues with the frame writing every 20 minutes, when the full and second trend frames happen to be written at the same time.  Running overnight to gather more statistics.
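
The 20-minute spacing follows from the two frame lengths.  As a quick check, assuming (as the GPS times in the log suggest) that each frame type starts on a GPS second that is a multiple of its own length, the two writes coincide once per least common multiple of the lengths:

```python
from math import gcd

FULL_FRAME_S = 16      # full frame length, from 1149392384+16
SECOND_TREND_S = 600   # second trend frame length, from 1149391800+600


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


period = lcm(FULL_FRAME_S, SECOND_TREND_S)
print(period, "s =", period // 60, "min")

# Both frames in the log above end on the same GPS second,
# and that second is a multiple of the coincidence period.
full_end = 1149392384 + FULL_FRAME_S
trend_end = 1149391800 + SECOND_TREND_S
print(full_end == trend_end, full_end % period)
```

This gives 1200 s, i.e. 20 minutes, matching the failure cadence seen so far.
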

  12156   Wed Jun 8 08:34:55 2016   jamie   Update   CDS   DAQD work ongoing

38 restarts overnight.  Problem definitely not fixed.  I'll be reverting back to old daqd and fb this morning.  Then regroup and evaluate options.

  12158   Wed Jun 8 13:50:39 2016   jamie   Configuration   CDS   Spectracom IRIG-B card installed on fb1

[EDIT: corrected name of installed card]

We just installed a Spectracom TSync-PCIe timing card on fb1.  The hope is that this will help with the GPS timing synchronization issues we've been seeing in the new daqd on fb1, hopefully eliminating some of the potential failure channels.

The driver, called "symmetricom" in the advLigoRTS source (name of product from competing vendor), was built/installed (from DCC T1500227):

controls@fb1:~/rtscore/tests/advLigoRTS-40m 0$ cd src/drv/symmetricom/
controls@fb1:~/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom 0$ ls
Makefile  stest.c  symmetricom.c  symmetricom.h
controls@fb1:~/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom 0$ make
make -C /lib/modules/3.2.0-4-amd64/build SUBDIRS=/home/controls/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom modules
make[1]: Entering directory `/usr/src/linux-headers-3.2.0-4-amd64'
  CC [M]  /home/controls/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom/symmetricom.o
/home/controls/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom/symmetricom.c:59:9: warning: initialization from incompatible pointer type [enabled by default]
/home/controls/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom/symmetricom.c:59:9: warning: (near initialization for ‘symmetricom_fops.unlocked_ioctl’) [enabled by default]
/home/controls/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom/symmetricom.c: In function ‘get_cur_time’:
/home/controls/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom/symmetricom.c:89:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
/home/controls/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom/symmetricom.c: In function ‘symmetricom_init’:
/home/controls/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom/symmetricom.c:188:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
/home/controls/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom/symmetricom.c:222:3: warning: label ‘out_remove_proc_entry’ defined but not used [-Wunused-label]
/home/controls/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom/symmetricom.c:158:22: warning: unused variable ‘pci_io_addr’ [-Wunused-variable]
/home/controls/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom/symmetricom.c:156:6: warning: unused variable ‘i’ [-Wunused-variable]
  Building modules, stage 2.
  MODPOST 1 modules
  CC      /home/controls/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom/symmetricom.mod.o
  LD [M]  /home/controls/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom/symmetricom.ko
make[1]: Leaving directory `/usr/src/linux-headers-3.2.0-4-amd64'
controls@fb1:~/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom 0$ sudo make install
#remove all old versions of the driver
find /lib/modules/3.2.0-4-amd64 -name symmetricom.ko -exec rm -f {} \; || true
find /lib/modules/3.2.0-4-amd64 -name symmetricom.ko.gz -exec rm -f {} \; || true
# Install new driver
install -D -m 644 symmetricom.ko /lib/modules/3.2.0-4-amd64/extra/symmetricom.ko
/sbin/depmod -a || true
/sbin/modprobe symmetricom
if [ -e /dev/symmetricom ] ; then \
        rm -f /dev/symmetricom ; \
    fi
mknod /dev/symmetricom c `grep symmetricom /proc/devices|awk '{print $1}'` 0
chown controls /dev/symmetricom
controls@fb1:~/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom 0$ ls /dev/symmetricom
/dev/symmetricom
controls@fb1:~/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom 0$ ls -al /dev/symmetricom
crw-r--r-- 1 controls root 250, 0 Jun  8 13:42 /dev/symmetricom
controls@fb1:~/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom 0$ 
  12161   Thu Jun 9 13:28:07 2016 jamieConfigurationCDSSpectracom IRIG-B card installed on fb1

Something is wrong with the timing we're getting out of the symmetricom driver, associated with the new spectracom card.

controls@fb1:~/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom 127$ lalapps_tconvert 
1149538884
controls@fb1:~/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom 0$ cat /proc/gps 
704637380.00
controls@fb1:~/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom 0$ 

The GPS time is way off, and it's counting up at something like 900 seconds/second.  Something is misconfigured, but I haven't figured out what yet.

The timing distribution module we're using is spitting out what appears to be an IRIG B122 signal (amplitude modulated 1 kHz carrier), which I think is what we expect.  This is being fed into the "AM IRIG input" connector on the card.

Not sure why the driver is spinning so fast, though, with the wrong baseline time.  Reboot of the machine didn't help.
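
For reference, the two numbers above can be turned into calendar dates with a few lines of Python.  This is only a sketch: gps_to_utc is a hypothetical helper (not part of lalapps or the driver), and it assumes a fixed GPS-UTC offset of 17 leap seconds, which is right for mid-2016 but not for earlier dates.

```python
from datetime import datetime, timedelta

GPS_EPOCH = datetime(1980, 1, 6)  # GPS time zero: 1980-01-06 00:00:00 UTC


def gps_to_utc(gps_seconds, leap_seconds=17):
    """Convert GPS seconds to a naive UTC datetime.

    leap_seconds is the GPS-UTC offset; 17 s is an assumption that
    holds for mid-2016 only.
    """
    return GPS_EPOCH + timedelta(seconds=gps_seconds - leap_seconds)


good = gps_to_utc(1149538884)    # value printed by lalapps_tconvert
bad = gps_to_utc(704637380.00)   # value read from /proc/gps

print("tconvert :", good)
print("/proc/gps:", bad)
print("difference in years: %.1f" % ((good - bad).days / 365.25))
```

With these inputs the /proc/gps value lands in 2002, so the offset is far too large to be a leap-second or time-zone mistake.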

  12162   Thu Jun 9 15:14:46 2016 jamieUpdateCDSold fb restarted, test of new daqd on fb1 aborted for time being

I've restarted the old daqd on fb until I can figure out what's going on with the symmetricom driver on fb1.

Steve:  Jamie with hair.... long time ago
 

  12166   Fri Jun 10 12:09:01 2016 jamieConfigurationCDSIRIG-B debugging

Looks like we might have a problem with the IRIG-B output of the GPS receiver.

Rolf came over this morning to help debug the strange symmetricom driver behavior on fb1 with the new Spectracom card.  We restarted the machine again, and this time when we loaded the driver it was clocking at a normal rate (second/second).  However, the overall GPS time was still wrong, showing a time in October of this year.

The IRIG-B122 output is supposed to encode the time of year via amplitude modulation of a 1kHz carrier.  The current time of year is:

controls@fb1:~ 0$ TZ=utc date +'%j day, %T'
162 day, 18:57:35
controls@fb1:~ 0$ 

The absolute year is not encoded, though, so the symmetricom driver has the year offset hard coded (yuck), to which it adds the time of year from the IRIG-B signal to get the correct GPS time.

However, loading the symmetricom module shows the following:

...
[ 1601.607403] Spectracom GPS card on bus 1; device 0
[ 1601.607408] TSYNC PIC BASE 0 address = fb500000
[ 1601.607429] Remapped 0xffffc90017012000
[ 1606.606164] TSYNC NOT receiving YEAR info, defaulting to by year patch
[ 1606.606168] date = 299 days 18:28:1161455320
[ 1606.606169] bcd time = 1161455320 sec  959 milliseconds 398 microseconds  959398630 nanosec
[ 1606.606171] Board sync = 1
[ 1606.616076] TSYNC NOT receiving YEAR info, defaulting to by year patch
[ 1606.616079] date = 299 days 18:28:1161455320
[ 1606.616080] bcd time = 1161455320 sec  969 milliseconds 331 microseconds  969331350 nanosec
[ 1606.616081] Board sync = 1
controls@fb1:~ 0$ 

Apparently the symmetricom driver thinks it's the 299th day of the year, which of course corresponds to some time in October, which jibes with the GPS time the driver is spitting out.
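
The day-of-year arithmetic is easy to check.  A minimal sketch, assuming the year is 2016 and using a hypothetical helper doy_to_date (nothing here comes from the driver source):

```python
from datetime import date, timedelta


def doy_to_date(year, day_of_year):
    """Return the calendar date for a 1-based day of year."""
    return date(year, 1, 1) + timedelta(days=day_of_year - 1)


expected = doy_to_date(2016, 162)  # day reported by `date +%j` on fb1
decoded = doy_to_date(2016, 299)   # day reported by the symmetricom driver

print("expected:", expected)
print("decoded :", decoded)
print("offset in days:", (decoded - expected).days)
```

Day 162 is June 10 and day 299 is October 25, so the decoded IRIG-B date is 137 days ahead of the real one.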

Rolf then noticed that the timing module in the VME crate in the adjacent rack, which also receives an IRIG-B signal from the distribution box, was also showing day 299 on its front panel display. We checked and confirmed that the symmetricom card and the VME timing module both agree on the wrong time of year, strongly suggesting that the GPS receiver is outputting bogus data on its IRIG-B output, even though it's showing the correct time on its front panel.  We played around with settings in the GPS receiver to no avail.  Finally we rebooted the GPS receiver, but it seemed to come up with the same bogus IRIG-B output (again both symmetricom driver and VME timing module agree on the wrong day).

So maybe our GPS receiver is busted?  Not sure what to try now.

 

  12167   Fri Jun 10 12:21:54 2016 jamieConfigurationCDSGPS receiver not resetting properly

The GPS receiver (EndRun Technologies box in 1Y5? (rack closest to door)) seems not to be coming back up properly after the reboot.  The front panel says that it's "LKD", but the "sync" LED is flashing instead of solid, and the time of year displayed on the front panel is showing day 6.  The fb1 symmetricom driver and VME timing module are still both seeing day 299, though.  So something definitely seems to be screwy with the GPS receiver.

  12179   Tue Jun 14 19:37:40 2016 jamieUpdateCDSOvernight daqd test underway

I'm running another overnight test with new daqd software on fb1.  The normal daqd process on fb has been shutdown, and the front ends are sending their signals to fb1.

fb1 is running separate data concentrator (dc) and frame writer (fw) processes, to see if this is a more stable configuration than the all-in-one framebuilder (fb) that we have been trying to run with.  I'll report on the test tomorrow.

  12181   Wed Jun 15 09:52:02 2016 jamieUpdateCDSVery encouraging results from overnight split daqd test

Very encouraging results from the test last night.  The new configuration did not crash once overnight, and seemed to write out full, second trend, and minute trend frames without issue.  However, full validity of all the written out frames has not been confirmed.

overview

The configuration under test involves two separate daqd binaries instead of one.  We usually run with what is referred to as a "framebuilder" (fb) configuration:

  • fb: a single daqd binary that:
    • collects the data from the front ends
    • collates full data into frame file format
    • calculates trend data
    • writes frame files to disk.

The current configuration separates the tasks into multiple separate binaries: a "data concentrator" (dc) and a "frame writer" (fw):

  • dc:
    • collects data from front ends
    • collates full data into frame file format
    • broadcasts frame files over local network
  • fw:
    • receives frame files from broadcast
    • calculates trend data
    • writes frame files to disk

This configuration is more like what is run at the sites, where all the various components are separate and run on separate hardware.  In our case, I tried just running the two binaries on the same machine, with the broadcast going over the loopback interface.  None of the systems that use separated daqd tasks see the failures that we've been seeing with the all-in-one fb configuration (failures that other sites like AEI have also seen).

My guess is that there's some busted semaphore somewhere in daqd that's being shared between the concentrator and writer components.  The writer component probably acquires the lock while it's writing out the frame, which prevents the concentrator from doing what it needs to be doing while the frame is being written out.  That causes the concentrator to lock up and die if the frame writing takes too long (which it seems to almost necessarily do, especially when trend frames are also being written out).
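
To make the guess concrete, here is a toy model of it.  It is not daqd code and says nothing about how daqd is actually written: the function name, the cycle counts, and the idea of counting "missed" collection cycles are all assumptions chosen for illustration.  The concentrator wants to collect data every cycle; the writer starts a write every write_every cycles and, in the shared-lock case, blocks the concentrator for write_duration cycles.

```python
def missed_cycles(n_cycles, write_every, write_duration, shared_lock):
    """Count collection cycles the concentrator cannot service.

    shared_lock=True  models the all-in-one fb guess: the writer holds a
                      lock that the concentrator also needs.
    shared_lock=False models separate dc and fw processes that do not
                      share a lock.
    """
    missed = 0
    lock_held_until = -1  # last cycle during which the writer holds the lock
    for t in range(n_cycles):
        if shared_lock and t <= lock_held_until:
            missed += 1  # concentrator is blocked for this cycle
        if t % write_every == 0:
            # writer starts a frame write that occupies the next cycles
            lock_held_until = t + write_duration
    return missed


# 64 cycles, a write every 16 cycles, each write taking 3 cycles
print("all-in-one:", missed_cycles(64, 16, 3, shared_lock=True))
print("split     :", missed_cycles(64, 16, 3, shared_lock=False))
```

In this model the blocked time grows with the write duration, which is at least consistent with the crashes clustering around the times the larger trend frames are written.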

results

The current configuration hasn't been tweaked or optimized at all.  There is of course basically no documentation on the meaning of the various daqdrc directives.  Hopefully I can get Keith Thorne to help me figure out a well optimized configuration.

There is at least one problem whereby the fw component is issuing an excessively large number of re-transmission requests:

2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 6 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 8 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 3 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 5 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 5 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 5 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 5 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 6 packets; port 7097
2016-06-15_09:46:23 [Wed Jun 15 09:46:23 2016] Ask for retransmission of 1 packets; port 7097

It's unclear why.  Presumably the retransmission requests are being honored, and the fw eventually gets the data it needs.  Otherwise I would hope that there would be the appropriate errors.
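
To put a number on "excessively large", the requests can be tallied per second straight from the fw log.  A small sketch, assuming the log lines always have the format shown above; packets_per_second is a hypothetical helper, and the sample is just the excerpt from this entry.

```python
import re
from collections import Counter

LOG = """\
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 6 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 8 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 3 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 5 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 5 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 5 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 5 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 6 packets; port 7097
2016-06-15_09:46:23 [Wed Jun 15 09:46:23 2016] Ask for retransmission of 1 packets; port 7097
"""

# timestamp, bracketed date, packet count, port
PATTERN = re.compile(
    r"^(\S+) \[.*?\] Ask for retransmission of (\d+) packets; port (\d+)$"
)


def packets_per_second(log_text):
    """Sum requested retransmission packets for each timestamp."""
    counts = Counter()
    for line in log_text.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            counts[match.group(1)] += int(match.group(2))
    return counts


for stamp, total in sorted(packets_per_second(LOG).items()):
    print(stamp, total)
```

For the excerpt above this gives 43 packets requested in the 09:46:22 second alone.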

The data is being written out as expected:

 full/11500: total 182G
drwxr-xr-x  2 controls controls 132K Jun 15 09:37 .
-rw-r--r--  1 controls controls  69M Jun 15 09:37 C-R-1150043856-16.gwf
-rw-r--r--  1 controls controls  68M Jun 15 09:37 C-R-1150043840-16.gwf
-rw-r--r--  1 controls controls  68M Jun 15 09:37 C-R-1150043824-16.gwf
-rw-r--r--  1 controls controls  69M Jun 15 09:36 C-R-1150043808-16.gwf
-rw-r--r--  1 controls controls  69M Jun 15 09:36 C-R-1150043792-16.gwf
-rw-r--r--  1 controls controls  68M Jun 15 09:36 C-R-1150043776-16.gwf
-rw-r--r--  1 controls controls  68M Jun 15 09:36 C-R-1150043760-16.gwf
-rw-r--r--  1 controls controls  69M Jun 15 09:35 C-R-1150043744-16.gwf

 trend/second/11500: total 11G
drwxr-xr-x  2 controls controls 4.0K Jun 15 09:29 .
-rw-r--r--  1 controls controls 148M Jun 15 09:29 C-T-1150042800-600.gwf
-rw-r--r--  1 controls controls 148M Jun 15 09:19 C-T-1150042200-600.gwf
-rw-r--r--  1 controls controls 148M Jun 15 09:09 C-T-1150041600-600.gwf
-rw-r--r--  1 controls controls 148M Jun 15 08:59 C-T-1150041000-600.gwf
-rw-r--r--  1 controls controls 148M Jun 15 08:49 C-T-1150040400-600.gwf
-rw-r--r--  1 controls controls 148M Jun 15 08:39 C-T-1150039800-600.gwf
-rw-r--r--  1 controls controls 148M Jun 15 08:29 C-T-1150039200-600.gwf
-rw-r--r--  1 controls controls 148M Jun 15 08:19 C-T-1150038600-600.gwf

 trend/minute/11500: total 152M
drwxr-xr-x 2 controls controls 4.0K Jun 15 07:27 .
-rw-r--r-- 1 controls controls  51M Jun 15 07:27 C-M-1150023600-7200.gwf
-rw-r--r-- 1 controls controls  51M Jun 15 04:31 C-M-1150012800-7200.gwf
-rw-r--r-- 1 controls controls  51M Jun 15 01:27 C-M-1150002000-7200.gwf

The frame sizes look more or less as expected, and they seem to be valid as determined with some quick checks with the framecpp command line utilities.
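
A cheap extra check, short of validating frame contents, is to confirm that the files are contiguous in time.  The sketch below assumes the names follow the pattern <site>-<type>-<GPS start>-<duration>.gwf; find_gaps is a hypothetical helper, and it looks only at file names, not at the data inside.

```python
import re

FRAME_RE = re.compile(
    r"^(?P<site>[A-Z])-(?P<tag>[A-Z]+)-(?P<start>\d+)-(?P<dur>\d+)\.gwf$"
)


def find_gaps(names):
    """Return (end_of_previous, start_of_next) pairs for non-contiguous files."""
    spans = []
    for name in names:
        match = FRAME_RE.match(name)
        if match:
            spans.append((int(match.group("start")), int(match.group("dur"))))
    spans.sort()
    gaps = []
    for (start0, dur0), (start1, _) in zip(spans, spans[1:]):
        if start0 + dur0 != start1:
            gaps.append((start0 + dur0, start1))
    return gaps


# full frame names from the listing above
full = [
    "C-R-1150043856-16.gwf", "C-R-1150043840-16.gwf",
    "C-R-1150043824-16.gwf", "C-R-1150043808-16.gwf",
    "C-R-1150043792-16.gwf", "C-R-1150043776-16.gwf",
    "C-R-1150043760-16.gwf", "C-R-1150043744-16.gwf",
]
print("gaps in full frames:", find_gaps(full))
```

An empty list means every file starts exactly where the previous one ends.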

  12183   Wed Jun 15 11:21:51 2016 jamieUpdateCDSstill work to do to transition to new configuration/code

Just to be clear, there's still quite a bit of work to fully transition the 40m to this new system/configuration.  Once we determine a good configuration we need to complete the install, and modify the setup to run the two binaries instead of just the one.  The data is also being written to a raid on the new fb1, and we need to decide if we should use this new raid, or try to figure out how to move the old jetstor raid to the new fb1 machine.

  12191   Thu Jun 16 16:11:11 2016 jamieUpdateCDSupgrade aborted for now

After poking at the new configuration more, it also started to show instability.  I couldn't figure out how to make test points or excitations available in this configuration.  After adding in the full set of test point channels and trying to do simple things like plotting channels with dtt, the frame writer (fw) would fall over, apparently unable to keep up with the broadcast from the dc.

I've reverted everything back to the old semi-working fb configuration, and will be kicking this to the CDS group to deal with.

  12201   Mon Jun 20 11:19:41 2016 jamieConfigurationCDSGPS receiver not resetting properly
Quote:

I called EndRun (https://www.endruntechnologies.com/pdf/USM3014-0000-000.pdf) and they said it very likely just needs a software update. They will email Jamie the details.

I got the email from them.  There was apparently a bug that manifested on February 14 2016.  I'll try the software update today.

http://endruntechnologies.com/pdf/FSB160218.pdf
http://endruntechnologies.com/upgradetemplx.htm

  12202   Mon Jun 20 14:03:04 2016 jamieConfigurationCDSEndRun GPS receiver upgraded, fixed

I just upgraded the EndRun Technologies Tempus LX GPS receiver timing unit, and it seems to have fixed all the problems.

Thanks to Steve for getting the info from EndRun.  There was indeed a bug in the firmware that was fixed with a firmware upgrade.

I upgraded both the system firmware and the firmware of the GPS subsystem:

Tempus LX GPS(root@Tempus:~)-> gntpversion 
Tempus LX GPS 6010-0044-000 v 5.70 - Wed Oct 1 04:28:34 UTC 2014
Tempus LX GPS(root@Tempus:~)-> gpsversion 
F/W 5.10 FPGA 0416
Tempus LX GPS(root@Tempus:~)->

After reboot the system is fully functional, displaying the correct time, and outputting the correct IRIG-B data, as confirmed by the VME timing unit.

I added a wiki page for the unit: https://wiki-40m.ligo.caltech.edu/NTP

 

Steve added this picture

  12724   Mon Jan 16 22:03:30 2017 jamieConfigurationComputersMegatron update
Quote:
 

We should consider upgrading a few of our workstations to Ubuntu 14 LTS to see how painful it is to run our scripts and DTT and DV. Better to upgrade a bit before we are forced to by circumstance.

I would recommend upgrading the workstations to one of the reference operating systems, either SL7 or Debian 8 (jessie), since that's what the sites are moving towards.  If you do that you can just install all the control room software from the supported repos, and not worry about having to compile things from source anymore.

  12763   Fri Jan 27 17:49:41 2017 jamieUpdateCDStest of new daqd code on fb1

Just FYI I'm running a test of updated daqd code on fb1. 

fb1 has its own fiber to the daq network switch, so nothing had to be modified to do this test. This *should* not affect anything in the rest of the system, but as we all know these are famous last words....  If something is going haywire, and you can't get in touch with me and can't figure what else to do, you can just log on to fb1 and shut it down.  It's not writing any data to any of the network filesystems.

The daqd code under test is from the latest advLigoRTS 3.2.1 tag, which has daqd stability fixes that will hopefully address the problems we were seeing last time I tried this upgrade.  We'll see...

I'm going to let it run over the weekend, and will check in periodically.

  12769   Sat Jan 28 12:05:57 2017 jamieUpdateCDStest of new daqd code on fb1
Quote:

I'm not sure if this is related, but since this morning, I've noticed that the data concentrator errors have returned. Looking at daqd.log, there is a 1 second timing mismatch error that is being generated. Usually, manually running ntpdate on the front ends fixes this problem, but it did not work today.

If this problem started before ~4pm on Friday then it's probably unrelated, since I didn't start any of these tests until after that.  If the unexplained problem persists then we can try shutting off the fb1 daqd and see if that helps.

  12770   Mon Jan 30 18:41:41 2017 jamieUpdateCDSTEST ABORTED of new daqd code on fb1

I just aborted the fb1 test and reverted everything to the nominal configuration.  Everything looks to be operating nominally.  Front ends are mostly green except for c1rfm and c1asx which are currently not being acquired by the DAQ, and an unknown IPC error with c1daf.  Please let me know if any unusual problems are encountered.

The behavior of daqd on fb1 with the latest release (3.2.1) was not improved.  After turning on the full pipe it was back to crashing every 10 minutes or so when the full and second trend frames were being written out.  lame.  back to the drawing board...

  12794   Fri Feb 3 11:03:06 2017 jamieUpdateCDSmore testing fb1; DAQ DOWN DURING TEST

More testing of fb1 today.  DAQ DOWN UNTIL FURTHER NOTICE.

Testing Wednesday did not resolve anything, but Jonathan Hanks is helping.

  12798   Sat Feb 4 12:20:39 2017 jamieSummaryCDS/cvs/cds/caltech/chans back on svn1.6
Quote:

True - it's an issue. Koji and I are updating zita into Ubuntu16 LTS. If it looks like it's OK with various tools we'll swap over the others into it. Until then I figure we're best off turning allegra back into Ubuntu12 to avoid a repeat of this kind of conflict. Once the workstations in the LLO control room are running smoothly on a new OS for a year, we can transfer into that. I don't think any of us wants to be the CDS beta tester for DV or DTT.

Just to be clear, since there seems to be some confusion, the SVN issue has nothing to do with Debian vs. Ubuntu.  SVN made non-backwards compatible changes to their working copy data format that breaks newer checkouts with older clients.  You will run into the exact same problem with newer Ubuntu versions.

I recommend the 40m start moving towards the reference operating systems (Debian 8 or SL7) as that's where CDS is moving.  By moving to newer Ubuntu versions you're moving away from CDS support, not towards it.

  12799   Sat Feb 4 12:29:20 2017 jamieSummaryCDS/cvs/cds/caltech/chans back on svn1.6

No, not confused on that point. We just will not be testing OS versions at the 40m or running multiple OS's on our workstations. As I've said before, we will only move to so-called 'reference' systems once they've been in use for a long time.

Quote:
Quote:

True - it's an issue. Koji and I are updating zita into Ubuntu16 LTS. If it looks like it's OK with various tools we'll swap over the others into it. Until then I figure we're best off turning allegra back into Ubuntu12 to avoid a repeat of this kind of conflict. Once the workstations in the LLO control room are running smoothly on a new OS for a year, we can transfer into that. I don't think any of us wants to be the CDS beta tester for DV or DTT.

Just to be clear, since there seems to be some confusion, the SVN issue has nothing to do with Debian vs. Ubuntu.  SVN made non-backwards compatible changes to their working copy data format that breaks newer checkouts with older clients.  You will run into the exact same problem with newer Ubuntu versions.

I recommend the 40m start moving towards the reference operating systems (Debian 8 or SL7) as that's where CDS is moving.  By moving to newer Ubuntu versions you're moving away from CDS support, not towards it.

 

  12800   Sat Feb 4 12:50:01 2017 jamieSummaryCDS/cvs/cds/caltech/chans back on svn1.6
Quote:

No, not confused on that point. We just will not be testing OS versions at the 40m or running multiple OS's on our workstations. As I've said before, we will only move to so-called 'reference' systems once they've been in use for a long time.

Ubuntu16 is not to my knowledge used for any CDS system anywhere.  I'm not sure how you expect to have better support for that.  There are no pre-compiled packages of any kind available for Ubuntu16.  Good luck, you big smelly doofuses. Nyah, nyah, nyah.

  13108   Mon Jul 10 21:03:48 2017 jamieUpdateGeneralAll FEs down

 

Quote:
 

However, FB still will not boot up. The error is identical to that discussed in this thread by Intel. It seems FB is having trouble finding its boot disk. I was under the impression that only the FE machines were diskless, and that FB had its own local boot disk - in which case I don't know why this error is showing up. According to the linked thread, it could also be a problem with the network card/cable, but I saw both lights on the network switch port FB is connected to turn green when I powered the machine on, so this seems unlikely. I tried following the steps listed in the linked thread but got nowhere, and I don't know enough about how FB is supposed to boot up, so I am leaving things in this state now. 

It's possible the fb BIOS got into a weird state.  fb definitely has its own local boot disk (*not* diskless boot).  Try to get to the BIOS during boot and make sure it's pointing to its local disk to boot from.

If that's not the problem, then it's also possible that fb's boot disk got fried in the power glitch.  That would suck, since we'd have to rebuild the disk.  If it does seem to be a problem with the boot disk then we can do some invasive poking to see if we can figure out what's up with the disk before rebuilding.

  13115   Wed Jul 12 14:52:32 2017 jamieUpdateGeneralAll FEs down

I just want to mention that the situation is actually much more dire than we originally thought.  The diskless NFS root filesystem for all the front-ends was on that fb disk.  If we can't recover it we'll have to rebuild the front end OS as well.

As of right now none of the front ends are accessible, since obviously their root filesystem has disappeared.

  13151   Sat Jul 29 16:24:55 2017 jamieUpdateGeneralPSL StripTool flatlined
Quote:
Unrelated to this work: It looks like some/all of the FE models were re-started. The x3 gain on the coil outputs of the 2 ITMs and BS, which I had manually engaged when I re-aligned the IFO on Monday, were off, and in general, the IMC and IFO alignment seem much worse now than it was yesterday. I will do the re-alignment later as I'm not planning to use the IFO today.

This was me.  I restarted the front ends when I was getting the MX streams working yesterday.  I'll try to be more conscientious about logging front end restarts.

  13340   Thu Sep 28 11:13:32 2017 jamieUpdateCDS40m files backup situation

 

Quote:

After consulting with Jamie, we reached the conclusion that the reason why the root of FB1 is so huge is because of the way the RAID for /frames is setup. Based on my googling, I couldn't find a way to exclude the nfs stuff while doing a backup using dd, which isn't all that surprising because dd is supposed to make an exact replica of the disk being cloned, including any empty space. So we don't have that flexibility with dd. The advantage of using dd is that if it works, we have a plug-and-play clone of the boot disk and root filesystem which we can use in the event of a hard-disk failure.

  1. One option would be to stop all the daqd processes, unmount /frames, and then do a dd backup of the true boot disk and root filesystem.
  2. Another option would be to use rsync to do the backup - this way we can selectively copy the files we want and ignore the nfs stuff. I suspect this is what we will have to do for the second layer of backup we have planned, which will be run as a daily cron job. But I don't think this approach will give us a plug-and-play replacement disk in the event of a disk failure.
  3. Third option is to use one of the 2TB HGST drives, and just do a dd backup - some of this will be /frames, but that's okay I guess.

This is not quite right.  First of all, /frames is not NFS.  It's a mount of a local filesystem that happens to be on a RAID.  Second, the frames RAID is mounted at /frames.  If you do a dd of the underlying block device (in this case /dev/sda*), you're not going to copy anything that's mounted on top of it.

What I was saying about /frames is that I believe there is data in the underlying directory /frames that the frames RAID is mounted on top of.  In order to not get that in the copy of /dev/sda4 you would need to unmount the frames RAID from /frames, and delete everything from the /frames directory.  This would not harm the frames RAID at all.

But it doesn't really matter because the backup disk has space to cover the whole thing so just don't worry about it.  Just dd /dev/sda to the backup disk and you'll just be copying the root filesystem, which is what we want.

  13344   Fri Sep 29 09:43:52 2017 jamieHowToCDSpyawg

 

Quote:

I've modified the __init__.py file located at /ligo/apps/linux-x86_64/cdsutils-480/lib/python2.7/site-packages/cdsutils/__init__.py so that you can now simply import pyawg from cdsutils. On the control room workstations, iPython is set up such that cdsutils is automatically imported as "cds". Now this import also includes the pyawg stuff. So to use some pyawg function, you would just do (for example):

exc=cds.awg.ArbitraryLoop(excChan,excit,rate=fs)

One could also explicitly do the import if cdsutils isn't automatically imported:

from cdsutils import awg

pyawg-away!


Linking this useful instructional elog from Chris here: https://nodus.ligo.caltech.edu:8081/Cryo_Lab/1748

?  Why aren't you able to just import 'awg' directly?  You shouldn't have to import it through cdsutils.  Something must be funny with the config.
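
One quick way to see what the config is doing is to ask Python where it would load each module from.  The sketch below is written for Python 3, where importlib.util.find_spec is available; the workstations' Python 2.7 would need something like imp.find_module instead.  where_is is a hypothetical helper, and the module names are just the ones discussed here.

```python
import importlib.util


def where_is(module_name):
    """Return the file a top-level module would be loaded from, or None."""
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        return None
    return spec.origin if spec else None


for name in ("awg", "cdsutils"):
    print(name, "->", where_is(name))
```

If awg comes back as None while cdsutils resolves, the directory holding awg is most likely missing from sys.path (for example from PYTHONPATH).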
