ID   Date   Author   Type   Category   Subject
  12980   Wed May 10 12:37:41 2017   gautam   Update   CDS   MCautolocker dead

The MCautolocker had stalled - there were no new lines in the logfile after 12:17pm (~20 mins ago). Normally it suffices to ssh into megatron and run sudo initctl restart MCautolocker - but there was no running initctl instance of it, so I had to run sudo initctl start MCautolocker. The FSS Slow control initctl process also seemed to have been terminated, so I ran sudo initctl start FSSslowPy.

It is not clear to me why the initctl instances got killed in the first place, but MC locks fine now.
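For future reference, the state of these upstart jobs can be checked before (re)starting them (a sketch using standard upstart commands; job names as above):

controls@megatron:~$ sudo initctl list | grep -E 'MCautolocker|FSSslowPy'
# a healthy job reports "start/running" with a PID; one showing
# "stop/waiting" (or missing from the list) has to be started by hand:
controls@megatron:~$ sudo initctl start MCautolocker
controls@megatron:~$ sudo initctl start FSSslowPy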

  12982   Wed May 10 16:57:52 2017   rana   Update   CDS   MCautolocker dead

I rebooted megatron around 12:20 today. It had dozens of stalled MEDM processes (some of them there since February!). I couldn't kill them without them coming back like zombies, so I did sudo reboot.

  12991   Mon May 15 08:26:43 2017   rana   Update   CDS   SVN up in userapps/cds

I did an 'svn update' in userapps/cds/, which pulled in some changes from the sites as well as various CDS utilities in common/ and utilities/.

This was to get Keith Thorne's get_data.m and get_data2.m scripts, which I tested; they seem to be able to get data. No success with getting minute trends yet, but that may be user error.

Update Monday 15-May: Our NDS client version is 0.10, and we need 0.14 for this new method to work. The Ubuntu 12 lscsoft repo doesn't have a newer NDS client, so we'll have to upgrade the OS at some point.

  13012   Thu May 25 12:22:59 2017   gautam   Update   CDS   slow machine bootfest

After ~3 months without any problems on the slow machine front, I had to reboot c1psl, c1susaux and c1iscaux today. The control room StripTool traces were not being displayed for all the PSL channels, so I ran testSlowMachines.bash to check the status of the slow machines, which indicated that these three were dead. After rebooting them, I had to burt-restore the c1psl snapshot as usual to get the PMC to lock. Now both the PMC and IMC are locked. I also had to restart the StripTool traces (using scripts/general/startStrip.sh) to get the unresponsive traces back online.

Steve tells me that we probably have to reboot the vacuum slow machines sometime soon too, as the MEDM screens for the vacuum indicator channels are unresponsive.

Quote:

Had to reboot c1psl, c1susaux, c1auxex, c1auxey and c1iscaux today. PMC has been relocked. ITMX didn't get stuck. According to this thread, there have been two instances in the last 10 days in which c1psl and c1susaux have failed. Since we seem to be doing this often lately, I've made a little script that uses the netcat utility to check which slow machines respond to telnet, it is located at /opt/rtcds/caltech/c1/scripts/cds/testSlowMachines.bash.
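The check the quoted script performs is essentially this (a sketch - the hostname list and use of the telnet port are illustrative of the approach; the canonical version is testSlowMachines.bash itself):

#!/bin/bash
# Probe each slow machine's telnet port; a crashed VME crate
# will refuse or ignore the connection.
for host in c1psl c1susaux c1iscaux c1auxex c1auxey c1iool0; do
    if nc -z -w 2 "$host" 23; then
        echo "$host: responding"
    else
        echo "$host: DEAD - needs a crate key/reboot"
    fi
done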

 

 

  13028   Thu Jun 1 15:37:01 2017   gautam   Update   CDS   slow machine bootfest

Steve alerted me that the IMC wouldn't lock. Reboots for c1susaux and c1iool0 today. I tried using the reset button instead of keying the crates. This worked for c1iool0, but not for c1susaux, so I had to key the latter crate. The machine took a good 5-10 minutes to come back up, but eventually it did. Now the IMC locks fine.

  13059   Mon Jun 12 10:34:10 2017   gautam   Update   CDS   slow machine bootfest

Reboots for c1susaux, c1iscaux, c1auxex today. I took this opportunity to squish the Sat. Box cabling for MC2 (both on the Sat. Box end and at the vacuum feedthrough), as some work has recently been ongoing there - maybe something got accidentally jiggled in the process and was causing the MC2 alignment to jump around.

Relocked the PMC to offload some of the DC offset, and re-aligned the IMC after the c1susaux reboot. PMC and IMC transmissions are back to nominal levels now. Let's see if MC2 is better behaved after this Sat. Box voodoo.

Interestingly, since Feb 6, there were no slow machine reboots for almost 3 months, while there have been three reboots in the last three weeks. Not sure what (if anything) to make of that.

  13069   Fri Jun 16 13:53:11 2017   gautam   Update   CDS   slow machine bootfest

Reboots for c1psl, c1iool0, c1iscaux today. MC autolocker log was complaining that the C1:IOO-MC_AUTOLOCK_BEAT EPICS channel did not exist, and running the usual slow machine check script revealed that these three machines required reboots. PMC was relocked, IMC Autolocker was restarted on Megatron and everything seems fine now.

 

  13096   Wed Jul 5 16:09:34 2017   gautam   Update   CDS   slow machine bootfest

Reboots for c1susaux, c1iscaux today.

 

  13125   Wed Jul 19 08:37:21 2017   Jamie   Update   CDS   Update on front-end/DAQ rebuild

After the catastrophic fb disk failure last week we lost essentially the entire front end system (not any of the userapp code, but the front end boot server, operating system, and DAQ).  The fb disk was entirely unrecoverable, so we've been trying to rebuild everything from the bits and pieces lying around, and some disks that Keith Thorne sent from LLO.  We're trying to get the front ends working first, and will work on recovering daqd after.

Luckily, fb1, which was being configured as an fb replacement, is mostly fully configured, including having a copy of the front end diskless root image.  We set up fb1 as the new boot server, and were able to get front ends booting again.  Unfortunately, we've been having trouble running and building models, so something is still amiss.  We've been taking a three-pronged approach to getting the front ends running:

  • /diskless/root.fb: This involves booting the front ends from the backup of the diskless root from fb.  Runs gentoo kernel 2.6.34.1.  This should correspond to the environment that all models were built and running against.  But something is missing in the configuration.  The front ends were also mounting /opt from fb, which included the dolphin drivers, and we don't have a copy of that, so models aren't loading or recompiling.
  • /diskless/root.x1boot: Keith sent a disk image of the entire x1boot server from LLO.  It uses gentoo kernel 3.0.8.  This ostensibly includes everything we should need to run the front ends, but it's unfortunately configured with newer versions of some of the software and also isn't loading our existing models or building new ones.  This also seems to be having issues with the dolphin drivers.
  • /diskless/root.jessie: This is an entirely new boot image built from scratch with Debian jessie, using an RTS-patched 3.2 kernel.  This would use the latest versions of everything.  It's mostly working, we just need to rebuild the dolphin driver and source.

It seems that in all cases we need to rebuild the dolphin drivers from source.

  13127   Wed Jul 19 14:26:50 2017   Jamie   Update   CDS   Update on front-end/DAQ rebuild

 

Quote:

After the catastrophic fb disk failure last week we lost essentially the entire front end system (not any of the userapp code, but the front end boot server, operating system, and DAQ).  The fb disk was entirely unrecoverable, so we've been trying to rebuild everything from the bits and pieces lying around, and some disks that Keith Thorne sent from LLO.  We're trying to get the front ends working first, and will work on recovering daqd after.

Luckily, fb1, which was being configured as an fb replacement, is mostly fully configured, including having a copy of the front end diskless root image.  We set up fb1 as the new boot server, and were able to get front ends booting again.  Unfortunately, we've been having trouble running and building models, so something is still amiss.  We've been taking a three-pronged approach to getting the front ends running:

  • /diskless/root.fb: This involves booting the front ends from the backup of the diskless root from fb.  Runs gentoo kernel 2.6.34.1.  This should correspond to the environment that all models were built and running against.  But something is missing in the configuration.  The front ends were also mounting /opt from fb, which included the dolphin drivers, and we don't have a copy of that, so models aren't loading or recompiling.
  • /diskless/root.x1boot: Keith sent a disk image of the entire x1boot server from LLO.  It uses gentoo kernel 3.0.8.  This ostensibly includes everything we should need to run the front ends, but it's unfortunately configured with newer versions of some of the software and also isn't loading our existing models or building new ones.  This also seems to be having issues with the dolphin drivers.
  • /diskless/root.jessie: This is an entirely new boot image built from scratch with Debian jessie, using an RTS-patched 3.2 kernel.  This would use the latest versions of everything.  It's mostly working, we just need to rebuild the dolphin driver and source.

It seems that in all cases we need to rebuild the dolphin drivers from source.

To clarify, we're able to boot the x1boot image with the existing 2.6.25 kernel that we have from fb.  The issue with the root.x1boot image is not the kernel version but some of the other support libraries, such as dolphin.

  13130   Fri Jul 21 18:03:17 2017   Jamie   Update   CDS   Update on front-end/DAQ rebuild

Update:

  • front ends booting with the new Debian jessie diskless root image and a linux 3.2 version of the RTS-patched kernel
  • dolphin is configured correctly and running on c1lsc and c1sus
  • models building and running with RCG 3.0.3

Up next:

  • add c1ioo to the dolphin network
  • recompile/restart all front end models
  • daqd

I'll try to get the first two of those done tomorrow, although it's unclear what model updates we'll have to do to get things working with the newer RCG.

 

  13133   Sun Jul 23 22:16:55 2017   Jamie, gautam   Update   CDS   front-end now running with new OS, RCG

All front ends and models are (mostly) running now.

All suspensions are damped.

It should be possible at this point to do more recovery, like locking the MC.

Some details on the restore process:

  • all models were recompiled with the new RCG version 3.0.3
  • the new RCG does stricter Simulink drawing checks, and was complaining about unterminated outputs in some of the SUS models.  I terminated all the outputs it was concerned about and saved the models.
  • RCG 3.0 requires a new directory for doing better filter module diagnostics: /opt/rtcds/caltech/c1/chans/tmp
  • had to reset the slow machines c1susaux, c1auxex, c1auxey

The daqd is not yet running.  This is the next task.

I have been taking copious notes and will fully document the restore process once complete.

c1ioo issues

c1ioo has been giving us a little bit of trouble.  The c1ioo model kept crashing and taking down the whole c1ioo host.  We found a red light on one of the ADCs (ADC1).  We pulled the card and replaced it with a spare from the CDS cabinet.  That seemed to fix the problem and c1ioo became more stable.

We've still been seeing a lot of glitching in c1ioo, though, with CPU cycle times frequently (every couple of seconds) running above threshold for all models, up to 200 us.  I tried unloading every kernel module I could and shutting down every non-critical process, but nothing seemed to help.

We eventually tried stopping the c1ioo model altogether and that seemed to help quite a bit, dropping the long cycle rate down to something like one every 30 seconds or so.  Not sure what that means.  We should look into the BIOS again, to see if there could be something interacting with the newer kernel.

So currently the c1ioo model is not running (which is why it's all white in the CDS overview snapshot above).  The fact that c1ioo is not running and the remaining models are still occasionally glitching is also causing various IPC errors on auxiliary models (see c1mcs, c1rfm, c1ass, c1asx).

RCG compile warnings

The new RCG tries to do more checks on custom C code, but it seems to be having trouble finding our custom "ccodeio.h" files that live with the C definitions in USERAPPS/*/common/src/.  It's unclear why yet.  This is causing the RCG to spit out warnings like the following:

Cannot verify the number of ins/outs for C function BLRMS.
    File is /opt/rtcds/userapps/release/cds/c1/src/BLRMSFILTER.c
    Please add file and function to CDS_SRC or CDS_IFO_SRC ccodeio.h file.

These are just warnings and will not prevent the models from compiling or running.  We'll figure out what the problem is to make these go away, but they can be ignored for the time being.

model unload instability

Probably the worst problem we're facing right now is an instability that will occasionally, but not always, cause the entire front end host to freeze up upon unloading an RTS kernel module.  This is a known issue with the newer linux kernels (we're using kernel version 3.2.35), and is being looked into.

This is particularly annoying with the machines on the dolphin network, since if one of the dolphin hosts goes down it manages to crash all the models reading from the dolphin network.  Since half the time they can't be cleanly restarted, this tends to cause a boot fest with c1sus, c1lsc, and c1ioo.  If this happens, just restart those machines, wait till they've all fully booted, then restart all the models on all hosts with "rtcds start all".

  13135   Mon Jul 24 10:45:23 2017   gautam   Update   CDS   c1iscex models died

This morning, all the c1iscex models were dead. Attachment #1 shows the state of the cds overview screen when I came in. The machine itself was ssh-able, so I just restarted all the models and they came back online without fuss.

Quote:

All front ends and models are (mostly) running now

Attachment 1: c1iscexFailure.png
  13136   Mon Jul 24 10:59:08 2017   Jamie   Update   CDS   c1iscex models died
Quote:

This morning, all the c1iscex models were dead. Attachment #1 shows the state of the cds overview screen when I came in. The machine itself was ssh-able, so I just restarted all the models and they came back online without fuss.

This was me.  I had rebooted that machine and hadn't restarted the models.  Sorry for the confusion.

  13138   Mon Jul 24 19:28:55 2017   Jamie   Update   CDS   front end MX stream network working, glitches in c1ioo fixed

MX/OpenMX network running

Today I got the mx/open-mx networking working for the front ends.  This required some tweaking to the network interface configuration for the diskless front ends, and recompiling mx and open-mx for the newer kernel.  Again, this will all be documented.

controls@fb1:~ 0$ /opt/mx/bin/mx_info
MX Version: 1.2.16
MX Build: root@fb1:/opt/src/mx-1.2.16 Mon Jul 24 11:33:57 PDT 2017
1 Myrinet board installed.
The MX driver is configured to support a maximum of:
    8 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
===================================================================
Instance #0:  364.4 MHz LANai, PCI-E x8, 2 MB SRAM, on NUMA node 0
    Status:        Running, P0: Link Up
    Network:    Ethernet 10G

    MAC Address:    00:60:dd:43:74:62
    Product code:    10G-PCIE-8B-S
    Part number:    09-04228
    Serial number:    485052
    Mapper:        00:60:dd:43:74:62, version = 0x00000000, configured
    Mapped hosts:    6

                                                        ROUTE COUNT
INDEX    MAC ADDRESS     HOST NAME                        P0
-----    -----------     ---------                        ---
   0) 00:60:dd:43:74:62 fb1:0                             1,0
   1) 00:30:48:be:11:5d c1iscex:0                         1,0
   2) 00:30:48:bf:69:4f c1lsc:0                           1,0
   3) 00:25:90:0d:75:bb c1sus:0                           1,0
   4) 00:30:48:d6:11:17 c1iscey:0                         1,0
   5) 00:14:4f:40:64:25 c1ioo:0                           1,0
controls@fb1:~ 0$

c1ioo timing glitches fixed

I also checked the BIOS on c1ioo and found that the serial port was enabled, which is known to cause timing glitches.  I turned off the serial port (and some power management stuff), and rebooted, and all the c1ioo timing glitches seem to have gone away.

It's unclear why this is a problem that's just showing up now.  Serial ports have always been a problem, so it seems unlikely this is just a problem with the newer kernel.  Could the BIOS have somehow been reset during the power glitch?

In any event, all the front ends are now booting cleanly, with all dolphin and mx networking coming up automatically, and all models running stably.

Now for daqd...

  13139   Mon Jul 24 19:57:54 2017   gautam   Update   CDS   IMC locked, Autolocker re-enabled

Now that all the front end models are running, I re-aligned the IMC, locked it manually, and then tweaked the alignment some more. The IMC transmission now is hovering around 15300 counts. I re-enabled the Autolocker and FSS Slow loops on Megatron as well.

Quote:

MX/OpenMX network running

Today I got the mx/open-mx networking working for the front ends.  This required some tweaking to the network interface configuration for the diskless front ends, and recompiling mx and open-mx for the newer kernel.  Again, this will all be documented.

 

  13145   Wed Jul 26 19:13:07 2017   Jamie   Update   CDS   daqd showing same instability as before

I recompiled daqd on the updated fb1, similar to how I had before, and we're seeing the same instability: the process crashes when it tries to write out the second trend (technically, it looks like it crashes while trying to write out the full frame while the second trend is also being written out).  Jonathan Hanks and I are actively looking into it and I'll provide a further report soon.

  13149   Fri Jul 28 20:22:41 2017   Jamie   Update   CDS   possible stable daqd configuration with separate DC and FW

This week Jonathan Hanks and I have been trying to diagnose why the daqd has been unstable in the configuration used by the 40m, with data concentrator (dc) and frame writer (fw) in the same process (referred to generically as 'fb').  Jonathan has been digging into the core dumps and source to try to figure out what's going on, but he hasn't come up with anything concrete yet.

As an alternative, we've started experimenting with a daqd configuration with the dc and fw components running in separate processes, with communication over the local loopback interface.  The separate dc/fw process model more closely matches the configuration at the sites, although the sites put the dc and fw processes on different physical machines.  Our experimentation thus far seems to indicate that this configuration is stable, although we haven't yet tested it with the full configuration, which is what I'm attempting to do now.

Unfortunately I'm having trouble with the mx_stream communication between the front ends and the dc process.  The dc does not appear to be receiving the streams from the front ends and is producing a '0xbad' status message for each.  I'm investigating.

  13152   Mon Jul 31 15:13:24 2017   gautam   Update   CDS   FB ---> FB1

[jamie, gautam]

In order to test the new daqd config that Jamie has been working on, we felt it would be most convenient for the host name "fb" (martian network IP 192.168.113.202) to point to the physical machine "fb1" (martian network IP 192.168.113.201).

I made this change in /var/lib/bind/martian.hosts on chiara, and then ran sudo service bind9 restart. It seems to have done the job. So as things stand, both hostnames "fb" and "fb1" point to 192.168.113.201.

Now, when starting up DTT or dataviewer, the NDS server is automatically found.

More details to follow.
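For the record, the change presumably amounts to pointing the fb A record at fb1's address, i.e. something like the following in /var/lib/bind/martian.hosts (the record syntax here is a sketch, not a copy of the actual zone file):

fb      IN      A       192.168.113.201
fb1     IN      A       192.168.113.201

controls@chiara:~$ sudo service bind9 restart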

  13153   Mon Jul 31 18:44:40 2017   Jamie   Update   CDS   CDS system essentially fully recovered

The CDS system is mostly fully recovered at this point.  The mx_streams are all flowing from all front ends, and from all models, and the daqd processes are receiving them and writing the data to frames.

Remaining unresolved issues:

  • IFO needs to be fully locked to make sure ALL components of all models are working.
  • The remaining red status lights are from the "FB NET" diagnostics, which are reflecting a missing status bit from the front end processes due to the fact that they were compiled with an earlier RCG version (3.0.3) than the mx_streams were (3.3+/trunk).  There will be a new release of the RTS soon, at which point we'll compile everything from the same version, which should get us all green again.
  • The entire system has been fully modernized, to the target CDS reference OS (Debian jessie) and more recent RCG versions.  The management of the various RTS components, both on the front ends and on fb, have as much as possible been updated to use the modern management tools (e.g. systemd, udev, etc.).  These changes need to be documented.  In particular...
  • The fb daqd process has been split into three separate components, a configuration that mirrors what is done at the sites and appears to be more stable:
    • daqd_dc: data concentrator (receives data from front ends)
    • daqd_fw: receives frames from dc and writes out full frames and second/minute trends
    • daqd_rcv: NDS1 server (raises test points and receives archive data from frames from 'nds' process)
    The "target" directory for all of these new components is:
    • /opt/rtcds/caltech/c1/target/daqd
    All of these processes are now managed under systemd supervision on fb, meaning the daqd restart procedure has changed.  This needs to be simplified and clarified.
  • Second trend frames are being written, but for some reason they're not accessible over NDS.
  • Have not had a chance to verify minute trend and raw minute trend writing yet.  Needs to be confirmed.
  • Get wiper script working on new fb.
  • Front end RTS kernel will occasionally crash when the RTS modules are unloaded.  Keith Thorne apparently has a kernel version with a different set of patches from Gerrit Kuhn that does not have this problem.  Keith's kernel needs to be packaged and installed in the front end diskless root.
  • The models accessing the dolphin shared memory will ALL crash when one of the front end hosts on the dolphin network goes away.  This results in a boot fest of all the dolphin-enabled hosts.  Need to figure out what's going on there.
  • The RCG settings snapshotting has changed significantly in later RCG versions.  We need to make sure that all burt backup type stuff is still working correctly.
  • Restoration of /frames from old fb SCSI RAID?
  • Backup of entirety of fb1, including fb1 root (/) and front end diskless root (/diskless)
  • Full documentation of rebuild procedure from Jamie's notes.
  13161   Thu Aug 3 00:59:33 2017   gautam   Update   CDS   NDS2 server restarted, /frames mounted on megatron

[Koji, Nikhil, Gautam]

We couldn't get data using python nds2. There seems to have been many problems.

  1. /frames wasn't mounted on megatron, which was the nds2 server. Solution: added /frames 192.168.113.209(sync,ro,no_root_squash,no_all_squash,no_subtree_check) to /etc/exports on fb1, followed by sudo exportfs -ra. Using showmount -e, we confirmed that /frames was being exported.
  2. Edited /etc/fstab on megatron to be fb1:/frames/ /frames nfs ro,bg,soft 0 0. Tried to run mount -a, but console stalled.
  3. Used nfsstat -m on megatron. Found out that megatron was trying to mount /frames from old FB (192.168.113.202). Used sudo umount -f /frames to force unmount /frames/ (force was required).
  4. Re-ran mount -a on megatron.
  5. Killed nds2 using /etc/init.d/nds2 stop - didn't work, so we manually kill -9'ed it.
  6. Restarted nds2 server using /etc/init.d/nds2 start.
  7. Waited for ~10mins before everything started working again. Now usual nds2 data getting methods work.

I have yet to check getting trend data via nds2 - I can't find the syntax. EDIT: As Jamie mentioned in his elog, the second trend data is being written but is inaccessible over NDS (either with dataviewer, which uses fb as the NDS server, or with python nds2, which uses megatron as the NDS server). So as of now, we cannot read any kind of trends directly, although full data from the past can be downloaded either with dataviewer or python nds2. On the control room workstations, this can also be done with cds.getdata.

  13162   Thu Aug 3 10:51:32 2017   rana   Update   CDS   NDS2 server restarted, /frames mounted on megatron

same issue on NODUS; I edited the /etc/fstab and tried mount -a, but it gives this error:

controls@nodus|~ 1> sudo mount -a
mount.nfs: access denied by server while mounting fb1:/frames

needs more debugging - this is the machine that allows us to have backed up frames in LDAS. Permissions issues from fb1 ?

  13163   Thu Aug 3 11:11:29 2017   gautam   Update   CDS   NDS2 server restarted, /frames mounted on nodus

I added nodus' eth0 IP (192.168.113.200) to the list of allowed NFS clients in /etc/exports on fb1, and then ran sudo mount -a on nodus. Now /frames is mounted.
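The added export line presumably mirrors the megatron entry from elog 13161, with nodus' IP substituted (a sketch):

/frames 192.168.113.200(sync,ro,no_root_squash,no_all_squash,no_subtree_check)

controls@fb1:~$ sudo exportfs -ra    # re-read /etc/exports and re-export
controls@nodus:~$ sudo mount -a      # mount everything listed in /etc/fstab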

Quote:

needs more debugging - this is the machine that allows us to have backed up frames in LDAS. Permissions issues from fb1 ?

 

  13164   Thu Aug 3 19:46:27 2017   Jamie   Update   CDS   new daqd restart procedure

This is the daqd restart procedure:

$ ssh fb1 sudo systemctl restart daqd_*

That will restart all of the daqd services (daqd_dc, daqd_fw, daqd_rcv).

The front end mx_stream processes should all auto-restart after the daqd_dc comes back up.  If they don't (models show "0x2bad" on DC0_*_STATUS) then you can execute the following to restart the mx_stream process on the front end:

$ ssh c1<host> sudo systemctl restart mx_stream

 

 

  13165   Thu Aug 3 20:15:11 2017   Jamie   Update   CDS   dataviewer can not raise test points

For some reason dataviewer is not able to raise test points with the new daqd setup, even though dtt can.  If you raise a test point with dtt then dataviewer can show the data fine.

It's unclear to me why this would be the case.  It might be that all the versions of dataviewer on the workstations are too old??  I'll look into it tomorrow to see if I can figure out what's going on.

  13166   Fri Aug 4 09:07:28 2017   rana   Update   CDS   CDS system essentially NOT fully recovered

Tried getting trends with dataviewer just now since Jamie re-enabled the minute_raw frame writing yesterday. Unable to get trends still:

Connecting to NDS Server fb1 (TCP port 8088)
Connecting.... done
Server error 18: trend data is not available
datasrv: DataWriteTrend failed in daq_send().
unknown error returned from daq_send()
T0=17-08-04-08-02-22; Length=28800 (s)
No data output.

  13185   Thu Aug 10 14:25:52 2017   gautam   Update   CDS   Slow EPICS channels -> Frames re-enabled

I went into /opt/rtcds/caltech/c1/target/daqd, opened the master file, and uncommented the line with C0EDCU.ini (this is the file in which all the slow machine channels are defined). So now I am able to access, for example, the c1vac1 channels.

The location of the master file is no longer in /opt/rtcds/caltech/c1/target/fb, but is in the above mentioned directory instead. This is part of the new daqd paradigm in which separate processes are handling the data transfer between FEs and FB, and the actual frame-writing. Jamie will explain this more when he summarizes the CDS revamp.
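Concretely, the change was to uncomment a line of this general form in the master file (the exact path in the ini entry is an assumption from memory):

#/opt/rtcds/caltech/c1/chans/daq/C0EDCU.ini    <-- before (commented out)
/opt/rtcds/caltech/c1/chans/daq/C0EDCU.ini     <-- after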

It looks like trend data is also available for these newly enabled channels, but thus far, I've only checked second trends. I will update with a more exhaustive check later in the evening.

So, the two major pending problems (that I can think of) are:

  1. Inability to unload models cleanly
  2. Inability of dataviewer (and cdsutils) to open testpoints.

Apart from this, dataviewer frequently hangs on Donatella at startup. I used ipcs -a | grep 0x | awk '{printf( "-Q %s ", $1 )}' | xargs ipcrm to remove all the extra messages in the dataviewer queue.


Restarting the daqd processes on fb1 using Jamie's instructions from earlier in this thread works - but the mx_stream processes do not seem to come back automatically on c1lsc, c1sus and c1ioo (reasons unknown). I've made a copy of the mxstreamrestart.sh script with the new mxstream restart commands, called mxstreamrestart_debian.sh, which lives in /opt/rtcds/caltech/c1/scripts/cds. I've also modified the CDS overview MEDM screen such that the "mxstream restart" calls this modified script. For now, this requires you to enter the controls password for each machine. I don't know what is a secure way to do it otherwise, but I recall not having to do this in the past with the old mxstreamrestart.sh script.

  13189   Fri Aug 11 00:10:03 2017   gautam   Update   CDS   Slow EPICS channels -> Frames re-enabled

Seems like something has failed after I did this - full frames have not been written since ~2.30pm PDT on Aug 10. I found out when I tried to download some of the free-swinging MC1 data.

To clarify, I logged into fb1, and ran sudo systemctl restart daqd_*. The only change I made was to uncomment the line quoted below in the master file.

Looking at the log using systemctl, I see the following (I just tried restarting the daqd processes again):

Aug 11 00:00:31 fb1 daqd_fw[16149]: LDASUnexpected::unexpected: Caught unexpected exception      "This is a bug. Please log an LDAS problem report including this message.
Aug 11 00:00:31 fb1 daqd_fw[16149]: daqd_fw: LDASUnexpected.cc:131: static void LDASTools::Error::LDASUnexpected::unexpected(): Assertion `false' failed.
Aug 11 00:00:32 fb1 systemd[1]: daqd_fw.service: main process exited, code=killed, status=6/ABRT
Aug 11 00:00:32 fb1 systemd[1]: Unit daqd_fw.service entered failed state.
Aug 11 00:00:32 fb1 systemd[1]: daqd_fw.service holdoff time over, scheduling restart.
Aug 11 00:00:32 fb1 systemd[1]: Stopping Advanced LIGO RTS daqd frame writer...
Aug 11 00:00:32 fb1 systemd[1]: Starting Advanced LIGO RTS daqd frame writer...
Aug 11 00:00:32 fb1 systemd[1]: daqd_fw.service start request repeated too quickly, refusing to start.
Aug 11 00:00:32 fb1 systemd[1]: Failed to start Advanced LIGO RTS daqd frame writer.
Aug 11 00:00:32 fb1 systemd[1]: Unit daqd_fw.service entered failed state.

Oddly, I am able to access second trends for the same channels from the past (which will be useful for the MC1 debugging). Not sure what's going on.


The live data grabbing using cdsutils still seems to be working though - so I've kicked MC1 again, and am grabbing 2 hours of data live on Pianosa.

Quote:

I went into /opt/rtcds/caltech/c1/target/daqd, opened the master file, and uncommented the line with C0EDCU.ini (this is the file in which all the slow machine channels are defined). So now I am able to access, for example, the c1vac1 channels.

The location of the master file is no longer in /opt/rtcds/caltech/c1/target/fb, but is in the above mentioned directory instead. This is part of the new daqd paradigm in which separate processes are handling the data transfer between FEs and FB, and the actual frame-writing. Jamie will explain this more when he summarizes the CDS revamp.

It looks like trend data is also available for these newly enabled channels, but thus far, I've only checked second trends. I will update with a more exhaustive check later in the evening.

So, the two major pending problems (that I can think of) are:

  1. Inability to unload models cleanly
  2. Inability of dataviewer (and cdsutils) to open testpoints.

Apart from this, dataviewer frequently hangs on Donatella at startup. I used ipcs -a | grep 0x | awk '{printf( "-Q %s ", $1 )}' | xargs ipcrm to remove all the extra messages in the dataviewer queue.


Restarting the daqd processes on fb1 using Jamie's instructions from earlier in this thread works - but the mx_stream processes do not seem to come back automatically on c1lsc, c1sus and c1ioo (reasons unknown). I've made a copy of the mxstreamrestart.sh script with the new mxstream restart commands, called mxstreamrestart_debian.sh, which lives in /opt/rtcds/caltech/c1/scripts/cds. I've also modified the CDS overview MEDM screen such that the "mxstream restart" calls this modified script. For now, this requires you to enter the controls password for each machine. I don't know what is a secure way to do it otherwise, but I recall not having to do this in the past with the old mxstreamrestart.sh script.

 

  13192   Fri Aug 11 11:14:24 2017   gautam   Update   CDS   Slow EPICS channels -> Frames re-enabled

I commented out the line pertaining to C0EDCU again, now full frames are being written again.

But we no longer have access to the slow EPICS records.

I am not sure what the failure mode is here - in the master file, there is a line that says the EDCU list "*MUST* COME *AFTER* ALL OTHER FAST INI DEFINITIONS", which it does. But there are a bunch of lines that are testpoint lists after this EDCU line. I wonder if that is the problem?
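Schematically, the ordering in question looks something like this (all file names other than C0EDCU.ini are illustrative):

/opt/rtcds/caltech/c1/chans/daq/C1LSC.ini      # fast model ini definitions...
/opt/rtcds/caltech/c1/chans/daq/C1SUS.ini      # ...all come first
/opt/rtcds/caltech/c1/chans/daq/C0EDCU.ini     # EDCU list, after the fast inis as instructed
/opt/rtcds/caltech/c1/target/gds/param/tpchn_c1lsc.par    # testpoint lists come after - the suspect?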

Quote:

Seems like something has failed after I did this - full frames have not been written since ~2.30pm PDT on Aug 10. I found out when I tried to download some of the free-swinging MC1 data.

 

  13197   Fri Aug 11 18:53:35 2017   gautam   Update   CDS   Slow EPICS channels -> Frames re-enabled
Quote:

Seems like something has failed after I did this - full frames have not been written since ~2.30pm PDT on Aug 10. I found out when I tried to download some of the free-swinging MC1 data.

To clarify, I logged into fb1, and ran sudo systemctl restart daqd_*. The only change I made was to uncomment the line quoted below in the master file.

Looking at the log using systemctl, I see the following (I just tried restarting the daqd processes again):

Aug 11 00:00:31 fb1 daqd_fw[16149]: LDASUnexpected::unexpected: Caught unexpected exception      "This is a bug. Please log an LDAS problem report including this message.
Aug 11 00:00:31 fb1 daqd_fw[16149]: daqd_fw: LDASUnexpected.cc:131: static void LDASTools::Error::LDASUnexpected::unexpected(): Assertion `false' failed.
Aug 11 00:00:32 fb1 systemd[1]: daqd_fw.service: main process exited, code=killed, status=6/ABRT
Aug 11 00:00:32 fb1 systemd[1]: Unit daqd_fw.service entered failed state.
Aug 11 00:00:32 fb1 systemd[1]: daqd_fw.service holdoff time over, scheduling restart.
Aug 11 00:00:32 fb1 systemd[1]: Stopping Advanced LIGO RTS daqd frame writer...
Aug 11 00:00:32 fb1 systemd[1]: Starting Advanced LIGO RTS daqd frame writer...
Aug 11 00:00:32 fb1 systemd[1]: daqd_fw.service start request repeated too quickly, refusing to start.
Aug 11 00:00:32 fb1 systemd[1]: Failed to start Advanced LIGO RTS daqd frame writer.
Aug 11 00:00:32 fb1 systemd[1]: Unit daqd_fw.service entered failed state.

Oddly, I am able to access second trends for the same channels from the past (which will be useful for the MC1 debugging). Not sure what's going on.


The live data grabbing using cdsutils still seems to be working though - so I've kicked MC1 again, and am grabbing 2 hours of data live on Pianosa.

So we tried this again with a fresh build of daqd_fw, and it still fails.  The error message is pointing to an underlying bug in the framecpp library ("LDASTools"), which may be tricky to solve.  I'm rustling the appropriate bushes...

  13198   Fri Aug 11 19:34:49 2017   Jamie   Update   CDS   CDS final bits status update

So it appears we now have full frames and second, minute, and minute_raw trends.

We are still not able to raise test points with daqd_rcv (e.g. the NDS1 server), which is why dataviewer and nds2-client can't get test points on their own.

We were not able to add the EDCU (EPICS client) channels without daqd_fw crashing.

We have a new kernel image that's supposed to solve the module unload instability issue.  In order to try it we'll need to restart the entire system, though, so I'll do that on Monday morning.

I've got the CDS guys investigating the test point and EDCU issues, but we won't get any action on that until next week.

Quote:

Remaining unresolved issues:

  • IFO needs to be fully locked to make sure ALL components of all models are working.
  • The remaining red status lights are from the "FB NET" diagnostics, which are reflecting a missing status bit from the front end processes due to the fact that they were compiled with an earlier RCG version (3.0.3) than the mx_streams were (3.3+/trunk).  There will be a new release of the RTS soon, at which point we'll compile everything from the same version, which should get us all green again.
  • The entire system has been fully modernized, to the target CDS reference OS (Debian jessie) and more recent RCG versions.  The management of the various RTS components, both on the front ends and on fb, have as much as possible been updated to use the modern management tools (e.g. systemd, udev, etc.).  These changes need to be documented.  In particular...
  • The fb daqd process has been split into three separate components, a configuration that mirrors what is done at the sites and appears to be more stable:
    • daqd_dc: data concentrator (receives data from front ends)
    • daqd_fw: receives frames from dc and writes out full frames and second/minute trends
    • daqd_rcv: NDS1 server (raises test points and receives archive data from frames from 'nds' process)
    The "target" directory for all of these new components is:
    • /opt/rtcds/caltech/c1/target/daqd
    All of these processes are now managed under systemd supervision on fb, meaning the daqd restart procedure has changed.  This needs to be simplified and clarified.
  • Second trend frames are being written, but for some reason they're not accessible over NDS.
  • Have not had a chance to verify minute trend and raw minute trend writing yet.  Needs to be confirmed.
  • Get wiper script working on new fb.
  • Front end RTS kernel will occasionally crash when the RTS modules are unloaded.  Keith Thorne apparently has a kernel version with a different set of patches from Gerrit Kuhn that does not have this problem.  Keith's kernel needs to be packaged and installed in the front end diskless root.
  • The models accessing the dolphin shared memory will ALL crash when one of the front end hosts on the dolphin network goes away.  This results in a boot fest of all the dolphin-enabled hosts.  Need to figure out what's going on there.
  • The RCG settings snapshotting has changed significantly in later RCG versions.  We need to make sure that all burt backup type stuff is still working correctly.
  • Restoration of /frames from old fb SCSI RAID?
  • Backup of entirety of fb1, including fb1 root (/) and front end diskless root (/diskless)
  • Full documentation of rebuild procedure from Jamie's notes.
  13205   Mon Aug 14 19:41:46 2017   Jamie   Update   CDS   front-end/DAQ network down for kernel upgrade, and timing errors

I'm upgrading the linux kernel for all the front ends to one that is supposedly more stable and won't freeze when we unload RTS models (linux-image-3.2.88-csp).  Since it's a different kernel version, it requires rebuilds of all the kernel-related support stuff (mbuf, symmetricom, mx, open-mx, dolphin) and all the front end models.  All the support stuff has been upgraded, but we're now waiting on the front end rebuilds, which take a while.

Initial testing indicates that the kernel is more stable; we're mostly able to unload/reload RTS modules without the kernel freezing.  However, the c1iscey host seems to be oddly problematic and has frozen twice so far on module unloads.  None of the other hosts have frozen on unload (yet), though, so the situation is still not clear.

We're now seeing some timing errors between the front ends and daqd, resulting in a "0x4000" status message in the 'C1:DAQ-DC0_*_STATUS' channels.  Part of the problem was an issue with the IRIG-B/GPS receiver timing unit, which I'll log in a separate post.  Another part of the problem was a bug in the symmetricom driver, which has been resolved.  That wasn't the whole problem, though, since we're still seeing timing errors.  Working with Jonathan to resolve.

  13207   Mon Aug 14 20:12:09 2017   Jamie, Gautam   Update   CDS   Weird problem with GPS receiver

Today we saw a weird issue with the GPS receiver (EndRun Technologies Tempus LX).  GPS timing on fb1 (which is handled via an IRIG-B connection to the receiver with a Spectracom card) was off by +18 seconds.  We tried resetting the GPS receiver and it still came up with a +18 second offset.  To be clear, the GPS receiver unit itself was showing a time on its front panel that looked close enough to 24-hour UTC, but was off by +18s.  The display also said "GPS" vertically to the right of the time.

We started exploring the settings on the GPS receiver and found this menu item:

Clock -> "Time Mode" -> "UTC"/"GPS"/"Local"

The setting when we found it was "GPS", which seems logical enough.  However, when we switched it to "UTC" the time as shown on the front panel was correct, now with "UTC" vertically to the right of the time, and fb1 was then showing the correct GPS time.

From the manual:

Time Mode
Time mode defines the time format used for the front-panel time display and, if installed, the optional
time code or Serial Time output. The time mode does not affect the NTP output, which is always
UTC. Possible values for the time mode are GPS, UTC, and local time. GPS time is derived from
the GPS satellite system. UTC is GPS time minus the current leap second correction. Local time is
UTC plus local offset and Daylight Savings Time. The local offset and daylight savings time displays
are described below.

The fact that moving to "UTC" fixed the problem, even though that is supposed to remove the leap second correction, might indicate that there's another bug in the symmetricom driver...

  13211   Tue Aug 15 16:32:42 2017   Jamie, Gautam   Update   CDS   GPS receiver apparently set to correct mode as "UTC"
Quote:

The setting when we found it was "GPS", which seems logical enough.  However, when we switched it to "UTC" the time as shown on the front panel was correct, now with "UTC" vertically to the right of the time, and fb1 was then showing the correct GPS time.

From Keith Thorne:

In the GPS receiver, you are trying to match the IRIG-B output format that is created by the aLIGO IRIG-B Fanout.  Since we have to prep the aLIGO IRIG-B Fanout every time there is a leap second coming, I would suspect that we are sending UTC to the IRIG-B receivers.  Thus, the GPS receiver needs to be set to that mode.

Soooo, "UTC" is the correct mode for the GPS receiver.

  13212   Wed Aug 16 14:54:13 2017   gautam   Update   CDS   PSL monitoring Acromag EPICS server restarted

[johannes, gautam, jamie]

  • Made a directory /opt/rtcds/caltech/c1/scripts/Acromag/PSL, into which I copied over the files needed by modbusApp to start the server from Lydia's user directory
  • Edited /ligo/apps/ubuntu12/ligoapps-user-env.sh to export a couple of EPICS variables to facilitate easy startup of the EPICS server
  • Started a tmux session on (soon to be re-christened?) megatron called "acroEPICS"
  • Ran the following command to start up the EPICS server:
${EPICS_MODULES}/modbus/bin/${EPICS_HOST_ARCH}/modbusApp npro_config.cmd

To do:

  1. Make a startup script that runs the above command (a sketch follows below) - eventually this can contain the initialization instructions for all the Acromags
  2. Figure out the initctl/systemctl stuff to make the server automatically restart if it drops for some reason (e.g. power failure)
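A sketch of what such a startup script could look like (the script name is illustrative; the env file and modbusApp invocation are the ones used above):

#!/bin/bash
# startAcroEPICS.sh (illustrative name) - start the PSL Acromag EPICS server.
# Pull in EPICS_MODULES and EPICS_HOST_ARCH, which are exported here:
source /ligo/apps/ubuntu12/ligoapps-user-env.sh
cd /opt/rtcds/caltech/c1/scripts/Acromag/PSL
exec ${EPICS_MODULES}/modbus/bin/${EPICS_HOST_ARCH}/modbusApp npro_config.cmd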
  13215   Wed Aug 16 17:05:53 2017   Jamie   Update   CDS   front-end/DAQ network down for kernel upgrade, and timing errors

The CDS system has now been moved to a supposedly more stable real-time-patched linux kernel (3.2.88-csp) and RCG r4447 (roughly the head of trunk, intended to be release 3.4).  With one major and one minor exception, everything seems to be working.

The remaining issues are:

  • RFM network down.  The IOP models on all hosts on the RFM network are not detecting their RFM cards.  Keith Thorne thinks that this is because of changes in trunk to support the new long-range PCIe that will be used at the sites, and that we just need to add a new parameter to the cdsParameters block in models that use RFM.  He and Rolf are looking into it for us.
  • The 3.2.88-csp kernel is still not totally stable.  On most hosts (c1sus, c1ioo, c1iscex) it seems totally fine and we're able to load/unload models without issue.  c1iscey is definitely problematic, frequently freezing on module unload.  There must be a hardware/bios issue involved here.  c1lsc has also shown some problems.  A better kernel is supposedly in the works.
  • NDS clients other than DTT are still unable to raise test points.  This appears to be an issue with the daqd_rcv component (i.e. NDS server) not properly resolving the front ends in the GDS network.  Still looking into this with Keith, Rolf, and Jonathan.

Issues that have been fixed:

  • "EDCU" channels, i.e. non-front-end EPICS channels, are now being acquired properly by the DAQ.  The front-ends now send all slow channels to the daq over the MX network stream.  This means that front end channels should no longer be specified in the EDCU ini file.  There were a couple in there that I removed, and that seemed to fix that issue.
  • Data should now be recorded in all formats: full frames, as well as second, minute, and raw_minute trends
  • All FE and DAQD diagnostics are green (other than the ones indicating the problems with the RFM network).  This was fixed by getting the front ends models, mx_stream processes, and daqd processes all compiled against the same version of the advLigoRTS, and adding the appropriate command line parameters to the mx_stream processes.
  13216   Wed Aug 16 17:14:02 2017   Koji   Update   CDS   front-end/DAQ network down for kernel upgrade, and timing errors

What's the current backup situation?

  13217   Wed Aug 16 18:01:28 2017   Jamie   Update   CDS   front-end/DAQ network down for kernel upgrade, and timing errors
Quote:

What's the current backup situation?

Good question.  We need to figure something out.  fb1 root is on a RAID1, so there is one layer of safety.  But we absolutely need a full backup of the fb1 root filesystem.  I don't have any great suggestions, other than just getting an external disk, 1TB or so, and copying all of root (minus the NFS mounts).
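One minimal approach along those lines (a sketch - the destination is an illustrative mount point for such an external disk; -x keeps rsync on the root filesystem, so NFS mounts, /proc, /sys etc. are skipped automatically):

controls@fb1:~$ sudo rsync -aHx --exclude=/frames / /mnt/fb1_root_backup/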

  13218   Wed Aug 16 18:06:01 2017   Koji   Update   CDS   front-end/DAQ network down for kernel upgrade, and timing errors

We also need to copy chiara's root. What is the best way to get the full image of the root FS?
We may need to restore these root images to a different disk with a different capacity.

Is the dump command good for this?

  13219   Wed Aug 16 18:50:58 2017   Jamie   Update   CDS   front-end/DAQ network down for kernel upgrade, and timing errors
Quote:

The remaining issues are:

  • RFM network down.  The IOP models on all hosts on the RFM network are not detecting their RFM cards.  Keith Thorne thinks that this is because of changes in trunk to support the new long-range PCIe that will be used at the sites, and that we just need to add a new parameter to the cdsParameters block in models that use RFM.  He and Rolf are looking into it for us.

RFM network is back!  Everything green again.

Use of RFM has been turned off in advLigoRTS trunk in favor of the new long-range PCIe networking being developed for the sites.  Rolf provided a single-line patch that re-enables it:

controls@c1sus:/opt/rtcds/rtscore/trunk 0$ svn diff
Index: src/epics/util/feCodeGen.pl
===================================================================
--- src/epics/util/feCodeGen.pl    (revision 4447)
+++ src/epics/util/feCodeGen.pl    (working copy)
@@ -122,7 +122,7 @@
 $diagTest = -1;
 $flipSignals = 0;
 $virtualiop = 0;
-$rfm_via_pcie = 1;
+$rfm_via_pcie = 0;
 $edcu = 0;
 $casdf = 0;
 $globalsdf = 0;
controls@c1sus:/opt/rtcds/rtscore/trunk 0$

This patch was applied to the RTS source checkout we're using for the FE builds (/opt/rtcds/rtscore/trunk, which is r4447, and is linked to /opt/rtcds/rtscore/release).  The following models that use RFM were re-compiled, re-installed, and re-started:

  • c1x02
  • c1rfm
  • c1x03
  • c1als
  • c1x01
  • c1scx
  • c1asx
  • c1x05
  • c1scy
  • c1tst

The re-compiled models now see the RFM cards (dmesg log from c1ioo):

[24052.203469] c1x03: Total of 4 I/O modules found and mapped
[24052.203471] c1x03: ***************************************************************************
[24052.203473] c1x03: 1 RFM cards found
[24052.203474] c1x03:     RFM 0 is a VMIC_5565 module with Node ID 180
[24052.203476] c1x03: address is 0xffffc90021000000
[24052.203478] c1x03: ***************************************************************************

This cleared up all RFM transmission error messages.

CDS upstream are working to make this RFM usage switchable in a reasonable way.

  13249   Thu Aug 24 17:36:11 2017   gautam   Update   CDS   FSS Slow Python maintenance

A couple of weeks ago, I was trying to modernize the python version of the FSS Slow temperature control loops when I accidentally ended up deleting it. There was no svn backup. So the old Perl PID script has been running for the last few days.

Today, I checked out the latest version that Andrew and co. have running in the PSL lab. I had to make some important modifications for the script to work for the 40m setup.

  1. The script is conveniently setup in a way that the channels it needs to read from / write to are read in from an .ini file. I renamed all the channels to match the appropriate 40m ones.
  2. We don't have a soft EPICS channel in which to define the setpoint for our PID servo (which is 0). Rather than poke around with slow machine EPICS records, I simply commented out this line in the script and hard-coded the value of 0. When we modernize to the Acromag era, we can set up an EPICS channel + MEDM slider for the setpoint.
  3. The way the Perl script was set up, the error signal was pre-scaled by a factor of 0.01, supposedly to make the PID gains be of order 1. For consistency, I re-inserted this scaling, which awade and co. had removed.
  4. Modified the FSSslowPy.init file to call the script in accordance with the new syntax:
python FSSSlow.py -i FSSSlowPy.ini

Then I stopped the Perl process on megatron by running

sudo initctl stop FSSslow

and started the Python process by running

sudo initctl start FSSslowPy

I have now committed the files FSSSlow.py and FSSSlowPy.ini to the 40m svn.  Things have seemed stable for the last 20 mins or so; let's keep an eye on this though - although we had been running the Python PID loop for some months, this version is a slightly modified one.

The initctl stuff still isn't very robust - I think both the Autolocker and the FSS slow servos have to be manually restarted if megatron is shut down/restarted for whatever reason. It doesn't seem to be a problem with the initctl routine itself - looking at the logs, I can see that init is trying to start both processes, but is failing to do so each time. To be investigated. The wiki procedure to restart this process is up to date.
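For when this gets revisited: an upstart job with respawn stanzas along these lines would restart the process if it dies, and would wait for the filesystems/network before starting it at boot (a sketch - the actual job files live in /etc/init/ on megatron, and the chdir path here is an assumption):

# /etc/init/FSSslowPy.conf (sketch)
description "FSS slow temperature PID loop"
start on (local-filesystems and net-device-up IFACE!=lo)
respawn
respawn limit 10 60
chdir /opt/rtcds/caltech/c1/scripts/PSL/FSS
exec python FSSSlow.py -i FSSSlowPy.ini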

GV Edit 0000 25 Aug 2017: I had to add a line to the script that checks MC transmission before enabling the PID loop. Change has been committed to svn. Now, when the MC loses lock or if the PSL shutter is kept closed for an extended period of time, the temperature loop doesn't rail.

  13262   Mon Aug 28 16:20:00 2017   gautam   Update   CDS   40m files backup situation

This elog is meant to summarize the current backup situation of critical 40m files.

What are the critical filesystems? I've also indicated the size of these disks and the volume currently used, and the current backup situation. 

FB1 root filesystem (1.7TB used / 2TB)
  • FB1 is the machine that hosts the diskless root for the front end machines
  • Additionally, it runs the daqd processes which write data from the realtime models into frame files
  • Backup status: not backed up

/frames (up to 24TB)
  • This is where the frame files are written
  • Need to set up a wiper script that periodically clears older data so that the disk doesn't overflow
  • Backup status: not backed up. LDAS pulls files from nodus daily via rsync, so there's no cron job for us to manage - we just allow incoming rsync.

Shared user area (1.6TB used / 2TB)
  • /home/cds on chiara
  • This is exported over NFS to the 40m workstations, FB1 etc.
  • Contains user directories, scripts, realtime models etc.
  • Backup status: local backup to /media/40mBackup on chiara via daily cronjob; remote backup to ldas-cit.ligo.caltech.edu::40m/cvs via daily cronjob on nodus

Chiara root filesystem (11GB used / 440GB)
  • This is the root filesystem for chiara
  • Contains the nameserver stuff for the martian network; responsible for rsyncing /home/cds
  • Backup status: not backed up

Megatron root filesystem (39GB used / 130GB)
  • Boot disk for megatron, which is our scripts machine
  • Runs the MC autolocker, FSS loops etc.
  • Also the NDS server facilitating data access from outside the martian network
  • Backup status: not backed up

Nodus root filesystem (77GB used / 355GB)
  • This is the boot disk for our gateway machine
  • Hosts the Elog, svn and wikis
  • Supposed to be responsible for sending email alerts for NFS disk usage and vacuum system N2 pressure
  • Backup status: not backed up

JETSTOR RAID array (12TB used / 13TB)
  • Old /frames
  • Archived frames from DRFPMI locks; long-term trends
  • Backup status: currently mounted on megatron, not backed up

Then there is Optimus, but I don't think there is anything critical on it. 

So, based on my understanding, we need to back up a whole bunch of stuff, particularly the boot disks and root filesystems for Chiara, Megatron and Nodus. We should also test that the backups we make are useful (i.e. we can recover current operating state in the event of a disk failure).

Please edit this elog if I have made a mistake. I also don't have any idea about whether there is any sort of backup for the slow computing system code.

  13263   Mon Aug 28 17:13:57 2017   ericq   Update   CDS   40m files backup situation

In addition to bootable full disk backups, it would be wise to make sure the important service configuration files from each machine are version controlled in the 40m SVN. Things like apache files on nodus, martian hosts and DHCP files on chiara, nds2 configuration and init scripts on megatron, etc. This can make future OS/hardware upgrades easier too.

  13273   Wed Aug 30 10:54:26 2017   gautam   Update   CDS   slow machine bootfest

MC autolocker and FSS loops were stuck because c1psl was unresponsive. I rebooted it and did a burtrestore to enable PSL locking. Then the IMC locked fine.

c1susaux and c1iscaux were also unresponsive, so I keyed those crates as well, after taking the usual steps to avoid ITMX getting stuck - but it still got stuck when the Sat. Box connectors were reconnected after the reboot, so I had to shake it loose with bias slider jiggling. This is annoying and also not very robust. I am afraid we are going to knock the ITMX magnets off at some point. Is this problem indicative of the ITMX magnets having been glued on in a skewed way? Or can we make the situation better by just tweaking the OSEM-holding fixtures on the cage?

In any case, I've started listing stuff down here for things we may want to do when we vent next.

 

  13279   Thu Aug 31 00:46:57 2017   rana   Summary   CDS   allegra -> Scientific Linux 7.3

I made a 'LiveCD' on a 16 GB USB stick using this command, after the GUI tools didn't work and after looking at some blog posts:

sudo dd if=SL-7.3-x86_64-2017-01-20-LiveCD.iso of=/dev/sdf

Quote:

Debian doesn't like EPICS. Or our XY plots of beam spots...Sad!

Quote:
Quote:

No, not confused on that point. We just will not be testing OS versions at the 40m or running multiple OS's on our workstations. As I've said before, we will only move to so-called 'reference' systems once they've been in use for a long time.

Ubuntu16 is not to my knowledge used for any CDS system anywhere.  I'm not sure how you expect to have better support for that.  There are no pre-compiled packages of any kind available for Ubuntu16.  Good luck, you big smelly doofuses. Nyah, nyah, nyah.

K Thorne recommends that we use SL7.3 with the 'xfce' window manager instead of the Debian family of products, so we'll try it out on allegra and rossa to see how it works for us. Hopefully the LLO CDS team will be the tip of the spear on solving the usual software problems we have when we "~up" grade.

  13282   Thu Aug 31 18:36:23 2017   gautam   Update   CDS   revisiting Acromag

Current status:

  • There is a single Acromag ADC unit installed in 1X4
  • It is presently hooked up to the PSL NPRO diagnostic connector channels
  • I had (re)-started the acquisition of these channels on August 16 - but for reasons unknown, the tmux session that was supposed to be running the EPICS server on megatron seems to have died on August 22 (judging by the trend plot of these channels, see Attachment #1)
  • I had not set up an upstart job that restarts the server automatically in such an event. I manually restarted it for now, following the same procedure as linked in my previous elog.
  • While I was at it, I also took the opportunity to edit the Acromag channel names to something more appropriate - all channels previously prefixed with C1:ACRO- have now been prefixed with C1:PSL-

Plan of action:

  1. Hardware - we have, in the lab, in addition to the installed ADC unit
    • 3x 8 channel differential input ADC units
    • 2x 8 channel differential output DAC units
    • 1x 16 channel BIO unit
    • 2U chassis + connectors + breakout boards + other misc hardware that I think Johannes and Lydia procured with the original plan to replace the EX slow controls.
    • Some relevant elogs: Panel designs, breakout design, sketch for proposed layout, preliminary channel list.
      So on the hardware side, it would seem that we have everything we need to go ahead with replacing the EX slow controls with an Acromag system, although Johannes probably knows more about our state of readiness from a hardware PoV.
  2. Software
    • We probably want to get a dedicated machine that will handle the EPICS channel serving for the Acromag system
    • Have to figure out the networking arrangement for such a machine
    • Have to figure out how to set up the EPICS server protocol in such a way that if it drops for whatever reason, it is automatically restarted

 

Attachment 1: Acromag_EPICS.png
  13293   Tue Sep 5 14:41:58 2017   gautam   Update   CDS   NDS2 server restarted on megatron

I was unable to download data using nds2. Gabriele had reported similar problems a week ago but I hadn't followed up on this.

I repeated steps 5-7 from elog 13161, and now I can get data from the nds2 servers again. It's unclear why the nds2 server had to be restarted. I wonder if this is somehow related to the mysterious Acromag EPICS server tmux session dropout.

  13297   Tue Sep 5 23:02:37 2017   gautam   Update   CDS   slow machine bootfest

MC autolocker was not working - PCdrive was railed at its upper rail for ~2 hours judging by the wall StripTool trace. I tried restarting the init processes on megatron, but that didn't fix the problem. The reason seems to have been related to c1iool0 failing - after keying the crate, autolocker came back fine and MC caught lock almost immediately.

Additionally, c1susaux, c1auxex, c1auxey and c1iscaux are also down. I'm not planning on using the IFO tonight, so I am not going to reboot these now.

 

  13312   Fri Sep 15 15:54:28 2017   gautam   Update   CDS   FB wiper script

A wiper script is not yet set up for our new frame builder. The disk usage is ~80% now, so we should start running a wiper script that manages overall disk usage by deleting the oldest frame files.

From what I could find on the elog, the way this was done previously was via a cron job on FB. There is a Perl script, /opt/rtcds/caltech/c1/target/fb/wiper.pl, which, from what I could understand, runs a bunch of du commands on different directories to determine whether any files need to be deleted.

I copied this script over to /opt/rtcds/caltech/c1/target/daqd/wiper.pl. This is the directory in which all the new FB stuff resides. Conveniently, the script has a "dry-run" option, which I tried running on FB1. However, I get the following error message:

Fri Sep 15 15:44:45 PDT 2017
Dry run, will not remove any files!!!
You need to rerun this with --delete argument to really delete frame files
Directory disk usage:
 /frames/trend/minute_rawk
Combined 0k or 0m or 0Gb
Illegal division by zero at ./wiper.pl line 98.


So it would seem that, for some reason, the du commands aren't working. From what I could tell, there aren't any directory paths specific to the old FB machine that need to be changed. I believe the script was working prior to the FB disk crash - unfortunately the script wasn't under version control, but I don't think any changes have been made to it.

Before I go down a Perl rabbit hole, has anyone seen such an error or is aware of some reason why this might not work on the new FB? Am I even using the correct scripts?

  13317   Mon Sep 18 17:17:49 2017   gautam   Update   CDS   FB wiper script

After trying to debug this issue using the Perl debugger, I concluded that the problem is in the part of the code that splits the output of the "du" command into directory and disk usage. For whatever reason, this isn't working. The version of Perl running on the new FB1 machine is 5.20.2, whereas I suspect the version running on the old FB machine was 5.14.2 (which is the version on all the Ubuntu 12 workstations and megatron). It's unclear whether downgrading the Perl version is the right way to go.
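To narrow this down, one can compare what du actually prints against whatever split pattern the script uses around line 98 (a sketch; the exact directories wiper.pl scans are assumptions):

controls@fb1:~$ du -sk /frames/full /frames/trend/minute_raw
# du prints "SIZE<TAB>PATH" per line, e.g.
#   1234567    /frames/full
# If the script's split assumes a different separator or field order, each
# directory parses to zero (consistent with the "Combined 0k" output above),
# and a zero total would produce exactly the "Illegal division by zero".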

The FB1 disk is now getting close to full, the usage is up to 85% today.

Quote:

Before I go down a Perl rabbit hole, has anyone seen such an error or is aware of some reason why this might not work on the new FB? Am I even using the correct scripts?

 
