40m QIL Cryo_Lab CTN SUS_Lab TCS_Lab OMC_Lab CRIME_Lab FEA ENG_Labs OptContFac Mariner WBEEShop
  40m Log, Page 55 of 346  Not logged in ELOG logo
ID Date Author Type Categoryup Subject
  13185   Thu Aug 10 14:25:52 2017 gautamUpdateCDSSlow EPICS channels -> Frames re-enabled

I went into /opt/rtcds/caltech/c1/target/daqd, opened the master file, and uncommented the line with C0EDCU.ini (this is the file in which all the slow machine channels are defined). So now I am able to access, for example, the c1vac1 channels.

The location of the master file is no longer in /opt/rtcds/caltech/c1/target/fb, but is in the above mentioned directory instead. This is part of the new daqd paradigm in which separate processes are handling the data transfer between FEs and FB, and the actual frame-writing. Jamie will explain this more when he summarizes the CDS revamp.

It looks like trend data is also available for these newly enabled channels, but thus far, I've only checked second trends. I will update with a more exhaustive check later in the evening.

So, the two major pending problems (that I can think of) are:

  1. Inability to unload models cleanly
  2. Inability of dataviewer (and cdsutils) to open testpoints.

Apart from this, dataviewer frequently hangs on Donatella at startup. I used ipcs -a | grep 0x | awk '{printf( "-Q %s ", $1 )}' | xargs ipcrm to remove all the extra messages in the dataviewer queue.


Restarting the daqd processes on fb1 using Jamie's instructions from earlier in this thread works - but the mx_stream processes do not seem to come back automatically on c1lsc, c1sus and c1ioo (reasons unknown). I've made a copy of the mxstreamrestart.sh script with the new mxstream restart commands, called mxstreamrestart_debian.sh, which lives in /opt/rtcds/caltech/c1/scripts/cds. I've also modified the CDS overview MEDM screen such that the "mxstream restart" calls this modified script. For now, this requires you to enter the controls password for each machine. I don't know what is a secure way to do it otherwise, but I recall not having to do this in the past with the old mxstreamrestart.sh script.

  13189   Fri Aug 11 00:10:03 2017 gautamUpdateCDSSlow EPICS channels -> Frames re-enabled

Seems like something has failed after I did this - full frames are no longer on Aug 10 being written since ~2.30pm PDT. I found out when I tried to download some of the free-swinging MC1 data.

To clarify, I logged into fb1, and ran sudo systemctl restart daqd_*. The only change I made was to uncomment the line quoted below in the master file.

Looking at the log using systemctl, I see the following (I just tried restarting the daqd processes again):

Aug 11 00:00:31 fb1 daqd_fw[16149]: LDASUnexpected::unexpected: Caught unexpected exception      "This is a bug. Please log an LDAS problem report including this message.
Aug 11 00:00:31 fb1 daqd_fw[16149]: daqd_fw: LDASUnexpected.cc:131: static void LDASTools::Error::LDASUnexpected::unexpected(): Assertion `false' failed.
Aug 11 00:00:32 fb1 systemd[1]: daqd_fw.service: main process exited, code=killed, status=6/ABRT
Aug 11 00:00:32 fb1 systemd[1]: Unit daqd_fw.service entered failed state.
Aug 11 00:00:32 fb1 systemd[1]: daqd_fw.service holdoff time over, scheduling restart.
Aug 11 00:00:32 fb1 systemd[1]: Stopping Advanced LIGO RTS daqd frame writer...
Aug 11 00:00:32 fb1 systemd[1]: Starting Advanced LIGO RTS daqd frame writer...
Aug 11 00:00:32 fb1 systemd[1]: daqd_fw.service start request repeated too quickly, refusing to start.
Aug 11 00:00:32 fb1 systemd[1]: Failed to start Advanced LIGO RTS daqd frame writer.
Aug 11 00:00:32 fb1 systemd[1]: Unit daqd_fw.service entered failed state.

Oddly, I am able to access second trends for the same channels from the past which will be useful for the MC1 debugging). Not sure whats going on.


The live data grabbing using cdsutils still seems to be working though - so I've kicked MC1 again, and am grabbing 2 hours of data live on Pianosa.

Quote:

I went into /opt/rtcds/caltech/c1/target/daqd, opened the master file, and uncommented the line with C0EDCU.ini (this is the file in which all the slow machine channels are defined). So now I am able to access, for example, the c1vac1 channels.

The location of the master file is no longer in /opt/rtcds/caltech/c1/target/fb, but is in the above mentioned directory instead. This is part of the new daqd paradigm in which separate processes are handling the data transfer between FEs and FB, and the actual frame-writing. Jamie will explain this more when he summarizes the CDS revamp.

It looks like trend data is also available for these newly enabled channels, but thus far, I've only checked second trends. I will update with a more exhaustive check later in the evening.

So, the two major pending problems (that I can think of) are:

  1. Inability to unload models cleanly
  2. Inability of dataviewer (and cdsutils) to open testpoints.

Apart from this, dataviewer frequently hangs on Donatella at startup. I used ipcs -a | grep 0x | awk '{printf( "-Q %s ", $1 )}' | xargs ipcrm to remove all the extra messages in the dataviewer queue.


Restarting the daqd processes on fb1 using Jamie's instructions from earlier in this thread works - but the mx_stream processes do not seem to come back automatically on c1lsc, c1sus and c1ioo (reasons unknown). I've made a copy of the mxstreamrestart.sh script with the new mxstream restart commands, called mxstreamrestart_debian.sh, which lives in /opt/rtcds/caltech/c1/scripts/cds. I've also modified the CDS overview MEDM screen such that the "mxstream restart" calls this modified script. For now, this requires you to enter the controls password for each machine. I don't know what is a secure way to do it otherwise, but I recall not having to do this in the past with the old mxstreamrestart.sh script.

 

  13192   Fri Aug 11 11:14:24 2017 gautamUpdateCDSSlow EPICS channels -> Frames re-enabled

I commented out the line pertaining to C0EDCU again, now full frames are being written again.

But we no longer have access to the slow EPICS records.

I am not sure what the failure mode is here - In the master file, there is a line that says the EDCU list "*MUST* COME *AFTER* ALL OTHER FAST INI DEFINITIONS" which it does. But there are a bunch of lines that are testpoint lists after this EDCU line. I wonder if that is the problem?

Quote:

Seems like something has failed after I did this - full frames are no longer on Aug 10 being written since ~2.30pm PDT. I found out when I tried to download some of the free-swinging MC1 data.

 

  13197   Fri Aug 11 18:53:35 2017 gautamUpdateCDSSlow EPICS channels -> Frames re-enabled
Quote:

Seems like something has failed after I did this - full frames are no longer on Aug 10 being written since ~2.30pm PDT. I found out when I tried to download some of the free-swinging MC1 data.

To clarify, I logged into fb1, and ran sudo systemctl restart daqd_*. The only change I made was to uncomment the line quoted below in the master file.

Looking at the log using systemctl, I see the following (I just tried restarting the daqd processes again):

Aug 11 00:00:31 fb1 daqd_fw[16149]: LDASUnexpected::unexpected: Caught unexpected exception      "This is a bug. Please log an LDAS problem report including this message.
Aug 11 00:00:31 fb1 daqd_fw[16149]: daqd_fw: LDASUnexpected.cc:131: static void LDASTools::Error::LDASUnexpected::unexpected(): Assertion `false' failed.
Aug 11 00:00:32 fb1 systemd[1]: daqd_fw.service: main process exited, code=killed, status=6/ABRT
Aug 11 00:00:32 fb1 systemd[1]: Unit daqd_fw.service entered failed state.
Aug 11 00:00:32 fb1 systemd[1]: daqd_fw.service holdoff time over, scheduling restart.
Aug 11 00:00:32 fb1 systemd[1]: Stopping Advanced LIGO RTS daqd frame writer...
Aug 11 00:00:32 fb1 systemd[1]: Starting Advanced LIGO RTS daqd frame writer...
Aug 11 00:00:32 fb1 systemd[1]: daqd_fw.service start request repeated too quickly, refusing to start.
Aug 11 00:00:32 fb1 systemd[1]: Failed to start Advanced LIGO RTS daqd frame writer.
Aug 11 00:00:32 fb1 systemd[1]: Unit daqd_fw.service entered failed state.

Oddly, I am able to access second trends for the same channels from the past which will be useful for the MC1 debugging). Not sure whats going on.


The live data grabbing using cdsutils still seems to be working though - so I've kicked MC1 again, and am grabbing 2 hours of data live on Pianosa.

So we tried this again with a fresh build of daqd_fw, and it still fails.  The error message is pointing to an underlying bug in the framecpp library ("LDASTools"), which may be tricky to solve.  I'm rustling the appropriate bushes...

  13198   Fri Aug 11 19:34:49 2017 JamieUpdateCDSCDS final bits status update

So it appears we now have full frames and second, minute, and minute_raw trends.

We are still not able to raise test points with daqd_rcv (e.g. the NDS1 server), which is why dataviewer and nds2-client can't get test points on their own.

We were not able to add the EDCU (EPICS client) channels without daqd_fw crashing.

We have a new kernel image that's supposed to solve the module unload instability issue.  In order to try it we'll need to restart the entire system, though, so I'll do that on Monday morning.

I've got the CDS guys investigating the test point and EDCU issues, but we won't get any action on that until next week.

Quote:

Remaining unresolved issues:

  • IFO needs to be fully locked to make sure ALL components of all models are working.
  • The remaining red status lights are from the "FB NET" diagnostics, which are reflecting a missing status bit from the front end processes due to the fact that they were compiled with an earlier RCG version (3.0.3) than the mx_streams were (3.3+/trunk).  There will be a new release of the RTS soon, at which point we'll compile everything from the same version, which should get us all green again.
  • The entire system has been fully modernized, to the target CDS reference OS (Debian jessie) and more recent RCG versions.  The management of the various RTS components, both on the front ends and on fb, have as much as possible been updated to use the modern management tools (e.g. systemd, udev, etc.).  These changes need to be documented.  In particular...
  • The fb daqd process has been split into three separate components, a configuration that mirrors what is done at the sites and appears to be more stable: The "target" directory for all of these components is now:
    • daqd_dc: data concentrator (receives data from front ends)
    • daqd_fw: receives frames from dc and writes out full frames and second/minute trends
    • daqd_rcv: NDS1 server (raises test points and receives archive data from frames from 'nds' process)
    The "target" directory for all of these new components is:
    • /opt/rtcds/caltech/c1/target/daqd
    All of these processes are now managed under systemd supervision on fb, meaning the daqd restart procedure has changed.  This needs to be simplified and clarified.
  • Second trend frames are being written, but for some reason they're not accessible over NDS.
  • Have not had a chance to verify minute trend and raw minute trend writing yet.  Needs to be confirmed.
  • Get wiper script working on new fb.
  • Front end RTS kernel will occaissionally crash when the RTS modules are unloaded.  Keith Thorne apparently has a kernel version with a different set of patches from Gerrit Kuhn that does not have this problem.  Keith's kernel needs to be packaged and installed in the front end diskless root.
  • The models accessing the dolphin shared memory will ALL crash when one of the front end hosts on the dolphin network goes away.  This results in a boot fest of all the dolphin-enabled hosts.  Need to figure out what's going on there.
  • The RCG settings snapshotting has changed significantly in later RCG versions.  We need to make sure that all burt backup type stuff is still working correctly.
  • Restoration of /frames from old fb SCSI RAID?
  • Backup of entirety of fb1, including fb1 root (/) and front end diskless root (/diskless)
  • Full documentation of rebuild procedure from Jamie's notes.
  13205   Mon Aug 14 19:41:46 2017 JamieUpdateCDSfront-end/DAQ network down for kernel upgrade, and timing errors

I'm upgrading the linux kernel for all the front ends to one that is supposedly more stable and won't freeze when we unload RTS models (linux-image-3.2.88-csp).  Since it's a different kernel version it requires rebuilds of all kernel-related support stuff (mbuf, symmetricom, mx, open-mx, dolphin) and all the front end models.  All the support stuff has been upgraded, but we're now waiting on the front end rebuilds, which takes a while.

Initial testing indicates that the kernel is more stable; we're mostly able to unload/reload RTS modules without the kernel freezing.  However, the c1iscey host seems to be oddly problematic and has frozen twice so far on module unloads.  None of the other hosts have frozen on unload (yet), though, so still not clear.

We're now seeing some timing errors between the front ends and daqd, resulting in a "0x4000" status message in the 'C1:DAQ-DC0_*_STATUS' channels.  Part of the problem was an issue with the IRIG-B/GPS receiver timing unit, which I'll log in a separate post.  Another part of the problem was a bug in the symmetricom driver, which has been resolved.  That wasn't the whole problem, though, since we're still seeing timing errors.  Working with Jonathan to resolve.

  13207   Mon Aug 14 20:12:09 2017 Jamie, GautumUpdateCDSWeird problem with GPS receiver

Today we saw a weird issue with the GPS receiver (EndRun Technologies Tempus LX).  GPS timing on fb1 (which is handled via IRIG-B connection to the receiver with a spectracom card) was off by +18 seconds.  We tried resetting the GPS receiver and it still came up with +18 second offset.  To be clear, the GPS receiver unit itself was showing a time on it's front panel that looked close enough to 24-hour UTC, but was off by +18s.  The time also said "GPS" vertically to the right of the time.

We started exploring the settings on the GPS receiver and found this menu item:

Clock -> "Time Mode" -> "UTC"/"GPS"/"Local"

The setting when we found it was "GPS", which seems logical enough.  However, when we switched it to "UTC" the time as shown on the front panel was correct, now with "UTC" vertically to the right of the time, and fb1 was then showing the correct GPS time.

From the manual:

Time Mode
Time mode defines the time format used for the front-panel time display and, if installed, the optional
time code or Serial Time output. The time mode does not affect the NTP output, which is always
UTC. Possible values for the time mode are GPS, UTC, and local time. GPS time is derived from
the GPS satellite system. UTC is GPS time minus the current leap second correction. Local time is
UTC plus local offset and Daylight Savings Time. The local offset and daylight savings time displays
are described below.

The fact that moving to "UTC" fixed the problem, even though that is supposed to remove the leap second correction, might indicate that there's another bug in the symmetricom driver...

  13211   Tue Aug 15 16:32:42 2017 Jamie, GautumUpdateCDSGPS receiver apparently set to correct mode as "UTC"
Quote:

The setting when we found it was "GPS", which seems logical enough.  However, when we switched it to "UTC" the time as shown on the front panel was correct, now with "UTC" vertically to the right of the time, and fb1 was then showing the correct GPS time.

From Keith Thorne:

In the GPS receiver, you are trying to match the IRIG-B output format that is created by the aLIGO IRIG-B Fanout.  Since we have to prep the aLIGO IRIG-B Fanout every time there is a leap second coming, I would suspect that we are sending UTC to the IRIG-B receivers.  Thus, the GPS receiver needs to be set to that mode.

Soooo, "UTC" is the correct mode for the GPS receiver.

  13212   Wed Aug 16 14:54:13 2017 gautamUpdateCDSPSL monitoring Acromag EPICS server restarted

[johannes, gautam, jamie]

  • Made a directory /opt/rtcds/caltech/c1/scripts/Acromag/PSL where I copied over the files needed my modbusApp to start the server from Lydia's user directory
  • Edited /ligo/apps/ubuntu12/ligoapps-user-env.sh to export a couple of EPICS variables to facilitate easy startup of the EPICS server
  • Started a tmux session on (soon to be re-christened?) megatron called "acroEPICS"
  • Ran the following command to start up the EPICS server:
${EPICS_MODULES}/modbus/bin/${EPICS_HOST_ARCH}/modbusApp npro_config.cmd

To do:

  1. Make a startup script that runs the above command - eventually this can contain the initialization instructions for all the Acromags
  2. Figure out the initctl/systemctl stuff to make the server automatically restart if it drops for some reason (e.g. power failure)
  13215   Wed Aug 16 17:05:53 2017 JamieUpdateCDSfront-end/DAQ network down for kernel upgrade, and timing errors

The CDS system has now been up moved to a supposedly more stable real-time-patched linux kernel (3.2.88-csp) and RCG r4447 (roughly the head of trunk, intended to be release 3.4).  With one major and one minor exception, everything seems to be working:

The remaining issues are:

  • RFM network down.  The IOP models on all hosts on the RFM network are not detecting their RFM cards.  Keith Thorne thinks that this is because of changes in trunk to support the new long-range PCIe that will be used at the sites, and that we just need to add a new parameter to the cdsParameters block in models that use RFM.  Him and Rolf are looking into for us.
  • The 3.2.88-csp kernel is still not totally stable.  On most hosts (c1sus, c1ioo, c1iscex) it seems totally fine and we're able to load/unload models without issue.  c1iscey is definitely problematic, frequently freezing on module unload.  There must be a hardware/bios issue involved here.  c1lsc has also shown some problems.  A better kernel is supposedly in the works.
  • NDS clients other than DTT are still unable to raise test points.  This appears to be an issue with the daqd_rcv component (i.e. NDS server) not properly resolving the front ends in the GDS network.  Still looking into this with Keith, Rolf, and Jonathan.

Issues that have been fixed:

  • "EDCU" channels, i.e. non-front-end EPICS channels, are now being acquired properly by the DAQ.  The front-ends now send all slow channels to the daq over the MX network stream.  This means that front end channels should no longer be specified in the EDCU ini file.  There were a couple in there that I removed, and that seemed to fix that issue.
  • Data should now be recorded in all formats: full frames, as well as second, minute, and raw_minute trends
  • All FE and DAQD diagnostics are green (other than the ones indicating the problems with the RFM network).  This was fixed by getting the front ends models, mx_stream processes, and daqd processes all compiled against the same version of the advLigoRTS, and adding the appropriate command line parameters to the mx_stream processes.
  13216   Wed Aug 16 17:14:02 2017 KojiUpdateCDSfront-end/DAQ network down for kernel upgrade, and timing errors

What's the current backup situation?

  13217   Wed Aug 16 18:01:28 2017 JamieUpdateCDSfront-end/DAQ network down for kernel upgrade, and timing errors
Quote:

What's the current backup situation?

Good question.  We need to figure something out.  fb1 root is on a RAID1, so there is one layer of safety.  But we absolutely need a full backup of the fb1 root filesystem.  I don't have any great suggestions, other than just getting an external disk, 1T or so, and just copying all of root (minus NFS mounts).

  13218   Wed Aug 16 18:06:01 2017 KojiUpdateCDSfront-end/DAQ network down for kernel upgrade, and timing errors

We also need to copy chiara's root. What is the best way to get the full image of the root FS?
We may need to restore these root images to a different disk with a different capacity.

Is the dump command good for this?

  13219   Wed Aug 16 18:50:58 2017 JamieUpdateCDSfront-end/DAQ network down for kernel upgrade, and timing errors
Quote:

The remaining issues are:

  • RFM network down.  The IOP models on all hosts on the RFM network are not detecting their RFM cards.  Keith Thorne thinks that this is because of changes in trunk to support the new long-range PCIe that will be used at the sites, and that we just need to add a new parameter to the cdsParameters block in models that use RFM.  Him and Rolf are looking into for us.

RFM network is back!  Everything green again.

Use of RFM has been turned off in adLigoRTS trunk in favor of the new long-range PCIe networking being developed for the sites.  Rolf provided a single-line patch that re-enables it:

controls@c1sus:/opt/rtcds/rtscore/trunk 0$ svn diff
Index: src/epics/util/feCodeGen.pl
===================================================================
--- src/epics/util/feCodeGen.pl    (revision 4447)
+++ src/epics/util/feCodeGen.pl    (working copy)
@@ -122,7 +122,7 @@
 $diagTest = -1;
 $flipSignals = 0;
 $virtualiop = 0;
-$rfm_via_pcie = 1;
+$rfm_via_pcie = 0;
 $edcu = 0;
 $casdf = 0;
 $globalsdf = 0;
controls@c1sus:/opt/rtcds/rtscore/trunk 0$

This patched was applied to RTS source checkout we're using for the FE builds (/opt/rtcds/rtscore/trunk, which is r4447, and is linked to /opt/rtcds/rtscore/release).  The following models that use RFM were re-compiled, re-installed, and re-started:

  • c1x02
  • c1rfm
  • c1x03
  • c1als
  • c1x01
  • c1scx
  • c1asx
  • c1x05
  • c1scy
  • c1tst

The re-compiled models now see the RFM cards (dmesg log from c1ioo):

[24052.203469] c1x03: Total of 4 I/O modules found and mapped
[24052.203471] c1x03: ***************************************************************************
[24052.203473] c1x03: 1 RFM cards found
[24052.203474] c1x03:     RFM 0 is a VMIC_5565 module with Node ID 180
[24052.203476] c1x03: address is 0xffffc90021000000
[24052.203478] c1x03: ***************************************************************************

This cleared up all RFM transmission error messages.

CDS upstream are working to make this RFM usage switchable in a reasonable way.

  13249   Thu Aug 24 17:36:11 2017 gautamUpdateCDSFSS Slow Python maintenance

A couple of weeks ago, I was trying to modernize the python version of the FSS Slow temperature control loops, when I accidentally ended up deleting it frown. There was no svn backup. So the old Perl PID script has been running for the last few days.

Today, I checked out the latest version that Andrew and co. have running in the PSL lab. I had to make some important modifications for the script to work for the 40m setup.

  1. The script is conveniently setup in a way that the channels it needs to read from / write to are read in from an .ini file. I renamed all the channels to match the appropriate 40m ones.
  2. We don't have a soft epics channel in which to define the setpoint for our PID servo (which is 0). Rather than poke around with slow machine EPICS records, I simply commented out this line in the script and included the hard-coded value of 0. When we modernize to the Acromag era, we can setup an EPICS channel + MEDM slider for the setpoint.
  3. The way the Perl script was setup, the error signal was pre-scaled by a factor of 0.01, supposedly to make the PID gains be of order 1. For consistency, I re-inserted this scaling, which awade and co. had removed.
  4. Modified the FSSslowPy.init file to call the script in accordance with the new syntax:
python FSSSlow.py -i FSSSlowPy.ini

Then I stopped the Perl process on megatron by running

sudo initctl stop FSSslow

and started the Python process by running

sudo initctl start FSSslowPy

I have now committed the files FSSSlow.py and FSSSlowPy.ini to the 40m svn.  Things seem to be stable for the last 20 mins or so, let's keep an eye on this though - although we had been running the Python PID loop for some months, this version is a slightly modified one. 

The initctl stuff still isn't very robust - I think both the Autolocker and the FSS slow servos have to be manually restarted if megatron is shutdown/restarted for whatever reason. It doesn't seem to be a problem with the initctl routine itself - looking at the logs, I can see that init is trying to start both processes, but is failing to do so each time. To be investigated. The wiki procedure to restart this process is up to date.

GV Edit 0000 25 Aug 2017: I had to add a line to the script that checks MC transmission before enabling the PID loop. Change has been committed to svn. Now, when the MC loses lock or if the PSL shutter is kept closed for an extended period of time, the temperature loop doesn't rail.

  13262   Mon Aug 28 16:20:00 2017 gautamUpdateCDS40m files backup situation

This elog is meant to summarize the current backup situation of critical 40m files.

What are the critical filesystems? I've also indicated the size of these disks and the volume currently used, and the current backup situation. 

Name

Disk Usage

Description / remarks

Current backup status

FB1 root filesystem 1.7TB / 2TB
  • FB1 is the machine that hosts the diskless root for the front end machines
  • Additionally, it runs the daqd processes which write data from realtime models into frame files
Not backed up
/frames up to 24TB
  • This is where the frame files are written to 
  • Need to setup a wiper script that periodically clears older data so that the disk doesn't overflow.

Not backed up 

LDAS pulls files from nodus daily via rsync, so there's no cron job for us to manage. We just allow incoming rsync.

Shared user area 1.6TB / 2TB
  • /home/cds on chiara
  • This is exported over NFS to 40m workstations, FB1 etc.
  • Contains user directories, scripts, realtime models etc.

Local backup on /media/40mBackup on chiara via daily cronjob

Remote backup to ldas-cit.ligo.caltech.edu::40m/cvs via daily cronjob on nodus

Chiara root filesystem 11GB / 440GB
  • This is the root filesystem for chiara
  • Contains nameserver stuff for the martian network, responsible for rsyncing /home/cds
Not backed up
Megatron root filesystem 39GB / 130GB
  • Boot disk for megatron, which is our scripts machine
  • Runs MC autolocker, FSS loops etc.
  • Also is the nds server for facilitating data access from outside the martian network
Not backed up
Nodus root filesystem 77GB / 355GB
  • This is the boot disk for our gateway machine
  • Hosts Elog, svn, wikis
  • Supposed to be responsible for sending email alerts for NFS disk usage and vacuum system N2 pressure
Not backed up
JETSTOR RAID Array 12TB / 13TB
  • Old /frames
  • Archived frames from DRFPMI locks
  • Long term trends

Currently mounted on Megatron, not backed up.

Then there is Optimus, but I don't think there is anything critical on it. 

So, based on my understanding, we need to back up a whole bunch of stuff, particularly the boot disks and root filesystems for Chiara, Megatron and Nodus. We should also test that the backups we make are useful (i.e. we can recover current operating state in the event of a disk failure).

Please edit this elog if I have made a mistake. I also don't have any idea about whether there is any sort of backup for the slow computing system code.

  13263   Mon Aug 28 17:13:57 2017 ericqUpdateCDS40m files backup situation

In addition to bootable full disk backups, it would be wise to make sure the important service configuration files from each machine are version controlled in the 40m SVN. Things like apache files on nodus, martian hosts and DHCP files on chiara, nds2 configuration and init scripts on megatron, etc. This can make future OS/hardware upgrades easier too.

  13273   Wed Aug 30 10:54:26 2017 gautamUpdateCDSslow machine bootfest

MC autolocker and FSS loops were stuck because c1psl was unresponsive. I rebooted it and did a burtrestore to enable PSL locking. Then the IMC locked fine.

c1susaux and c1iscaux were also unresponsive so I keyed those crates as well, after taking the usual steps to avoid ITMX getting stuck - but it still got stuck when the Sat. Box. connectors were reconnected after the reboot, so I had to shake it loose with bias slider jiggling. This is annoying and also not very robust. I am afraid we are going to knock the ITMX magnets off at some point. Is this problem indicative of the fact that the ITMX magnets were somehow glued on in a skewed way? Or can we make the situation better by just tweaking the OSEM-holding fixtures on the cage?

In any case, I've started listing stuff down here for things we may want to do when we vent next.

 

  13279   Thu Aug 31 00:46:57 2017 ranaSummaryCDSallegra -> Scientific Linux 7.3

I made a 'LiveCD' on a 16 GB USB stick using this command after the GUIs didn't work and looking at some blog posts:

sudo dd if=SL-7.3-x86_64-2017-01-20-LiveCD.iso of=/dev/sdf

Quote:

Debian doesn't like EPICS. Or our XY plots of beam spots...Sad!

Quote:
Quote:

No, not confused on that point. We just will not be testing OS versions at the 40m or running multiple OS's on our workstations. As I've said before, we will only move to so-called 'reference' systems once they've been in use for a long time.

Ubuntu16 is not to my knowledge used for any CDS system anywhere.  I'm not sure how you expect to have better support for that.  There are no pre-compiled packages of any kind available for Ubuntu16.  Good luck, you big smelly doofuses. Nyah, nyah, nyah.

K Thorne recommends that we use SL7.3 with the 'xfce' window manager instead of the Debian family of products, so we'll try it out on allegra and rossa to see how it works for us. Hopefully the LLO CDS team will be the tip of the spear on solving the usual software problems we have when we "~up" grade.

  13282   Thu Aug 31 18:36:23 2017 gautamUpdateCDSrevisiting Acromag

Current status:

  • There is a single Acromag ADC unit installed in 1X4
  • It is presently hooked up to the PSL NPRO diagnostic connector channels
  • I had (re)-started the acquisiton of these channels on August 16 - but for reasons unknown, the tmux session that was supposed to be running the EPICS server on megatron seems to have died on August 22 (judging by the trend plot of these channels, see Attachment #1)
  • I had not set up an upstart job that restarts the server automatically in such an event. I manually restarted it for now, following the same procedure as linked in my previous elog.
  • While I was at it, I also took the opportunity to edit the Acromag channel names to something more appropriate - all channels previously prefixed with C1:ACRO- have now been prefixed with C1:PSL-

Plan of action:

  1. Hardware - we have, in the lab, in addition to the installed ADC unit
    • 3x 8 channel differential input ADC units
    • 2x 8 channel differential output DAC units
    • 1x 16 channel BIO unit
    • 2U chassis + connectors + breakout boards + other misc hardware that I think Johannes and Lydia procured with the original plan to replace the EX slow controls.
    • Some relevant elogs: Panel designs, breakout design, sketch for proposed layout, preliminary channel list.
      So on the hardware side, it would seem that we have everything we need to go ahead with replacing the EX slow controls with an Acromag system, although Johannes probably knows more about our state of readiness from a hardware PoV.
  2. Software
    • We probably want to get a dedicated machine that will handle the EPICS channel serving for the Acromag system
    • Have to figure out the networking arrangement for such a machine
    • Have to figure out how to set up the EPICS server protocol in such a way that if it drops for whatever reason, it is automatically restarted

 

Attachment 1: Acromag_EPICS.png
Acromag_EPICS.png
  13293   Tue Sep 5 14:41:58 2017 gautamUpdateCDSNDS2 server restarted on megatron

I was unable to download data using nds2. Gabriele had reported similar problems a week ago but I hadn't followed up on this.

I repeated steps 5-7 from elog 13161, and now it seems that I can get data from the nds2 servers again. Unclear why the nds2 server had to be restarted. I wonder if this is somehow related to the mysterious acromag EPICS server tmux session dropout.

  13297   Tue Sep 5 23:02:37 2017 gautamUpdateCDSslow machine bootfest

MC autolocker was not working - PCdrive was railed at its upper rail for ~2 hours judging by the wall StripTool trace. I tried restarting the init processes on megatron, but that didn't fix the problem. The reason seems to have been related to c1iool0 failing - after keying the crate, autolocker came back fine and MC caught lock almost immediately.

Additionally, c1susaux, c1auxex,c1auxey and c1iscaux are also down. I'm not planning on using the IFO tonight so I am not going to reboot these now.

 

  13312   Fri Sep 15 15:54:28 2017 gautamUpdateCDSFB wiper script

A wiper script is not yet set up for our new Frame-Builder. The disk usage is ~80% now, so I think we should start running a wiper script that manages overall disk usage and deletes old frame files to this end.

From what I could find on the elog, the way this was done was by running a cron job on FB. There is a perl script, /opt/rtcds/caltech/c1/target/fb/wiper.pl, which from what I could understand, runs a bunch of du commands on different directories to determine if there is a need to delete any files.

I copied this script over to /opt/rtcds/caltech/c1/target/daqd/wiper.pl. This is the directory in which all the new FB stuff resides. Conveniently, the script has a "dry-run" option, which I tried running on FB1. However, I get the following error message:

Fri Sep 15 15:44:45 PDT 2017
Dry run, will not remove any files!!!
You need to rerun this with --delete argument to really delete frame files
Directory disk usage:
 /frames/trend/minute_rawk
Combined 0k or 0m or 0Gb
Illegal division by zero at ./wiper.pl line 98.


So it would seem that for some reason, the du commands aren't working. From what I could tell, there aren't any directory paths specific to the old FB machine that need to be changed. I believe the script was working prior to the FB disk crash - unfortunately it doesn't look like this script was under version control but I don't think any changes have been made to this script.

Before I go down a Perl rabbit hole, has anyone seen such an error or is aware of some reason why this might not work on the new FB? Am I even using the correct scripts?

  13317   Mon Sep 18 17:17:49 2017 gautamUpdateCDSFB wiper script

After trying to debug this issue using the Perl debugger, I concluded that the problem is in the part of the code that splits the output of the "du" command into directory and disk usage. For whatever, reason, this isn't working. The version of perl running on the new FB1 machine is 5.20.2, whereas I suspect the version running on the old FB machine was 5.14.2 (which is the version on all the Ubuntu 12 workstations and megatron). Unclear whether downgrading the Perl version is the right way to go.

The FB1 disk is now getting close to full, the usage is up to 85% today.

Quote:

Before I go down a Perl rabbit hole, has anyone seen such an error or is aware of some reason why this might not work on the new FB? Am I even using the correct scripts?

 

  13318   Mon Sep 18 17:30:54 2017 ChrisUpdateCDSFB wiper script

Attached is the version of the wiper script we use on the CryoLab cymac. It works with perl v5.20.2. Is this different from what you have?

Attachment 1: wiper.pl
#!/usr/bin/perl
use File::Basename;

print "\n" .  `date` . "\n";
# Dry run, do not delete anything
$dry_run = 1;

if ($ARGV[0] eq "--delete") { $dry_run = 0; }

print "Dry run, will not remove any files!!!\n" if $dry_run;
... 184 more lines ...
  13319   Mon Sep 18 17:51:26 2017 gautamUpdateCDSFB wiper script

It is a little different - specifically, the way the splitting of the output of the "du" command into disk usage and directory is different (see Attachment #1). Apart from this, some of the parameters (e.g. what percentage to keep free) are different.

I changed the percentages to match what we had here, and edited a couple of other lines to print out the files that will be deleted. The dry run seemed to work okay, it produced the output below. Not sure why "df -h" reports a different use percentage though...

Since the script seems to be working now, I am going to set it up on FB1's crontab. Thanks Chris!.

controls@fb1:/opt/rtcds/caltech/c1/target/daqd 0$ ./wiper.pl
Mon Sep 18 17:47:06 PDT 2017
Dry run, will not remove any files!!!
You need to rerun this with --delete argument to really delete frame files
Directory disk usage:
/frames/trend/minute_raw 47126124k
/frames/trend/minute 22900668k
/frames/trend/second 760359168k
/frames/full 19337278516k
Combined 20167664476k or 19694984m or 19233Gb
/frames size 25097525144k at 80.36%
/frames is below keep value of 85.00%
Will not delete any files
df reported usage 80.36%
controls@fb1:/opt/rtcds/caltech/c1/target/daqd 0$ df -h
Filesystem                        Size  Used Avail Use% Mounted on
/dev/sda4                         2.0T  1.7T  152G  92% /
udev                               10M     0   10M   0% /dev
tmpfs                              13G  177M   13G   2% /run
tmpfs                              32G     0   32G   0% /dev/shm
tmpfs                             5.0M     0  5.0M   0% /run/lock
tmpfs                              32G     0   32G   0% /sys/fs/cgroup
/dev/sda2                          19G  3.7G   14G  21% /var
/dev/sda1                         461M   65M  373M  15% /boot
/dev/sdb1                          24T   19T  3.5T  85% /frames
192.168.113.104:/home/cds/rtcds   2.0T  1.6T  291G  85% /opt/rtcds
192.168.113.104:/home/cds/rtapps  2.0T  1.6T  291G  85% /opt/rtapps
tmpfs                             6.3G     0  6.3G   0% /run/user/1001
Quote:

Attached is the version of the wiper script we use on the CryoLab cymac. It works with perl v5.20.2. Is this different from what you have?

 

Attachment 1: perlDiff.png
perlDiff.png
  13320   Mon Sep 18 18:40:34 2017 gautamUpdateCDSFB wiper script

I did a further check on the wiper script by changing the "percent_keep" from 85.0 to 75.0, and running the script in "dry_run" mode again. The script then output to console the names of all the files it would delete in order to free up the required amount of space (but didn't actually delete any files as it was a dry run). Seemed to be sensible.

To set up the cron job, I did the following on FB1:

  • crontab -e opened up the crontab
  • Copied over a script called "wiper.cron" from /opt/rtcds/caltech/c1/target/fb to /opt/rtcds/caltech/c1/target/daqd. This essentially contains a bunch of instructions to run the wiper script with the --delete flag, and write the console output to a log file.
  • Added the following line: 33 3 * * * /opt/rtcds/caltech/c1/target/daqd/wiper.cron. So the cron job should be executed at 3:33AM everyday.
  • The cron daemon seems to be running - sudo systemctl status cron.service yields the following output:
    controls@fb1:~ 0$ sudo systemctl status cron.service
    ● cron.service - Regular background program processing daemon
       Loaded: loaded (/lib/systemd/system/cron.service; enabled)
       Active: active (running) since Mon 2017-09-18 18:16:58 PDT; 27min ago
         Docs: man:cron(8)
     Main PID: 30183 (cron)
       CGroup: /system.slice/cron.service
               └─30183 /usr/sbin/cron -f
    Sep 18 18:16:58 fb1 cron[30183]: (CRON) INFO (Skipping @reboot jobs -- not system startup)
    Sep 18 18:17:01 fb1 CRON[30205]: pam_unix(cron:session): session opened for user root by (uid=0)
    Sep 18 18:17:01 fb1 CRON[30206]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
    Sep 18 18:17:01 fb1 CRON[30205]: pam_unix(cron:session): session closed for user root
    Sep 18 18:25:01 fb1 CRON[30820]: pam_unix(cron:session): session opened for user root by (uid=0)
    Sep 18 18:25:01 fb1 CRON[30821]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
    Sep 18 18:25:01 fb1 CRON[30820]: pam_unix(cron:session): session closed for user root
    Sep 18 18:35:01 fb1 CRON[31515]: pam_unix(cron:session): session opened for user root by (uid=0)
    Sep 18 18:35:01 fb1 CRON[31516]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
    Sep 18 18:35:01 fb1 CRON[31515]: pam_unix(cron:session): session closed for user root

     

  • crontab -l on FB1 now shows the following:
    controls@fb1:~ 0$ crontab -l
    # Edit this file to introduce tasks to be run by cron.
    #
    # Each task to run has to be defined through a single line
    # indicating with different fields when the task will be run
    # and what command to run for the task
    #
    # To define the time you can provide concrete values for
    # minute (m), hour (h), day of month (dom), month (mon),
    # and day of week (dow) or use '*' in these fields (for 'any').#
    # Notice that tasks will be started based on the cron's system
    # daemon's notion of time and timezones.
    #
    # Output of the crontab jobs (including errors) is sent through
    # email to the user the crontab file belongs to (unless redirected).
    #
    # For example, you can run a backup of all your user accounts
    # at 5 a.m every week with:
    # 0 5 * * 1 tar -zcf /var/backups/home.tgz /home/
    #
    # For more information see the manual pages of crontab(5) and cron(8)
    #
    # m h  dom mon dow   command
    33 3 * * * /opt/rtcds/caltech/c1/target/daqd/wiper.cron

Let's see if this works.

Quote:

Since the script seems to be working now, I am going to set it up on FB1's crontab. Thanks Chris!.

 

  13331   Tue Sep 26 13:40:45 2017 gautamUpdateCDSNDS2 server restarted on megatron

Gabriele reported problems with the nds2 server again. I restarted it again.

update: had to do it again at 1730 today - unclear why nds2 is so flaky. Log files don't suggest anything obvious to me...

Quote:

I was unable to download data using nds2. Gabriele had reported similar problems a week ago but I hadn't followed up on this.

I repeated steps 5-7 from elog 13161, and now it seems that I can get data from the nds2 servers again. Unclear why the nds2 server had to be restarted. I wonder if this is somehow related to the mysterious acromag EPICS server tmux session dropout.

 

  13332   Tue Sep 26 15:55:20 2017 gautamUpdateCDS40m files backup situation

Backups of the root filesystems of chiara and nodus are underway right now. I am backing them up to the 1 TB LaCie external hard drives we recently acquired.

I first initialized the drives by hooking them up to my computer and running the setup.app file. After this, plugging the drive into the respective machine and running lsblk, I was able to see the mount point of the external drive. To actually initialize the backup, I ran the following command from a tmux session called ddBackupLaCie:

sudo dd if=/dev/sda of=/dev/sdb bs=64K conv=noerror,sync

Here, /dev/sda is the disk with the root filesystem, and /dev/sdb is the external hard-drive. The installed version of dd is 8.13, and from version 8.21 onwards, there is a progress flag available, but I didn't want to go through the exercise of upgrading coreutils on multiple machines, so we just have to wait till the backup finishes.

We also wanted to do a backup of the root of FB1 - but I'm not sure if dd will work with the external hard drive, because I think it requires the backup disk size (for us, 1TB) to be >= origin disk size (which on FB1, according to df -h, is 2TB). Unsure why the root filesystem of FB is so big, I'm checking with Jamie what we expect it to be. Anyways we have also acquired 2TB HGST SATA drives, which I will use if the LaCie disks aren't an option.

 

  13339   Thu Sep 28 10:33:46 2017 gautamUpdateCDS40m files backup situation

After consulting with Jamie, we reached the conclusion that the reason why the root of FB1 is so huge is because of the way the RAID for /frames is setup. Based on my googling, I couldn't find a way to exclude the nfs stuff while doing a backup using dd, which isn't all that surprising because dd is supposed to make an exact replica of the disk being cloned, including any empty space. So we don't have that flexibility with dd. The advantage of using dd is that if it works, we have a plug-and-play clone of the boot disk and root filesystem which we can use in the event of a hard-disk failure.

  1. One option would be to stop all the daqd processes, unmount /frames, and then do a dd backup of the true boot disk and root filesystem.
  2. Another option would be to use rsync to do the backup - this way we can selectively copy the files we want and ignore the nfs stuff. I suspect this is what we will have to do for the second layer of backup we have planned, which will be run as a daily cron job. But I don't think this approach will give us a plug-and-play replacement disk in the event of a disk failure.
  3. Third option is to use one of the 2TB HGST drives, and just do a dd backup - some of this will be /frames, but that's okay I guess.

I am trying option 3 now. dd however does requrie that the destination drive size be >= source drive size - I'm not sure if this is true for the HGST drives. lsblk suggests that the drive size is 1.8TB, while the boot disk, /dev/sda, is 2TB. Let's see if it works.

Backup of chiara is done. I checked that I could mount the external drive at /mnt and access the files. We should still do a check of trying to boot from the LaCie backup disk, need another computer for that.

nodus backup is still not complete according to the console - there is no progress indicator so we just have to wait I guess.

Quote:

Backups of the root filesystems of chiara and nodus are underway right now. I am backing them up to the 1 TB LaCie external hard drives we recently acquired.

We also wanted to do a backup of the root of FB1 - but I'm not sure if dd will work with the external hard drive, because I think it requires the backup disk size (for us, 1TB) to be >= origin disk size (which on FB1, according to df -h, is 2TB). Unsure why the root filesystem of FB is so big, I'm checking with Jamie what we expect it to be. Anyways we have also acquired 2TB HGST SATA drives, which I will use if the LaCie disks aren't an option.

 

 

  13340   Thu Sep 28 11:13:32 2017 jamieUpdateCDS40m files backup situation

 

Quote:

After consulting with Jamie, we reached the conclusion that the reason why the root of FB1 is so huge is because of the way the RAID for /frames is setup. Based on my googling, I couldn't find a way to exclude the nfs stuff while doing a backup using dd, which isn't all that surprising because dd is supposed to make an exact replica of the disk being cloned, including any empty space. So we don't have that flexibility with dd. The advantage of using dd is that if it works, we have a plug-and-play clone of the boot disk and root filesystem which we can use in the event of a hard-disk failure.

  1. One option would be to stop all the daqd processes, unmount /frames, and then do a dd backup of the true boot disk and root filesystem.
  2. Another option would be to use rsync to do the backup - this way we can selectively copy the files we want and ignore the nfs stuff. I suspect this is what we will have to do for the second layer of backup we have planned, which will be run as a daily cron job. But I don't think this approach will give us a plug-and-play replacement disk in the event of a disk failure.
  3. Third option is to use one of the 2TB HGST drives, and just do a dd backup - some of this will be /frames, but that's okay I guess.

This is not quite right.  First of all, /frames is not NFS.  It's a mount of a local filesystem that happens to be on a RAID.  Second, the frames RAID is mounted at /frames.  If you do a dd of the underlying block device (in this case /dev/sda*, you're not going to copy anything that's mounted on top of it.

What i was saying about /frames is that I believe there is data in the underlying directory /frames that the frames RAID is mounted on top of.  In order to not get that in the copy of /dev/sda4 you would need to unmount the frames RAID from /frames, and delete everything from the /frames directory.  This would not harm the frames RAID at all.

But it doesn't really matter because the backup disk has space to cover the whole thing so just don't worry about it.  Just dd /dev/sda to the backup disk and you'll just be copying the root filesystem, which is what we want.

  13341   Thu Sep 28 23:32:38 2017 gautamHowToCDSpyawg

I've modified the __init.py__ file located at /ligo/apps/linux-x86_64/cdsutils-480/lib/python2.7/site-packages/cdsutils/__init__.py so that you can now simply import pyawg from cdsutils. On the control room workstations, iPython is set up such that cdsutils is automatically imported as "cds". Now this import also includes the pyawg stuff. So to use some pyawg function, you would just do (for example):

exc=cds.awg.ArbitraryLoop(excChan,excit,rate=fs)

One could also explicitly do the import if cdsutils isn't automatically imported:

from cdsutils import awg

pyawg-away!


Linking this useful instructional elog from Chris here: https://nodus.ligo.caltech.edu:8081/Cryo_Lab/1748

  13342   Thu Sep 28 23:47:38 2017 gautamUpdateCDS40m files backup situation

The nodus backup too is now complete - however, I am unable to mount the backup disk anywhere. I tried on a couple of different machines (optimus, chiara and pianosa), but always get the same error:

mount: unknown filesystem type 'LVM2_member'

The disk itself is being recognized, and I can see the partitions when I run lsblk, but I can't get the disk to actually mount.

Doing a web-search, I came across a few blog posts that look like the problem can be resolved using the vgchange utility - but I am not sure what exactly this does so I am holding off on trying.

To clarify, I performed the cloning by running

sudo dd if=/dev/sda of=/dev/sdb bs=64K conv=noerror,sync

in a tmux session on nodus (as I did for chiara and FB1, latter backup is still running). 

  13344   Fri Sep 29 09:43:52 2017 jamieHowToCDSpyawg

 

Quote:

I've modified the __init.py__ file located at /ligo/apps/linux-x86_64/cdsutils-480/lib/python2.7/site-packages/cdsutils/__init__.py so that you can now simply import pyawg from cdsutils. On the control room workstations, iPython is set up such that cdsutils is automatically imported as "cds". Now this import also includes the pyawg stuff. So to use some pyawg function, you would just do (for example):

exc=cds.awg.ArbitraryLoop(excChan,excit,rate=fs)

One could also explicitly do the import if cdsutils isn't automatically imported:

from cdsutils import awg

pyawg-away!


Linking this useful instructional elog from Chris here: https://nodus.ligo.caltech.edu:8081/Cryo_Lab/1748

?  Why aren't you able to just import 'awg' directly?  You shouldn't have to import it through cdsutils.  Something must be funny with the config.

  13345   Fri Sep 29 11:07:16 2017 gautamUpdateCDS40m files backup situation

The FB1 dd backup process seems to have finished too - but I got the following message:

dd: error writing ‘/dev/sdc’: No space left on device
30523666+0 records in
30523665+0 records out
2000398934016 bytes (2.0 TB) copied, 50865.1 s, 39.3 MB/s

Running lsblk shows the following:

controls@fb1:~ 32$ lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sdb      8:16   0 23.5T  0 disk
└─sdb1   8:17   0 23.5T  0 part /frames
sda      8:0    0    2T  0 disk
├─sda1   8:1    0  476M  0 part /boot
├─sda2   8:2    0 18.6G  0 part /var
├─sda3   8:3    0  8.4G  0 part [SWAP]
└─sda4   8:4    0    2T  0 part /
sdc      8:32   0  1.8T  0 disk
├─sdc1   8:33   0  476M  0 part
├─sdc2   8:34   0 18.6G  0 part
├─sdc3   8:35   0  8.4G  0 part
└─sdc4   8:36   0  1.8T  0 part

While I am able to mount /dev/sdc1, I can't mount /dev/sdc4, for which I get the error message

controls@fb1:~ 0$ sudo mount /dev/sdc4 /mnt/HGSTbackup/
mount: wrong fs type, bad option, bad superblock on /dev/sdc4,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail or so.

Looking at dmesg, it looks like this error is related to the fact that we are trying to clone a 2TB disk onto a 1.8TB disk - it complains about block size exceeding device size.

So if we either have to get a larger disk (4TB?) to do the dd backup, or do the backing up some other way (e.g. unmount /frames RAID, delete everything in /frames, and then do dd, as Jamie suggested). If I understand correctly, unmounting /frames RAID will require that we stop all the daqd processes for the duration of the dd backup

Quote:
 
sudo dd if=/dev/sda of=/dev/sdb bs=64K conv=noerror,sync

in a tmux session on nodus (as I did for chiara and FB1, latter backup is still running). 


Edit: unmounting /frames won't help, since dd makes a bit for bit copy of the drive being cloned. So we need a drive with size that is >= that of the drive we are trying to clone. On FB1, this is /dev/sda, which has a size of 2TB. The HGST drive we got has an advertised size of 2TB, but looks like actually only 1.8TB is available. So I think we need to order a 4TB drive.

  13349   Mon Oct 2 18:08:10 2017 gautamUpdateCDSc1ioo DC errors

I was trying to set up a DAC channel to interface with the AOM driver on the PSL table.

  • It would have been most convenient to use channels from c1ioo given proximity to the PSL table.
  • Looking at the 1X2 rack, it looked like there were indeed some spare DAC channels available.
  • So I thought I'd run a test by adding some TPs to the c1als model (because it seems to have the most head room in terms of CPU time used).
  • I added the DAC_0 block from CDS_PARTS library to c1als model (after confirming that the same part existed in the IOP model, c1x03).
  • Model recompiled fine (I ran rtcds make c1als and rtcds install c1als on c1ioo).
  • However, I got a bunch of errors when I tried to restart the model with rtcds restart c1als. The model itself never came up.
  • Looking at dmesg, I saw stuff like
    [4072817.132040] c1als: Failed to allocate DAC channel.
    [4072817.132040] c1als: DAC local 0 global 16 channel 4 is already allocated.
    [4072817.132040] c1als: Failed to allocate DAC channel.
    [4072817.132040] c1als: DAC local 0 global 16 channel 5 is already allocated.
    [4072817.132040] c1als: Failed to allocate DAC channel.
    [4072817.132040] c1als: DAC local 0 global 16 channel 6 is already allocated.
    [4072817.132040] c1als: Failed to allocate DAC channel.
    [4072817.132040] c1als: DAC local 0 global 16 channel 7 is already allocated.
    [4073325.317369] c1als: Setting stop_working_threads to 1
  • Looking more closely at the log messages, it seemed like rtcds could not find any DAC cards on c1ioo.
  • I went back to 1X2 and looked inside the expansion chassis. I could only find two ADC cards and 1 BIO card installed. The SCSI cable labelled ("DAC 0") running from the rear of the expansion chassis to the 1U SCSI->40pin IDE breakout chassis wasn't actually connected to anything inside the expansion chassis.
  • I then undid my changes (i.e. deleted all parts I added in the simulink diagram), and recompiled c1als.
  • This time the model came back up but I saw a "0x2000" error in the GDS overview MEDM screen.
  • Since there are no DACs installed in the c1ioo expansion chassis, I thought perhaps the problem had to do with the fact that there was a "DAC_0" block in the c1x03 simulink diagram - so I deleted this block, recompiled c1x03, and for good measure, restarted all (three) models on c1ioo.
  • Now, however, I get the same 0x2000 error on both the c1x03 and c1als GDS overview MEDM screens (see Attachment #1).
  • An elog search revealed that perhaps this error is related to DAQ channels being specified without recording rates (e.g. 16384, 2048 etc). There were a few DAQ channels inside c1als which didn't have recording rates specified, so I added the rates, and restarted the models, but the errors persist.
  • According to the RCG runtime diagnostics document, T1100625 (which admittedly is for RCG v 2.7 while we are running v3.4), this error has to do with a mismatch between the DAQ config files read by the RTS and the DAQD system, but I'm not sure how to debug this further.
  • I also suspect there is something wrong with the mx processes:
    controls@c1ioo:~ 130$ sudo systemctl status mx
    ● open-mx.service - LSB: starts Open-MX driver
       Loaded: loaded (/etc/init.d/open-mx)
       Active: failed (Result: exit-code) since Tue 2017-10-03 00:27:32 UTC; 34min ago
      Process: 29572 ExecStop=/etc/init.d/open-mx stop (code=exited, status=1/FAILURE)
      Process: 32507 ExecStart=/etc/init.d/open-mx start (code=exited, status=1/FAILURE)
    Oct 03 00:27:32 c1ioo systemd[1]: Starting LSB: starts Open-MX driver...
    Oct 03 00:27:32 c1ioo open-mx[32507]: Loading Open-MX driver (with  ifnames=eth1 )
    Oct 03 00:27:32 c1ioo open-mx[32507]: insmod: ERROR: could not insert module /opt/3.2.88-csp/open-mx-1.5.4/modules/3.2.88-csp/open-mx.ko: File exists
    Oct 03 00:27:32 c1ioo systemd[1]: open-mx.service: control process exited, code=exited status=1
    Oct 03 00:27:32 c1ioo systemd[1]: Failed to start LSB: starts Open-MX driver.
    Oct 03 00:27:32 c1ioo systemd[1]: Unit open-mx.service entered failed state.
  • Not sure if this is related to the DC error though.
Attachment 1: c1ioo_CDS_errors.png
c1ioo_CDS_errors.png
  13350   Mon Oct 2 18:50:55 2017 jamieUpdateCDSc1ioo DC errors
Quote:

 

  • This time the model came back up but I saw a "0x2000" error in the GDS overview MEDM screen.
  • Since there are no DACs installed in the c1ioo expansion chassis, I thought perhaps the problem had to do with the fact that there was a "DAC_0" block in the c1x03 simulink diagram - so I deleted this block, recompiled c1x03, and for good measure, restarted all (three) models on c1ioo.
  • Now, however, I get the same 0x2000 error on both the c1x03 and c1als GDS overview MEDM screens (see Attachment #1).

From page 21 of T1100625, DAQ status "0x2000" means that the channel list is out of sync between the front end and the daqd.  This usually happens when you add channels to the model and don't restart the daqd processes, which sounds like it might be applicable here.

It looks like open-mx is loaded fine (via "rtcds lsmod"), even though the systemd unit is complaining.  I think this is because the open-mx service is old style and is not intended for module loading/unloading with the new style systemd stuff.

  13351   Mon Oct 2 19:03:49 2017 gautamUpdateCDS[Solved] c1ioo DC errors

This did the trick - I simply ran

sudo systemctl restart daqd_*

on FB1, and now all the CDS overview lights are green again.

I thought I had done this already, but I realize that I was supposed to restart the daqd processes on FB1 (which is where they are running) and not on c1ioo frown.

Thanks Jamie for the speedy resolution!

Quote:
Quote:

 

  • This time the model came back up but I saw a "0x2000" error in the GDS overview MEDM screen.
  • Since there are no DACs installed in the c1ioo expansion chassis, I thought perhaps the problem had to do with the fact that there was a "DAC_0" block in the c1x03 simulink diagram - so I deleted this block, recompiled c1x03, and for good measure, restarted all (three) models on c1ioo.
  • Now, however, I get the same 0x2000 error on both the c1x03 and c1als GDS overview MEDM screens (see Attachment #1).

From page 21 of T1100625, DAQ status "0x2000" means that the channel list is out of sync between the front end and the daqd.  This usually happens when you add channels to the model and don't restart the daqd processes, which sounds like it might be applicable here.

It looks like open-mx is loaded fine (via "rtcds lsmod"), even though the systemd unit is complaining.  I think this is because the open-mx service is old style and is not intended for module loading/unloading with the new style systemd stuff.

 

Attachment 1: CDSoverview.png
CDSoverview.png
  13355   Tue Oct 3 19:39:10 2017 gautamUpdateCDSslow machine bootfest

Eurocrate key turning reboots for c1susaux, c1auxex,c1auxey, c1iscaux, c1iscaux2 and c1aux. Usual precautions were taken for ITMX. Did burtrestore for c1iscaux andc1iscaux2  in order to restore the LSC PD whitening gains.


Un-related to this work: input pointing into PMC was tweaked as the PMC_REFL spot was pretty bright.

  13356   Wed Oct 4 17:18:15 2017 gautamUpdateCDSFree DAC channels in c1lsc

There are at least 5 free DAC channels (4 if you discount the one channel from these that I am hijacking) available in the 1Y2 electronics rack.

Jamie's nice wiring diagram shows the topology - the actual DAC card sits in 1Y3 inside the c1lsc expansion chassis (while the c1lsc frontend itself is in 1X4). The output of the DAC goes via SCSI to an interface box (D080303) and then to some dewhitening/AI boards (D000316). There are a total of 16 DAC channels available, out of which 8 are used for the TTs, 2 are used for the DAFI model, and one is labeleld "From c1ioo 1X2" (I don't know what this one is for). So I'm going to use some of these channels for measuring the coupling of oscillator noise and intensity noise to MICH in the DRMI lock.

The de-whitening/AI board seems to be old - it has 2x 800Hz Butterworth LPFs and no notch for the clock frequency, but maybe this doesn't matter for the tests I have in mind. The AI board available on 1X2 is more modern but routing the DAC channels from 1Y2 to it is going to be some work.

I'm going to add my testpoint to c1daf given that it seems to be the least critical model on c1lsc.

EDIT: testpoints added to c1daf don't show up in the list of available channels - there was some issue with this model while we were getting the new RTCDS going. So I'm moving my temporary testpoint to c1cal instead.

  13360   Thu Oct 5 11:46:15 2017 gautamUpdateCDSslow machine bootfest

MC Autolocker was umnhappy because c1iool0 was unresponsive and hence it couldn't write to the "C1:IOO-MC_AUTOLOCK_BEAT" channel. I keyed the crate and IMC locked almost immediately. I'm moving this channel into the RTCDS model as we did for the IFO_STATE EPICS channel so that the autolocker isn't dependant on c1iool0 (which was the whole point of migrating the IFO-STATE variable anyways). I also commented out all of these channels in /cvs/cds/caltech/target/c1iool0/autolocker.db so that there aren't duplicate channels.

Quote:

Eurocrate key turning reboots for c1susaux, c1auxex,c1auxey, c1iscaux, c1iscaux2 and c1aux. Usual precautions were taken for ITMX. Did burtrestore for c1iscaux andc1iscaux2  in order to restore the LSC PD whitening gains.


Un-related to this work: input pointing into PMC was tweaked as the PMC_REFL spot was pretty bright.

 

  13361   Thu Oct 5 13:58:26 2017 gautamUpdateCDS40m files backup situation

The 4TB HGST drives have arrived. I've started the FB1 dd backup process. Should take a day or so.

Quote:
Edit: unmounting /frames won't help, since dd makes a bit for bit copy of the drive being cloned. So we need a drive with size that is >= that of the drive we are trying to clone. On FB1, this is /dev/sda, which has a size of 2TB. The HGST drive we got has an advertised size of 2TB, but looks like actually only 1.8TB is available. So I think we need to order a 4TB drive.

 

  13364   Fri Oct 6 12:46:17 2017 gautamUpdateCDS40m files backup situation

Looks to have worked this time around.

controls@fb1:~ 0$ sudo dd if=/dev/sda of=/dev/sdc bs=64K conv=noerror,sync
33554416+0 records in
33554416+0 records out
2199022206976 bytes (2.2 TB) copied, 55910.3 s, 39.3 MB/s
You have new mail in /var/mail/controls

I was able to mount all the partitions on the cloned disk. Will now try booting from this disk on the spare machine I am testing in the office area now. That'd be a "real" test of if this backup is useful in the event of a disk failure.

Quote:

The 4TB HGST drives have arrived. I've started the FB1 dd backup process. Should take a day or so.

 

  13379   Thu Oct 12 14:42:45 2017 gautamUpdateCDSslow machine bootfest

Steve reported problems getting the X arm locked. Alignment sliders were inaccessible. Eurocrate key turning reboots for c1susaux, c1auxex,c1auxey, c1iscaux and c1aux. Usual precautions were taken for ITMX.

This is becoming a once-a-week thing sad.

  13381   Mon Oct 16 12:13:38 2017 gautamUpdateCDSMegatron maintenance

Wall StripTool traces showed that IMC has not been locked for at least 8 hours when I came in this morning. Going to the IMC autolocker log, it looks like the last timestamp was at ~6pm yesterday. Megatron was responding to ping, but I couldn't ssh into it. So I went over to the machine and did a hard-reboot via front panel power switch. The computer took ~10mins to come back online and respond to ping. Once it did, I was able to ssh into it. However, trying the usual commands to restart the IMC autolocker and FSS Slow loops didn't work. Specifically, monitoring the logfile with tail -f Autolocker.log, I would see that the autolocker seemed to get stuck after starting the "blinky" script. Trying to restart the process using sudo initctl restart MCautolocker, init would print to shell that the restart had worked, and reported the PID, but the logfile wouldn't update "live" as it should when tail is used with the -f option. All very strange frown.

Anyways, as a last resort, I kill -9'ed the PID for the init instance, and init automatically restarted the Autolocker - this did the trick, IMC is locked now and logfile seems to be getting updated normallyyes.

I also cleared a bunch of matlab crash dump files in the home directory.

  13385   Tue Oct 17 23:07:52 2017 gautamUpdateCDSFEs unresponsive

While working on the IFO tonight, I noticed that the blinky status lights on c1iscex and c1iscey were frozen (but those on the other 3 FEs seemed fine). But all other lights on the CDS overview screen were green I couldn't access testpoints from these machines, and the EPICS readbacks for models on these FEs (e.g. Oplev servo inputs outputs etc) were frozen at some fixed value. This lasted for a good 5 minutes at least. But the blinky lights started blinking again without me doing anything. Not sure what to make of this. I am also not sure how to diagnose this problem, as trending the slow EPICS records of the CPU execution cycle time (for example) doesn't show any irregularity.

  13386   Wed Oct 18 01:41:32 2017 jamieUpdateCDSFEs unresponsive
Quote:

While working on the IFO tonight, I noticed that the blinky status lights on c1iscex and c1iscey were frozen (but those on the other 3 FEs seemed fine). But all other lights on the CDS overview screen were green I couldn't access testpoints from these machines, and the EPICS readbacks for models on these FEs (e.g. Oplev servo inputs outputs etc) were frozen at some fixed value. This lasted for a good 5 minutes at least. But the blinky lights started blinking again without me doing anything. Not sure what to make of this. I am also not sure how to diagnose this problem, as trending the slow EPICS records of the CPU execution cycle time (for example) doesn't show any irregularity.

So this wasn't just an EPICS freeze?  I don't see how this had anything to do with any of the work I did earlier today.  I didn't modify any of the running front ends, didn't touch either of the end station machines or the DAQ, and didn't modify the network in any way.  I didn't leave anything running.

If you couldn't access test points then it sounds like it was more than just EPICS.  It sounds like maybe the end machines somehow fell of the network momentarily.  Was there anything else going on at the time?

  13387   Wed Oct 18 02:09:32 2017 gautamUpdateCDSFEs unresponsive

I was looking at the ASDC channel on dataviewer, and toggling various settings like whitening gain. At some point, the signal just froze. So I quit dataviewer and tried restarting it, at which point it complained about not being able to connect to FB. This is when I brought up the CDS_OVERVIEW medm screen, and noticed the frozen 1pps indicator lights. There was certainly something going on with the end FEs, because I was able to ping the machine, but not ssh into it. Once the 1pps lights came back, I was able to ssh into c1iscex and c1iscey, no problems.

Could it be that some of the mx processes stalled, but the systemctl routine automatically restarted them after some time? 

Quote:

So this wasn't just an EPICS freeze?  I don't see how this had anything to do with any of the work I did earlier today.  I didn't modify any of the running front ends, didn't touch either of the end station machines or the DAQ, and didn't modify the network in any way.  I didn't leave anything running.

If you couldn't access test points then it sounds like it was more than just EPICS.  It sounds like maybe the end machines somehow fell of the network momentarily.  Was there anything else going on at the time?

 

  13388   Wed Oct 18 09:21:22 2017 jamieUpdateCDSFEs unresponsive
Quote:

I was looking at the ASDC channel on dataviewer, and toggling various settings like whitening gain. At some point, the signal just froze. So I quit dataviewer and tried restarting it, at which point it complained about not being able to connect to FB. This is when I brought up the CDS_OVERVIEW medm screen, and noticed the frozen 1pps indicator lights. There was certainly something going on with the end FEs, because I was able to ping the machine, but not ssh into it. Once the 1pps lights came back, I was able to ssh into c1iscex and c1iscey, no problems.

Could it be that some of the mx processes stalled, but the systemctl routine automatically restarted them after some time?

An mx_stream glitch would have interrupted data flowing from the front end to the DAQ, but it wouldn't have affected the heartbeat.  The heartbeat stop could mean either that the front end process froze, or the EPICS communication stopped.  The fact that everything came back fine after a couple of minutes indicates to me that the front end processes all kept running fine.  If they hadn't I'm sure the machines would have locked up.  The fact that you couldn't connect to the FE machine is also suspicious.

My best guess is that there was a network glitch on the martian network.  I don't know how to account for the fact that pings still worked, though.

  13394   Wed Oct 18 23:11:53 2017 gautamUpdateCDSFEs unresponsive

This happened again just now - it was roughly this time when this happened last night as well.

There was certainly an EPICS freeze of the kind we were used to seeing prior to replacing the martian wireless router sometime in late 2015 (or early 2016?). I was trying to run the dither alignment servos on the Y-arm at this time, and all the StripTool traces flatlined.

I took the opportunity to try accessing testpoints from the iscey ADCs - specifically C1:SUS-TRY_OUT, and it seemed to work just fine. However, I couldn't ssh into c1iscey.

Looking at the dmesg once I was able to ssh in eventually (~2 minutes deadtime tonight, I feel like it was longer yesterday but can't quantify), I see the following: not sure if there are any clues in here, or whether this is the correct log to check. But there are many instances of the nfs server related message in the log. Note that the system time-stamp corresponds to when this freeze happened.

[5461308.784018] nfs: server 192.168.113.201 not responding, still trying
[5461412.936284] nfs: server 192.168.113.201 OK
[5461412.937130] systemd[1]: Starting Journal Service...
[5461412.947947] systemd-journald[20281]: Received SIGTERM from PID 1 (systemd).
[5461412.996063] systemd[1]: Unit systemd-journald.service entered failed state.
[5461413.002627] systemd[1]: systemd-journald.service has no holdoff time, scheduling restart.
[5461413.008983] systemd[1]: Stopping Journal Service...
[5461413.014664] systemd[1]: Starting Journal Service...
[5461413.044262] systemd[1]: Started Journal Service.
[5461413.694838] systemd-journald[400]: Received request to flush runtime journal from PID 1

 

ELOG V3.1.3-