40m QIL Cryo_Lab CTN SUS_Lab TCS_Lab OMC_Lab CRIME_Lab FEA ENG_Labs OptContFac Mariner WBEEShop
  40m Log, Page 52 of 341  Not logged in ELOG logo
ID Date Author Type Categoryup Subject
  12763   Fri Jan 27 17:49:41 2017 jamieUpdateCDStest of new daqd code on fb1

Just FYI I'm running a test of updated daqd code on fb1. 

fb1 has it's own fiber to the daq network switch, so nothing had to be modified to do this test. This *should* not affect anything in the rest of the system, but as we all know these are famous last words....  If something is going haywire, and you can't get in touch with me and can't figure what else to do, you can just log on to fb1 and shut it down.  It's not writing any data to any of the network filesystems.

The daqd code under test is from the latest advLigoRTS 3.2.1 tag, which has daqd stability fixes that will hopefully address the problems we were seeing last time I tried this upgrade.  We'll see...

I'm going to let it run over the weekend, and will check in periodically.

  12765   Fri Jan 27 20:52:36 2017 gautamUpdateCDStest of new daqd code on fb1

I'm not sure if this is related, but since today morning, I've noticed that the data concentrator errors have returned. Looking at daqd.log, there is a 1 second timing mismatch error that is being generated. Usually, manually running ntpdate on the front ends fixes this problem, but it did not work today.

Attachment 1: DCerrors.png
DCerrors.png
  12766   Fri Jan 27 21:21:35 2017 gautamUpdateCDSc1pem revamped

The coil and PD BLRMS are useful tools in identifying when glitches occur in the PD  readout, I thought it would be good to install them for ITMY, ETMX and SRM (since I plan to switch the MC3 satellite box, which we suspect to be problematic, with the SRM one). For this purpose, I had to install some IPC SHMEM blocks in C1SUS and recompile. 24 IPC channels were added to pipe the coil, PD and Oplev signals from C1SUS to C1PEM - the recompilation went smoothly, and it doesn't look like the model computation time has increased significantly or that the model is any closer to timing out.

However, I was unable to install the BLRMS blocks in C1PEM, as when I tried to compile the model with BLRMS for these extra 24 channels, I got a compilation error saying that I have exceeded the maximum allowed 499 testpoints per channel. Is there any workaround to this? It would be possible to create a custom BLRMS block that doesn't have all those testpoints, maybe this is the way to go? Especially if we want to install these channels for all our SOS optics, and also replace the current Seismic BLRMS with this scheme for consistency?

GV edit: I have implemented this scheme - after backing up the original BLRMS_2k part, I made a new one with no testpoints and only EPICS readouts. Doing so allowed me to recompile c1pem without any issues, the CPU time seems to have gone up by 3us from ~55us to 58us. So the BLRMS data record is only available at 16Hz, since there are no DQ channels in the BRLMS block - do we want these in any case? Let's see how this does over the weekend...

  12769   Sat Jan 28 12:05:57 2017 jamieUpdateCDStest of new daqd code on fb1
Quote:

I'm not sure if this is related, but since today morning, I've noticed that the data concentrator errors have returned. Looking at daqd.log, there is a 1 second timing mismatch error that is being generated. Usually, manually running ntpdate on the front ends fixes this problem, but it did not work today.

If this problem started before ~4pm on Friday then it's probably unrelated, since I didn't start any of these tests until after that.  If unexplained problem persist then we can try shutting of the fb1 daqd and see if that helps.

  12770   Mon Jan 30 18:41:41 2017 jamieUpdateCDSTEST ABORTED of new daqd code on fb1

I just aborted the fb1 test and reverted everything to the nominal configuration.  Everything looks to be operating nominally.  Front ends are mostly green except for c1rfm and c1asx which are currently not being acquired by the DAQ, and an unknown IPC error with c1daf.  Please let me know if any unusual problems are encountered.

The behavior of daqd on fb1 with the latest release (3.2.1) was not improved.  After turning on the full pipe it was back to crashing every 10 minutes or so when the full and second trend frames were being written out.  lame.  back to the drawing board...

  12776   Tue Jan 31 15:08:13 2017 ericqMetaphysicsCDSMinute Trend Koan

A novice was learning at the feet of Master Daqd. At the end of the lesson he looked through his notes and said, “Master, I have a few questions. May I ask them?”

Master Daqd nodded.

"Do we record minute trends of our data?"

"Yes, we record raw minute trends in /frames/trend/minute_raw"

"I see. Do we back up minute trends?"

"Yes, we back up all frames present in /frames/trend/minute"

"Wait, this means we are not recording our current trends! What is the reason for the existence of seperate minute and minute_raw trends?

“The knowledge you seek can be answered only by the gods.”

"Can we resume recording the minute trends?"

Master Daqd nodded, turned, and threw himself off the railing, falling to his death on the rocks below.

Upon seeing this, the novice was enlightened. He proceeded to investigate how to convert raw minute trends to minute trends so that historical records could be preserved, and precisely when Master Daqd started throwing himself off the mountain when asked to record minute trends.

  12777   Tue Jan 31 17:28:36 2017 ranaSummaryCDSMinute Trend Koan

Someone installed "Debian" on allegra. Why? Dataviewer doesn't work on there. Is there some advantage to making this thing have a different OS than the others? Any objections to going back to Ubuntu12?

  12779   Tue Jan 31 20:25:26 2017 ericqSummaryCDSMinute Trend Koan
Quote:

Someone installed "Debian" on allegra. Why? Dataviewer doesn't work on there. Is there some advantage to making this thing have a different OS than the others? Any objections to going back to Ubuntu12?

My elog negligence punchcard is getting pretty full... It's pretty much for the same reason as using Debian for optimus; much of the workstation software is getting packaged for Debian, which could offload our need for setting things up in a custom 40m way. Hacking the debian-focused software.ligo.org repos into Ubuntu has caused me headaches in the past. Allegra wasn't being used often, so I figured it was a good test bed for trying things out.

The dataviewer issue was dataviewer's inability to pull the `fb` out of `fb:8088` in the NDSSERVER env variable. I made a quick fix for it in the dataviewer launching script, but there is probably a better way to do it.

  12781   Tue Jan 31 22:15:02 2017 JohannesUpdateCDSvme crate backplane adapter boards

I made a crude sketch for how Lydia and I envision the connector situation on the back of the vme crates to be solved. Essentially the side panels of each crate extend about 2" (52 mm) beyond the edge of the DIN connectors. This is plenty of space for a simple PCB board. The connector of choice is D-Sub. We can split the 64 used pins into 2x 37 D-Sub OR (2x25 pin + 1x15pin). The former has fewer cables, but a few excess unused leads. A quick google search showed me that it is much cheaper to get twisted pair cables for 15 and 25 pin D-Subs. From what I remember, the used pins on the DIN connectors are concentrated on the low numbers end and the high numbers end, so might not need the 'middle' connector in many cases if we decide to break it up into three. I have to check this with Lydia though.

The D-Sub connectors would be panel mounted, for which we need a narrow panel piece with dsub cutouts. We can run horizontal struts across the vme crate from side panel to side panel. This way the force upon cable (dis)connection is mostly on the panel which is attached to the struts which are attached to the crate. This will also prevent gravitational sag or cable strain from pulling on the DIN connection, and we can use twisted pair cables with backshell, screws, and strain reliefs.

I was lookng into getting started with the PCB when Altium complained that the license is expired and to renew it. This is a relatively simple board layout so some free software out there is probably enough.

Attachment 1: vme_backplane_conn_sketch.jpg
vme_backplane_conn_sketch.jpg
  12791   Thu Feb 2 18:28:29 2017 ranaSummaryCDSMinute Trend Koan

and the song remains the same...

the version of SVN on these workstations is ahead of the one on the other workstations so now we can't do 'svn up' on any of the Ubuntu12 machines. One allegra and optimus I get this error:

controls@allegra|GWsummaries> svn up
Updating '.':
svn: E180001: Unable to connect to a repository at URL 'file:///cvs/cds/caltech/svn/trunk/GWsummaries'
svn: E180001: Unable to open an ra_local session to URL
svn: E180001: Unable to open repository 'file:///cvs/cds/caltech/svn/trunk/GWsummaries'

Quote:
Quote:

Someone installed "Debian" on allegra. Why? Dataviewer doesn't work on there. Is there some advantage to making this thing have a different OS than the others? Any objections to going back to Ubuntu12?

My elog negligence punchcard is getting pretty full... It's pretty much for the same reason as using Debian for optimus; much of the workstation software is getting packaged for Debian, which could offload our need for setting things up in a custom 40m way. Hacking the debian-focused software.ligo.org repos into Ubuntu has caused me headaches in the past. Allegra wasn't being used often, so I figured it was a good test bed for trying things out.

The dataviewer issue was dataviewer's inability to pull the `fb` out of `fb:8088` in the NDSSERVER env variable. I made a quick fix for it in the dataviewer launching script, but there is probably a better way to do it.

I'm not sure if its possible to downgrade our chans repo back to the old one, but I highly recommend that no one do 'svn upgrade' in any of our repos until we remove all of the Debian installs in the 40m lab or hire a full-time sysadmin.

  12794   Fri Feb 3 11:03:06 2017 jamieUpdateCDSmore testing fb1; DAQ DOWN DURING TEST

More testing of fb1 today.  DAQ DOWN UNTIL FURTHER NOTICE.

Testing Wednesday did not resolve anything, but Jonathan Hanks is helping.

  12796   Fri Feb 3 11:40:34 2017 ericqSummaryCDS/cvs/cds/caltech/chans back on svn1.6

I was able to bring back svn 1.6 formatting to /cvs/cds/caltech/chans by doing the following on nodus:

cd /cvs/cds/caltech
mkdir newchans
cd newchans
svn co https://nodus.ligo.caltech.edu:30889/svn/trunk/chans ./
rm -rf ../chans/.svn
mv ./.svn ../chans/

Note that I used the http address for the repository. The svn repository doesn't live at file:///cvs/cds/caltech/svn anymore; all of our checkouts (e.g. in the scripts directory) use http to get the one true repo location, regardless of where it lives on nodus' filesystem. (I suppose we could also use https://nodus.martian:30889/svn to stick to the local network, but I don't think we're that limited by the caltech network speed)

Presumably, at some point we will want to introduce a newer operating system into the 40m, as ubuntu 12.04 hits end-of-life in April 2017. Ubuntu 16.04 includes svn 1.8, so we'll also hit this issue if we choose that OS. 


Aside from the svn issues, this directory (/cvs/cds/caltech/chans) only contains pre-2010 channels. Filters and DAQ ini files currently live in /opt/rtcds/caltech/c1/chans, which is not under version control. It's also not clear to me why summary page configurations should be kept in this /cvs/cds place.

  12797   Sat Feb 4 12:00:59 2017 ranaSummaryCDS/cvs/cds/caltech/chans back on svn1.6

True - its an issue. Koji and I are updating zita into Ubuntu16 LTS. If it looks like its OK with various tools we'll swap over the others into it. Until then I figure we're best off turning allegra back into Ubuntu12 to avoid a repeat of this kind of conflict. Once the workstations in the LLO control room are running smoothly on a new OS for a year, we can transfer into that. I don't think any of us wants to be the CDS beta tester for DV or DTT.

  12798   Sat Feb 4 12:20:39 2017 jamieSummaryCDS/cvs/cds/caltech/chans back on svn1.6
Quote:

True - its an issue. Koji and I are updating zita into Ubuntu16 LTS. If it looks like its OK with various tools we'll swap over the others into it. Until then I figure we're best off turning allegra back into Ubuntu12 to avoid a repeat of this kind of conflict. Once the workstations in the LLO control room are running smoothly on a new OS for a year, we can transfer into that. I don't think any of us wants to be the CDS beta tester for DV or DTT.

Just to be clear, since there seems to be some confusion, the SVN issue has nothing to do with Debian vs. Ubuntu.  SVN made non-backwards compatible changes to their working copy data format that breaks newer checkouts with older clients.  You will run into the exact same problem with newer Ubuntu versions.

I recommend the 40m start moving towards the reference operating systems (Debian 8 or SL7) as that's where CDS is moving.  By moving to newer Ubuntu versions you're moving away from CDS support, not towards it.

  12799   Sat Feb 4 12:29:20 2017 jamieSummaryCDS/cvs/cds/caltech/chans back on svn1.6

No, not confused on that point. We just will not be testing OS versions at the 40m or running multiple OS's on our workstations. As I've said before, we will only move to so-called 'reference' systems once they've been in use for a long time.

Quote:
Quote:

True - its an issue. Koji and I are updating zita into Ubuntu16 LTS. If it looks like its OK with various tools we'll swap over the others into it. Until then I figure we're best off turning allegra back into Ubuntu12 to avoid a repeat of this kind of conflict. Once the workstations in the LLO control room are running smoothly on a new OS for a year, we can transfer into that. I don't think any of us wants to be the CDS beta tester for DV or DTT.

Just to be clear, since there seems to be some confusion, the SVN issue has nothing to do with Debian vs. Ubuntu.  SVN made non-backwards compatible changes to their working copy data format that breaks newer checkouts with older clients.  You will run into the exact same problem with newer Ubuntu versions.

I recommend the 40m start moving towards the reference operating systems (Debian 8 or SL7) as that's where CDS is moving.  By moving to newer Ubuntu versions you're moving away from CDS support, not towards it.

 

  12800   Sat Feb 4 12:50:01 2017 jamieSummaryCDS/cvs/cds/caltech/chans back on svn1.6
Quote:

No, not confused on that point. We just will not be testing OS versions at the 40m or running multiple OS's on our workstations. As I've said before, we will only move to so-called 'reference' systems once they've been in use for a long time.

Ubuntu16 is not to my knowledge used for any CDS system anywhere.  I'm not sure how you expect to have better support for that.  There are no pre-compiled packages of any kind available for Ubuntu16.  Good luck, you big smelly doofuses. Nyah, nyah, nyah.

  12803   Mon Feb 6 15:18:08 2017 gautamUpdateCDSslow machine bootfest

Had to reboot c1psl, c1susaux, c1auxex, c1auxey and c1iscaux today. PMC has been relocked. ITMX didn't get stuck. According to this thread, there have been two instances in the last 10 days in which c1psl and c1susaux have failed. Since we seem to be doing this often lately, I've made a little script that uses the netcat utility to check which slow machines respond to telnet, it is located at /opt/rtcds/caltech/c1/scripts/cds/testSlowMachines.bash.

The script can be executed by ./testSlowMachines.bash.

  12810   Tue Feb 7 19:14:59 2017 JohannesUpdateCDSvme crate backplane adapter board layout

After fighting with Altium for what seems like an eternity I have finished putting my vision of the vme crate backplane adapter board into an electronic format. It is dimensioned to fill the back space of the crate exactly. The connectors are panel mount and the PCB attaches to the connectors with screws, such that the whole thing will be mechanically much more stable than the current configuration. A mounting bracket will attach to horizontal struts that need to be installed in the crates, mechanical drawings to follow.

Attachment 1: vme_backplane.pdf
vme_backplane.pdf
  12893   Mon Mar 20 11:18:58 2017 gautamUpdateCDSNo internet connectivity on control room machines

There is no internet connectivity on any of the control room machines. 

I have been trying to debug by tracing the cabling situation in the rack in the office area, and will update if/when this problem has been resolved. I had last come into the lab on Saturday and there was no problem then. There 40m wireless network servicing the office area seems to work fine.

 

  12894   Mon Mar 20 14:39:44 2017 gautamUpdateCDSNo internet connectivity on control room machines

Koji diagnosed that the NAT router was to blame for this problem. I simply power cycled this router, and now the connectivity has been restored. 

It was possible to log into nodus and then to pianosa - and it was also possible to log into the various control room machines once logged into nodus. However, the outward packets seemed to not get transmitted. Anyways, power cycling the NAT Router unit seems to have done the job.

Quote:

There is no internet connectivity on any of the control room machines. 

I have been trying to debug by tracing the cabling situation in the rack in the office area, and will update if/when this problem has been resolved. I had last come into the lab on Saturday and there was no problem then. There 40m wireless network servicing the office area seems to work fine.

 

 

  12914   Tue Mar 28 21:06:53 2017 ranaSummaryCDS/cvs/cds/caltech/chans back on svn1.6

Debian doesn't like EPICS. Or our XY plots of beam spots...Sad!

Quote:
Quote:

No, not confused on that point. We just will not be testing OS versions at the 40m or running multiple OS's on our workstations. As I've said before, we will only move to so-called 'reference' systems once they've been in use for a long time.

Ubuntu16 is not to my knowledge used for any CDS system anywhere.  I'm not sure how you expect to have better support for that.  There are no pre-compiled packages of any kind available for Ubuntu16.  Good luck, you big smelly doofuses. Nyah, nyah, nyah.

  12924   Mon Apr 3 17:09:47 2017 gautamUpdateCDSC1PSL burt-restored

When I came in this morning, Steve had re-locked the PMC and IMC - but I could see a ~1Hz intensity fluctuation on the PMC REFL video monitor. I unlocked the PMC and tried to re-lock it, but couldn't using the usual prescription of turning the servo gain down and moving the DC bias slider around. I checked the status of the slow machines - all were responding to pings and could be telnet'ed into, so that didn't seem to be the problem. In the past, this sort of behaviour was characteristic of the infamous "sticky slider" problem - so I simply burt-restored c1psl using a snapshot from 29 March, after which I could easily re-lock the PMC. The transmitted light level looked normal on the scope on the PSL table, and the PMC REFL video monitor also look normal now.

  12980   Wed May 10 12:37:41 2017 gautamUpdateCDSMCautolocker dead

The MCautolocker had stalled - there were no additional lines to the logfile after 12:17pm (~20mins ago). Normally, it suffices to ssh into megatron and run sudo initctl restart MCautolocker - but it seems that there was no running initctl instance of this, so I had to run sudo initctl start MCautolocker. The FSS Slow control initctl process also seemed to have been terminated, so I ran sudo initctl start FSSslowPy.

It is not clear to me why the initctl instances got killed in the first place, but MC locks fine now.

  12982   Wed May 10 16:57:52 2017 ranaUpdateCDSMCautolocker dead

I rebooted megatron around 12:20 today. It had dozens of stalled medm process (some of them there since February!). I couldn't kill them without them coming back like zombies, so I did sudo reboot.

  12991   Mon May 15 08:26:43 2017 ranaUpdateCDSSVN up in userapps/cds

I did an 'svn update' in userapps/cds/ which pulled in some changes from the sites as well as various CDS utilities in common/ and utilities/

This was to get Keith Thorne's get_data.m and get_data2.m scripts which I tested and they seem to be able to get data. No success with getting minute trend yet, but that may be a user error.

Update Monday 15-May: Our version of NDS client is 0.10 and we need to have 0.14 for this new method to work. Ubuntu12 lscsoft repo doesn't have newer nds client so we'll have to upgrade some OS.

  13012   Thu May 25 12:22:59 2017 gautamUpdateCDSslow machine bootfest

After ~3months without any problems on the slow machine front, I had to reboot c1psl, c1susaux and c1iscaux today. The control room StripTool traces were not being displayed for all the PSL channels so I ran testSlowMachines.bash to check the status of the slow machines, which indicated that these three slow machines were dead. After rebooting the slow machines, I had to burt-restore the c1psl snapshot as usual to get the PMC to lock. Now, both PMC and IMC are locked. I also had to restart the StripTool traces (using scripts/general/startStrip.sh) to get the unresponsive traces back online.

Steve tells me that we probably have to do a reboot of the vacuum slow machines sometime soon too, as the MEDM screen for the Vacuum indicator channels are unresponsive.

Quote:

Had to reboot c1psl, c1susaux, c1auxex, c1auxey and c1iscaux today. PMC has been relocked. ITMX didn't get stuck. According to this thread, there have been two instances in the last 10 days in which c1psl and c1susaux have failed. Since we seem to be doing this often lately, I've made a little script that uses the netcat utility to check which slow machines respond to telnet, it is located at /opt/rtcds/caltech/c1/scripts/cds/testSlowMachines.bash.

 

 

  13028   Thu Jun 1 15:37:01 2017 gautamUpdateCDSslow machine bootfest

Steve alerted me that the IMC wouldn't lock. Reboots for c1susaux, c1iool0 today. I tried using the reset button instead of keying the crates. This worked for c1iool0, but not for c1susaux. So I had to key the latter crate. The machine took a good 5-10 minutes before coming back up, but eventually it did. Now IMC locks fine.

  13059   Mon Jun 12 10:34:10 2017 gautamUpdateCDSslow machine bootfest

Reboots for c1susaux, c1iscaux, c1auxex today. I took this opportunity to squish the Sat. Box. Cabling for MC2 (both on the Sat box end and also the vacuum feedthrough) as some work has been recently ongoing there, maybe something got accidently jiggled during the process and was causing MC2 alignment to jump around.

Relocked PMC to offload some of the DC offset, and re-aligned IMC after c1susaux reboot. PMC and IMC transmission back to nominal levels now. Let's see if MC2 is better behaved after this sat. box. voodoo.

Interestingly, since Feb 6, there were no slow machine reboots for almost 3 months, while there have been three reboots in the last three weeks. Not sure what (if anything) to make of that.

  13069   Fri Jun 16 13:53:11 2017 gautamUpdateCDSslow machine bootfest

Reboots for c1psl, c1iool0, c1iscaux today. MC autolocker log was complaining that the C1:IOO-MC_AUTOLOCK_BEAT EPICS channel did not exist, and running the usual slow machine check script revealed that these three machines required reboots. PMC was relocked, IMC Autolocker was restarted on Megatron and everything seems fine now.

 

  13096   Wed Jul 5 16:09:34 2017 gautamUpdateCDSslow machine bootfest

Reboots for c1susaux, c1iscaux today.

 

  13125   Wed Jul 19 08:37:21 2017 JamieUpdateCDSUpdate on front-end/DAQ rebuild

After the catastrophic fb disk failure last week we lost essentially the entire front end system (not any of the userapp code, but the front end boot server, operating system, and DAQ).  The fb disk was entirely unrecoverable, so we've been trying to rebuild everything from the bits and pieces lying around, and some disks that Keith Thorne sent from LLO.  We're trying to get the front ends working first, and will work on recovering daqd after.

Luckily, fb1, which was being configured as an fb replacement, is mostly fully configured, including having a copy of the front end diskless root image.  We setup fb1 as the new boot server, and were able to get front ends booting again.  Unfortunately, we've been having trouble running and building models, so something is still amis.  We've been taking a three-pronged approach to getting the front ends running:

  • /diskless/root.fb: This involves booting the front ends from the backup of the diskless root from fb.  Runs gentoo kernel 2.6.34.1.  This should correspond to the environment that all models were built and running against.  But something is missing in the configuration.  The front ends were also mounting /opt from fb, which included the dolphin drivers, and we don't have a copy of that, so models aren't loading or recompiling.
  • /diskless/root.x1boot: Keith sent a disk image of the entire x1boot server from LLO.  It uses gentoo kernel 3.0.8.  This ostensibly includes everything we should need to run the front ends, but it's unfortunately configured with newer versions of some of the software and also isn't loading our existing models or building new ones.  This also seems to be having issues with the dolphin drivers.
  • /diskless/root.jessie: This is an entirely new boot image build from scratch with Debian jessie, using an RTS-patched 3.2 kernel.  This would use the latest versions of everything.  It's mostly working, we just need to rebuild the dolphin driver and source.

It seems that in all cases we need to rebuild the dolphin drivers from source.

  13127   Wed Jul 19 14:26:50 2017 JamieUpdateCDSUpdate on front-end/DAQ rebuild

 

Quote:

After the catastrophic fb disk failure last week we lost essentially the entire front end system (not any of the userapp code, but the front end boot server, operating system, and DAQ).  The fb disk was entirely unrecoverable, so we've been trying to rebuild everything from the bits and pieces lying around, and some disks that Keith Thorne sent from LLO.  We're trying to get the front ends working first, and will work on recovering daqd after.

Luckily, fb1, which was being configured as an fb replacement, is mostly fully configured, including having a copy of the front end diskless root image.  We setup fb1 as the new boot server, and were able to get front ends booting again.  Unfortunately, we've been having trouble running and building models, so something is still amis.  We've been taking a three-pronged approach to getting the front ends running:

  • /diskless/root.fb: This involves booting the front ends from the backup of the diskless root from fb.  Runs gentoo kernel 2.6.34.1.  This should correspond to the environment that all models were built and running against.  But something is missing in the configuration.  The front ends were also mounting /opt from fb, which included the dolphin drivers, and we don't have a copy of that, so models aren't loading or recompiling.
  • /diskless/root.x1boot: Keith sent a disk image of the entire x1boot server from LLO.  It uses gentoo kernel 3.0.8.  This ostensibly includes everything we should need to run the front ends, but it's unfortunately configured with newer versions of some of the software and also isn't loading our existing models or building new ones.  This also seems to be having issues with the dolphin drivers.
  • /diskless/root.jessie: This is an entirely new boot image build from scratch with Debian jessie, using an RTS-patched 3.2 kernel.  This would use the latest versions of everything.  It's mostly working, we just need to rebuild the dolphin driver and source.

It seems that in all cases we need to rebuild the dolphin drivers from source.

To clarify, we're able to boot the x1boot image with the existing 2.6.25 kernel that we have from fb.  The issue with the root.x1boot image is not the kernel version but some of the other support libraries, such as dolphin.

  13130   Fri Jul 21 18:03:17 2017 JamieUpdateCDSUpdate on front-end/DAQ rebuild

Update:

  • front ends booting with the new Debian jessie diskless root image and a linux 3.2 version of the RTS-patched kernel
  • dolphin is configured correctly and running on c1lsc and c1sus
  • models building and running with RCG 3.0.3

Up next:

  • add c1ioo to the dolphin network
  • recompile/restart all front end models
  • daqd

I'll try to get the first two of those done tomorrow, although it's unclear what model updates we'll have to do to get things working with the newer RCG.

 

  13133   Sun Jul 23 22:16:55 2017 Jamie, gautamUpdateCDSfront-end now running with new OS, RCG

All front ends and model are (mostly) running now

All suspensions are damped:

It should be possible at this point to do more recovery, like locking the MC.

Some details on the restore process:

  • all models were recompiled with the new RCG version 3.0.3
  • the new RCG does stricter simulink drawing checks, and was complaining about unterminated outputs in some of the SUS models.  Terminated all outputs it was concerned about and saved.
  • RCG 3.0 requires a new directory for doing better filter module diagnostics: /opt/rtcds/caltech/c1/chans/tmp
  • had to reset the slow machines c1susaux, c1auxex, c1auxey

The daqd is not yet running.  This is the next task.

I have been taking copious notes and will fully document the restore process once complete.

c1ioo issues

c1ioo has been giving us a little bit of trouble.  The c1ioo model kept crashing and taking down the whole c1ioo host.  We found a red light on one of the ADCs (ADC1).  We pulled the card and replaced it with a spare from the CDS cabinet.  That seemed to fix the problem and c1ioo became more stable.

We've still been seeing a lot of glitching in c1ioo, though, with CPU cycle times frequently (every couple of seconds) running above threshold for all models, up to 200 us.  I tried unloading every kernel module I could and shutting down every non-critical process, but nothing seemed to help.

We eventually tried stopping the c1ioo model altogether and that seemed to help quite a bit, dropping the long cycle rate down to something like one every 30 seconds or so.  Not sure what that means.  We should look into the BIOS again, to see if there could be something interacting with the newer kernel.

So currently the c1ioo model is not running (which is why it's all white in the CDS overview snapshot above).  The fact that c1ioo is not running and the remaining models are still occaissionly glitching is also causing various IPC errors on auxilliary models (see c1mcs, c1rfm, c1ass, c1asx). 

RCG compile warnings

the new RCG tries to do more checks on custom c code, but it seems to be having trouble finding our custom "ccodeio.h" files that live with the c definitions in USERAPPS/*/common/src/.  Unclear why yet.  This is causing the RCG to spit out warnings like the following:

Cannot verify the number of ins/outs for C function BLRMS.
    File is /opt/rtcds/userapps/release/cds/c1/src/BLRMSFILTER.c
    Please add file and function to CDS_SRC or CDS_IFO_SRC ccodeio.h file.

This are just warnings and will not prevent the model form compiling or warning.  We'll figure out what the problem is to make these go away, but they can be ignored for the time being.

model unload instability

Probably the worst problem we're facing right now is an instability that will occaissionally, but not always, cause the entire front end host to freeze up upon unloading an RTS kernel module.  This is a known issue with the newer linux kernels (we're using kernel version 3.2.35), and is being looked into.

This is particularly annoying with the machines on the dolphin network, since if one of the dolphin hosts goes down it manages to crash all the models reading from the dolphin network.  Since half the time they can't be cleanly restarted, this tends to cause a boot fest with c1sus, c1lsc, and c1ioo.  If this happens, just restart those machines, wait till they've all fully booted, then restart all the models on all hosts with "rtcds start all".

  13135   Mon Jul 24 10:45:23 2017 gautamUpdateCDSc1iscex models died

This morning, all the c1iscex models were dead. Attachment #1 shows the state of the cds overview screen when I came in. The machine itself was ssh-able, so I just restarted all the models and they came back online without fuss.

Quote:

All front ends and model are (mostly) running now

Attachment 1: c1iscexFailure.png
c1iscexFailure.png
  13136   Mon Jul 24 10:59:08 2017 JamieUpdateCDSc1iscex models died
Quote:

This morning, all the c1iscex models were dead. Attachment #1 shows the state of the cds overview screen when I came in. The machine itself was ssh-able, so I just restarted all the models and they came back online without fuss.

This was me.  I had rebooted that machine and hadn't restarted the models.  Sorry for the confusion.

  13138   Mon Jul 24 19:28:55 2017 JamieUpdateCDSfront end MX stream network working, glitches in c1ioo fixed

MX/OpenMX network running

Today I got the mx/open-mx networking working for the front ends.  This required some tweaking to the network interface configuration for the diskless front ends, and recompiling mx and open-mx for the newer kernel.  Again, this will all be documented.

controls@fb1:~ 0$ /opt/mx/bin/mx_info
MX Version: 1.2.16
MX Build: root@fb1:/opt/src/mx-1.2.16 Mon Jul 24 11:33:57 PDT 2017
1 Myrinet board installed.
The MX driver is configured to support a maximum of:
    8 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
===================================================================
Instance #0:  364.4 MHz LANai, PCI-E x8, 2 MB SRAM, on NUMA node 0
    Status:        Running, P0: Link Up
    Network:    Ethernet 10G

    MAC Address:    00:60:dd:43:74:62
    Product code:    10G-PCIE-8B-S
    Part number:    09-04228
    Serial number:    485052
    Mapper:        00:60:dd:43:74:62, version = 0x00000000, configured
    Mapped hosts:    6

                                                        ROUTE COUNT
INDEX    MAC ADDRESS     HOST NAME                        P0
-----    -----------     ---------                        ---
   0) 00:60:dd:43:74:62 fb1:0                             1,0
   1) 00:30:48:be:11:5d c1iscex:0                         1,0
   2) 00:30:48:bf:69:4f c1lsc:0                           1,0
   3) 00:25:90:0d:75:bb c1sus:0                           1,0
   4) 00:30:48:d6:11:17 c1iscey:0                         1,0
   5) 00:14:4f:40:64:25 c1ioo:0                           1,0
controls@fb1:~ 0$

c1ioo timing glitches fixed

I also checked the BIOS on c1ioo and found that the serial port was enabled, which is known to cause timing glitches.  I turned off the serial port (and some power management stuff), and rebooted, and all the c1ioo timing glitches seem to have gone away.

It's unclear why this is a problem that's just showing up now.  Serial ports have always been a problem, so it seems unlikely this is just a problem with the newer kernel.  Could the BIOS have somehow been reset during the power glitch?

In any event, all the front ends are now booting cleanly, with all dolphin and mx networking coming up automatically, and all models running stably:

Now for daqd...

  13139   Mon Jul 24 19:57:54 2017 gautamUpdateCDSIMC locked, Autolocker re-enabled

Now that all the front end models are running, I re-aligned the IMC, locked it manually, and then tweaked the alignment some more. The IMC transmission now is hovering around 15300 counts. I re-enabled the Autolocker and FSS Slow loops on Megatron as well.

Quote:

MX/OpenMX network running

Today I got the mx/open-mx networking working for the front ends.  This required some tweaking to the network interface configuration for the diskless front ends, and recompiling mx and open-mx for the newer kernel.  Again, this will all be documented.

 

  13145   Wed Jul 26 19:13:07 2017 JamieUpdateCDSdaqd showing same instability as before

I recompiled daqd on the updated fb1, similar to how I had before, and we're seeing the same instability: process crashes when it tries to write out the second trend (technically it looks like it crashes while it's trying to write out the full frame while the second trend is also being written out).  Jonathan Hanks and I are actively looking into it and i'll provide further report soon.

  13149   Fri Jul 28 20:22:41 2017 JamieUpdateCDSpossible stable daqd configuration with separate DC and FW

This week Jonathan Hanks and I have been trying to diagnose why the daqd has been unstable in the configuration used by the 40m, with data concentrator (dc) and frame writer (fw) in the same process (referred to generically as 'fb').  Jonathan has been digging into the core dumps and source to try to figure out what's going on, but he hasn't come up with anything concrete yet.

As an alternative, we've started experimenting with a daqd configuration with the dc and fw components running in separate processes, with communication over the local loopback interface.  The separate dc/fw process model more closely matches the configuration at the sites, although the sites put dc and fwprocesses on different physical machines.  Our experimentation thus far seems to indicate that this configuration is stable, although we haven't yet tested it with the full configuration, which is what I'm attempting to do now.

Unfortunately I'm having trouble with the mx_stream communication between the front ends and the dc process.  The dc does not appear to be receiving the streams from the front ends and is producing a '0xbad' status message for each.  I'm investigating.

  13152   Mon Jul 31 15:13:24 2017 gautamUpdateCDSFB ---> FB1

[jamie, gautam]

In order to test the new daqd config that Jamie has been working on, we felt it would be most convenient for the host name "fb" (martian network IP 192.168.113.202) to point to the physical machine "fb1" (martian network IP 192.168.113.201).

I made this change in /var/lib/bind/martian.hosts on chiara, and then ran sudo service bind9 restart. It seems to have done the job. So as things stand, both hostnames "fb" and "fb1" point to 192.168.113.201.

Now, when starting up DTT or dataviewer, the NDS server is automatically found.

More details to follow.

  13153   Mon Jul 31 18:44:40 2017 JamieUpdateCDSCDS system essentially fully recovered

The CDS system is mostly fully recovered at this point.  The mx_streams are all flowing from all front ends, and from all models, and the daqd processes are receiving them and writing the data to frames:

Remaining unresolved issues:

  • IFO needs to be fully locked to make sure ALL components of all models are working.
  • The remaining red status lights are from the "FB NET" diagnostics, which are reflecting a missing status bit from the front end processes due to the fact that they were compiled with an earlier RCG version (3.0.3) than the mx_streams were (3.3+/trunk).  There will be a new release of the RTS soon, at which point we'll compile everything from the same version, which should get us all green again.
  • The entire system has been fully modernized, to the target CDS reference OS (Debian jessie) and more recent RCG versions.  The management of the various RTS components, both on the front ends and on fb, have as much as possible been updated to use the modern management tools (e.g. systemd, udev, etc.).  These changes need to be documented.  In particular...
  • The fb daqd process has been split into three separate components, a configuration that mirrors what is done at the sites and appears to be more stable: The "target" directory for all of these components is now:
    • daqd_dc: data concentrator (receives data from front ends)
    • daqd_fw: receives frames from dc and writes out full frames and second/minute trends
    • daqd_rcv: NDS1 server (raises test points and receives archive data from frames from 'nds' process)
    The "target" directory for all of these new components is:
    • /opt/rtcds/caltech/c1/target/daqd
    All of these processes are now managed under systemd supervision on fb, meaning the daqd restart procedure has changed.  This needs to be simplified and clarified.
  • Second trend frames are being written, but for some reason they're not accessible over NDS.
  • Have not had a chance to verify minute trend and raw minute trend writing yet.  Needs to be confirmed.
  • Get wiper script working on new fb.
  • Front end RTS kernel will occaissionally crash when the RTS modules are unloaded.  Keith Thorne apparently has a kernel version with a different set of patches from Gerrit Kuhn that does not have this problem.  Keith's kernel needs to be packaged and installed in the front end diskless root.
  • The models accessing the dolphin shared memory will ALL crash when one of the front end hosts on the dolphin network goes away.  This results in a boot fest of all the dolphin-enabled hosts.  Need to figure out what's going on there.
  • The RCG settings snapshotting has changed significantly in later RCG versions.  We need to make sure that all burt backup type stuff is still working correctly.
  • Restoration of /frames from old fb SCSI RAID?
  • Backup of entirety of fb1, including fb1 root (/) and front end diskless root (/diskless)
  • Full documentation of rebuild procedure from Jamie's notes.
  13161   Thu Aug 3 00:59:33 2017 gautamUpdateCDSNDS2 server restarted, /frames mounted on megatron

[Koji, Nikhil, Gautam]

We couldn't get data using python nds2. There seems to have been many problems.

  1. /frames wasn't mounted on megatron, which was the nds2 server. Solution: added /frames 192.168.113.209(sync,ro,no_root_squash,no_all_squash,no_subtree_check) to /etc/exportfs on fb1, followed by sudo exportfs -ra. Using showmount -e, we confirmed that /frames was being exported.
  2. Edited /etc/fstab on megatron to be fb1:/frames/ /frames nfs ro,bg,soft 0 0. Tried to run mount -a, but console stalled.
  3. Used nfsstat -m on megatron. Found out that megatron was trying to mount /frames from old FB (192.168.113.202). Used sudo umount -f /frames to force unmount /frames/ (force was required).
  4. Re-ran mount -a on megatron.
  5. Killed nds2 using /etc/init.d/nds2 stop - didn't work, so we manually kill -9'ed it.
  6. Restarted nds2 server using /etc/init.d/nds2 start.
  7. Waited for ~10mins before everything started working again. Now usual nds2 data getting methods work.

I have yet to check about getting trend data via nds2, can't find the syntax. EDIT: As Jamie mentioned in his elog, the second trend data is being written but is inaccessible over nds (either with dataviewer, which uses fb as the ndsserver, or with python NDS, which uses megatron as the ndsserver). So as of now, we cannot read any kind of trends directly, although the full data can be downloaded from the past either with dataviewer or python nds2. On the control room workstations, this can also be done with cds.getdata.

  13162   Thu Aug 3 10:51:32 2017 ranaUpdateCDSNDS2 server restarted, /frames mounted on megatron

same issue on NODUS; I edited the /etc/fstab and tried mount -a, but it gives this error:

controls@nodus|~ 1> sudo mount -a
mount.nfs: access denied by server while mounting fb1:/frames

needs more debugging - this is the machine that allows us to have backed up frames in LDAS. Permissions issues from fb1 ?

  13163   Thu Aug 3 11:11:29 2017 gautamUpdateCDSNDS2 server restarted, /frames mounted on nodus

I added nodus' eth0 IP (192.168.113.200) to the list of allowed nfs clients in /etc/exportfs on fb1, and then ran sudo mount -a on nodus. Now /frames is mounted.

Quote:

needs more debugging - this is the machine that allows us to have backed up frames in LDAS. Permissions issues from fb1 ?

 

  13164   Thu Aug 3 19:46:27 2017 JamieUpdateCDSnew daqd restart procedure

This is the daqd restart procedure:

$ ssh fb1 sudo systemctl restart daqd_*

That will restart all of the daqd services (daqd_dc, daqd_fw, daqd_rcv).

The front end mx_stream processes should all auto-restart after the daqd_dc comes back up.  If they don't (models show "0x2bad" on DC0_*_STATUS) then you can execute the following to restart the mx_stream process on the front end:

$ ssh c1<host> sudo systemctl restart mx_stream

 

 

  13165   Thu Aug 3 20:15:11 2017 JamieUpdateCDSdataviewer can not raise test points

For some reason dataviewer is not able to raise test points with the new daqd setup, even though dtt can.  If you raise a test point with dtt then dataviewer can show the data fine.

It's unclear to me why this would be the case.  It might be that all the versions of dataviewer on the workstations are too old??  I'll look into it tomorrow to see if I can figure out what's going on.

  13166   Fri Aug 4 09:07:28 2017 ranaUpdateCDSCDS system essentially NOT fully recovered

Tried getting trends with dataviewer just now since Jamie re-enabled the minute_raw frame writing yesterday. Unable to get trends still:

Connecting to NDS Server fb1 (TCP port 8088)
Connecting.... done
Server error 18: trend data is not available
datasrv: DataWriteTrend failed in daq_send().
unknown error returned from daq_send()T0=17-08-04-08-02-22; Length=28800 (s)
No data output.

  13185   Thu Aug 10 14:25:52 2017 gautamUpdateCDSSlow EPICS channels -> Frames re-enabled

I went into /opt/rtcds/caltech/c1/target/daqd, opened the master file, and uncommented the line with C0EDCU.ini (this is the file in which all the slow machine channels are defined). So now I am able to access, for example, the c1vac1 channels.

The location of the master file is no longer in /opt/rtcds/caltech/c1/target/fb, but is in the above mentioned directory instead. This is part of the new daqd paradigm in which separate processes are handling the data transfer between FEs and FB, and the actual frame-writing. Jamie will explain this more when he summarizes the CDS revamp.

It looks like trend data is also available for these newly enabled channels, but thus far, I've only checked second trends. I will update with a more exhaustive check later in the evening.

So, the two major pending problems (that I can think of) are:

  1. Inability to unload models cleanly
  2. Inability of dataviewer (and cdsutils) to open testpoints.

Apart from this, dataviewer frequently hangs on Donatella at startup. I used ipcs -a | grep 0x | awk '{printf( "-Q %s ", $1 )}' | xargs ipcrm to remove all the extra messages in the dataviewer queue.


Restarting the daqd processes on fb1 using Jamie's instructions from earlier in this thread works - but the mx_stream processes do not seem to come back automatically on c1lsc, c1sus and c1ioo (reasons unknown). I've made a copy of the mxstreamrestart.sh script with the new mxstream restart commands, called mxstreamrestart_debian.sh, which lives in /opt/rtcds/caltech/c1/scripts/cds. I've also modified the CDS overview MEDM screen such that the "mxstream restart" calls this modified script. For now, this requires you to enter the controls password for each machine. I don't know what is a secure way to do it otherwise, but I recall not having to do this in the past with the old mxstreamrestart.sh script.

  13189   Fri Aug 11 00:10:03 2017 gautamUpdateCDSSlow EPICS channels -> Frames re-enabled

Seems like something has failed after I did this - full frames are no longer on Aug 10 being written since ~2.30pm PDT. I found out when I tried to download some of the free-swinging MC1 data.

To clarify, I logged into fb1, and ran sudo systemctl restart daqd_*. The only change I made was to uncomment the line quoted below in the master file.

Looking at the log using systemctl, I see the following (I just tried restarting the daqd processes again):

Aug 11 00:00:31 fb1 daqd_fw[16149]: LDASUnexpected::unexpected: Caught unexpected exception      "This is a bug. Please log an LDAS problem report including this message.
Aug 11 00:00:31 fb1 daqd_fw[16149]: daqd_fw: LDASUnexpected.cc:131: static void LDASTools::Error::LDASUnexpected::unexpected(): Assertion `false' failed.
Aug 11 00:00:32 fb1 systemd[1]: daqd_fw.service: main process exited, code=killed, status=6/ABRT
Aug 11 00:00:32 fb1 systemd[1]: Unit daqd_fw.service entered failed state.
Aug 11 00:00:32 fb1 systemd[1]: daqd_fw.service holdoff time over, scheduling restart.
Aug 11 00:00:32 fb1 systemd[1]: Stopping Advanced LIGO RTS daqd frame writer...
Aug 11 00:00:32 fb1 systemd[1]: Starting Advanced LIGO RTS daqd frame writer...
Aug 11 00:00:32 fb1 systemd[1]: daqd_fw.service start request repeated too quickly, refusing to start.
Aug 11 00:00:32 fb1 systemd[1]: Failed to start Advanced LIGO RTS daqd frame writer.
Aug 11 00:00:32 fb1 systemd[1]: Unit daqd_fw.service entered failed state.

Oddly, I am able to access second trends for the same channels from the past which will be useful for the MC1 debugging). Not sure whats going on.


The live data grabbing using cdsutils still seems to be working though - so I've kicked MC1 again, and am grabbing 2 hours of data live on Pianosa.

Quote:

I went into /opt/rtcds/caltech/c1/target/daqd, opened the master file, and uncommented the line with C0EDCU.ini (this is the file in which all the slow machine channels are defined). So now I am able to access, for example, the c1vac1 channels.

The location of the master file is no longer in /opt/rtcds/caltech/c1/target/fb, but is in the above mentioned directory instead. This is part of the new daqd paradigm in which separate processes are handling the data transfer between FEs and FB, and the actual frame-writing. Jamie will explain this more when he summarizes the CDS revamp.

It looks like trend data is also available for these newly enabled channels, but thus far, I've only checked second trends. I will update with a more exhaustive check later in the evening.

So, the two major pending problems (that I can think of) are:

  1. Inability to unload models cleanly
  2. Inability of dataviewer (and cdsutils) to open testpoints.

Apart from this, dataviewer frequently hangs on Donatella at startup. I used ipcs -a | grep 0x | awk '{printf( "-Q %s ", $1 )}' | xargs ipcrm to remove all the extra messages in the dataviewer queue.


Restarting the daqd processes on fb1 using Jamie's instructions from earlier in this thread works - but the mx_stream processes do not seem to come back automatically on c1lsc, c1sus and c1ioo (reasons unknown). I've made a copy of the mxstreamrestart.sh script with the new mxstream restart commands, called mxstreamrestart_debian.sh, which lives in /opt/rtcds/caltech/c1/scripts/cds. I've also modified the CDS overview MEDM screen such that the "mxstream restart" calls this modified script. For now, this requires you to enter the controls password for each machine. I don't know what is a secure way to do it otherwise, but I recall not having to do this in the past with the old mxstreamrestart.sh script.

 

ELOG V3.1.3-