ID   Date   Author   Type   Category   Subject
  12700   Tue Jan 10 21:47:00 2017   rana   Update   CDS   power glitch

Does "done" mean they are OK or they are somehow damaged? Do you mean the workstations or the front end machines?

The computers are all done.

megatron and optimus are not responding to ping commands or ssh -- please power them up if they are off; we need them to get data remotely

  12701   Tue Jan 10 22:55:43 2017   gautam   Update   CDS   power glitch - recovery steps

Here is a link to an elog with the steps I had to follow the last time there was a similar power glitch.

The RAID array was also restarted not too long ago; we should do a data consistency check as detailed here, if it hasn't been done already.

If no one has found the time to do this, I can take care of it tomorrow afternoon after I am back.

Quote:

Does "done" mean they are OK or they are somehow damaged? Do you mean the workstations or the front end machines?

The computers are all done.

megatron and optimus are not responding to ping commands or ssh -- please power them up if they are off; we need them to get data remotely

 

  12702   Wed Jan 11 16:35:03 2017   gautam   Update   CDS   power glitch - recovery progress

[lydia, ericq, gautam]

We set about following the instructions linked in the previous elog. A few notes/remarks:

  1. It is important to run the ntpdate commands before restarting the models (a minimal sketch of this sequence follows the list). Sometimes, multiple restarts of the models were required to turn all the indicator blocks on the MEDM screen green.
  2. There was also an issue of multiple ntpd processes running on the same machine, which obviously caused all sorts of timing havoc. EricQ helped us diagnose and fix these. At the moment, all the lights are green on the CDS status MEDM screen.
  3. On the hardware side, apart from the usual suspects of frontends/megatron/optimus/fb needing to be rebooted, I noticed that the ETMX OSEM lights were off on the control room monitors. Investigation pointed to the two 20V Sorensens at the X end outputting 0V, 0A after the power glitch. We turned down both dials and then gradually ramped them up again. Both Sorensens now read +/-20V, 0.3A, in agreement with the labels stuck onto them.
  4. Restarted the MC autolocker and FSS Slow scripts on megatron. I have not yet looked at the status of the nds2 server on megatron.
  5. The 11 MHz Marconi has yet to be restarted - but I am unable to get even the IMC locked at the moment. For some reason, the RMS of the MC1 and MC3 coil signals is way higher than what I am used to seeing for a damped optic (~5 mV rms compared to the usual <1 mV rms). I will investigate further. Leaving the MC autolocker disabled for now.
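For reference, a minimal sketch of the sequence in item 1, assuming passwordless ssh as controls and a reachable NTP server (the server name and the exact host list are placeholders):

# Hedged sketch: sync the front end clocks, then restart the realtime models.
# More than one restart pass may be needed before all indicators go green.
for fe in c1sus c1lsc c1ioo c1iscex c1iscey; do
    ssh controls@$fe "sudo ntpdate -b ntp.example.com && rtcds restart all"
done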
  12708   Thu Jan 12 17:31:51 2017   gautam   Update   CDS   DC errors

The IFO is more or less back to an operational state. Some details:

  1. The IMC mirror excess motion alluded to in the previous elog was due to some timing issues on c1sus. The "DAC" and "DK" blocks in the c1x02 diag word were red instead of green. Restarting all the models on c1sus fixed the problem.
  2. When c1ioo was restarted, all of Koji's (digital) changes to the MC WFS servo were lost, as they had not been committed to the SDF. Eric suggested that I could just restore them from burt snapshots, which is what I did. I used the c1iooepics.snap file from 12:19PM PST on 26 December 2016, which was a time when the WFS servo was working well as per this elog by Koji. I have also committed all the changes to the SDF. IMC alignment has been stable for the last 4 hours.
  3. Johannes aligned and locked the arms today. There was a large DC offset on POX11, which was zeroed out by closing the PSL shutter and running LSC offsets. Both arms lock and stay aligned now.
  4. The doubling oven controller at the Y end was switched off. Johannes turned it on.
  5. Eric and I started a data consistency check on the RAID array yesterday; it completed today and indicated no issues.
  6. NDS2 is now running again on megatron, so channel access from outside should (???) be possible again.

One error persists - the "DC" indicator (data concentrator?) on the CDS MEDM screen for the various models often spontaneously goes red and then returns to green. Is this a known issue with an easy fix?

  12715   Fri Jan 13 21:41:23 2017   Koji   Update   CDS   DC errors

I think I fixed the DC error issue.

1. I added the leap second entry for 2016/12/31, 23:60:00 UTC to daqdrc:


[OLD]
set gps_leaps = 820108813 914803214 1119744016;
[NEW]
set gps_leaps = 820108813 914803214 1119744016 1167264018;

2. Restarted FB and all realtime models

Now I don't see any RED light.

  12727   Tue Jan 17 20:47:23 2017   rana   Update   CDS   Simulink Webview updated

Seems like this stops working every ~2 years. It's been busted since early 2016 according to cron, so I fixed up the paths, restored some missing files, and committed things to the SVN (with comments!), and now it's working and grabbing the web-viewable versions of the front end models. Just need to restore its viewability and then the world can watch our models any time.

Quote:

Back in 2011, JoeB wrote some entries on how to automatically update the Simulink webview stuff.

Somehow, the cron broke down over the years. I reran the matlab file by hand today and it worked fine, so now you can see the up to date models using the internet.

https://nodus.ligo.caltech.edu:30889/FE/

 

  12754   Wed Jan 25 14:30:20 2017   gautam   Update   CDS   slow machine bootfest

[gautam, lydia]

We rebooted c1psl, c1iscaux and c1aux, which were all showing the typical symptom of responding to ping but not to telnet (and also blanked-out EPICS fields on the MEDM screens). Keyed all these crates.

Restored burt snapshots for c1psl; the PMC locked fine, and the IMC is also locked now.
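For the record, a hedged sketch of the kind of burt restore used for c1psl (the snapshot path below is a placeholder for whichever autoburt snapshot is actually chosen):

# Hedged sketch: write EPICS settings back from an autoburt snapshot file.
# The date/time directory is illustrative, not the one actually used.
burtwb -f /opt/rtcds/caltech/c1/burt/autoburt/snapshots/2017/Jan/25/08:07/c1pslepics.snap -l /tmp/c1psl_restore.log -v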

Johannes forgot to elog this yesterday, but he rebooted c1susaux following the usual procedure to avoid getting ITMX stuck. 

 

  12762   Fri Jan 27 17:07:52 2017   Lydia   Update   CDS   slow machine bootfest

Rebooted c1iscaux, c1auxex and c1auxey, which were all not responding to telnet. The watchdogs for the ETMs were turned off and then I keyed all 3 crates. All slow machines are responding to telnet now. Both green lasers locked to the arms, so I didn't do any burt restore.

  12763   Fri Jan 27 17:49:41 2017   jamie   Update   CDS   test of new daqd code on fb1

Just FYI I'm running a test of updated daqd code on fb1. 

fb1 has its own fiber to the daq network switch, so nothing had to be modified to do this test. This *should* not affect anything in the rest of the system, but as we all know these are famous last words....  If something is going haywire, and you can't get in touch with me and can't figure out what else to do, you can just log on to fb1 and shut it down.  It's not writing any data to any of the network filesystems.

The daqd code under test is from the latest advLigoRTS 3.2.1 tag, which has daqd stability fixes that will hopefully address the problems we were seeing last time I tried this upgrade.  We'll see...

I'm going to let it run over the weekend, and will check in periodically.

  12765   Fri Jan 27 20:52:36 2017   gautam   Update   CDS   test of new daqd code on fb1

I'm not sure if this is related, but since this morning I've noticed that the data concentrator errors have returned. Looking at daqd.log, a 1 second timing mismatch error is being generated. Usually, manually running ntpdate on the front ends fixes this problem, but it did not work today.

  12766   Fri Jan 27 21:21:35 2017   gautam   Update   CDS   c1pem revamped

The coil and PD BLRMS are useful tools for identifying when glitches occur in the PD readout, so I thought it would be good to install them for ITMY, ETMX and SRM (since I plan to switch the MC3 satellite box, which we suspect to be problematic, with the SRM one). For this purpose, I had to install some IPC SHMEM blocks in C1SUS and recompile. 24 IPC channels were added to pipe the coil, PD and Oplev signals from C1SUS to C1PEM - the recompilation went smoothly, and it doesn't look like the model computation time has increased significantly or that the model is any closer to timing out.

However, I was unable to install the BLRMS blocks in C1PEM: when I tried to compile the model with BLRMS for these extra 24 channels, I got a compilation error saying that I had exceeded the maximum allowed 499 testpoints. Is there any workaround to this? It would be possible to create a custom BLRMS block that doesn't have all those testpoints - maybe this is the way to go, especially if we want to install these channels for all our SOS optics, and also replace the current Seismic BLRMS with this scheme for consistency?

GV edit: I have implemented this scheme - after backing up the original BLRMS_2k part, I made a new one with no testpoints and only EPICS readouts. Doing so allowed me to recompile c1pem without any issues; the CPU time seems to have gone up by ~3us, from ~55us to ~58us. So the BLRMS data record is only available at 16Hz, since there are no DQ channels in the BLRMS block - do we want these in any case? Let's see how this does over the weekend...

  12769   Sat Jan 28 12:05:57 2017   jamie   Update   CDS   test of new daqd code on fb1
Quote:

I'm not sure if this is related, but since today morning, I've noticed that the data concentrator errors have returned. Looking at daqd.log, there is a 1 second timing mismatch error that is being generated. Usually, manually running ntpdate on the front ends fixes this problem, but it did not work today.

If this problem started before ~4pm on Friday then it's probably unrelated, since I didn't start any of these tests until after that.  If unexplained problems persist then we can try shutting off the fb1 daqd and see if that helps.

  12770   Mon Jan 30 18:41:41 2017   jamie   Update   CDS   TEST ABORTED of new daqd code on fb1

I just aborted the fb1 test and reverted everything to the nominal configuration.  Everything looks to be operating nominally.  Front ends are mostly green except for c1rfm and c1asx which are currently not being acquired by the DAQ, and an unknown IPC error with c1daf.  Please let me know if any unusual problems are encountered.

The behavior of daqd on fb1 with the latest release (3.2.1) was not improved.  After turning on the full pipe it was back to crashing every 10 minutes or so when the full and second trend frames were being written out.  lame.  back to the drawing board...

  12776   Tue Jan 31 15:08:13 2017   ericq   Metaphysics   CDS   Minute Trend Koan

A novice was learning at the feet of Master Daqd. At the end of the lesson he looked through his notes and said, “Master, I have a few questions. May I ask them?”

Master Daqd nodded.

"Do we record minute trends of our data?"

"Yes, we record raw minute trends in /frames/trend/minute_raw"

"I see. Do we back up minute trends?"

"Yes, we back up all frames present in /frames/trend/minute"

"Wait, this means we are not recording our current trends! What is the reason for the existence of separate minute and minute_raw trends?"

“The knowledge you seek can be answered only by the gods.”

"Can we resume recording the minute trends?"

Master Daqd nodded, turned, and threw himself off the railing, falling to his death on the rocks below.

Upon seeing this, the novice was enlightened. He proceeded to investigate how to convert raw minute trends to minute trends so that historical records could be preserved, and precisely when Master Daqd started throwing himself off the mountain when asked to record minute trends.

  12777   Tue Jan 31 17:28:36 2017   rana   Summary   CDS   Minute Trend Koan

Someone installed "Debian" on allegra. Why? Dataviewer doesn't work on there. Is there some advantage to making this thing have a different OS than the others? Any objections to going back to Ubuntu12?

  12779   Tue Jan 31 20:25:26 2017   ericq   Summary   CDS   Minute Trend Koan
Quote:

Someone installed "Debian" on allegra. Why? Dataviewer doesn't work on there. Is there some advantage to making this thing have a different OS than the others? Any objections to going back to Ubuntu12?

My elog negligence punchcard is getting pretty full... It's pretty much for the same reason as using Debian for optimus; much of the workstation software is getting packaged for Debian, which could offload our need for setting things up in a custom 40m way. Hacking the debian-focused software.ligo.org repos into Ubuntu has caused me headaches in the past. Allegra wasn't being used often, so I figured it was a good test bed for trying things out.

The dataviewer issue was dataviewer's inability to pull the `fb` out of `fb:8088` in the NDSSERVER env variable. I made a quick fix for it in the dataviewer launching script, but there is probably a better way to do it.
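For the record, a hedged sketch of the kind of wrapper fix described here, assuming the launching script is a shell script (the real fix may differ):

# Hedged sketch: strip a ":port" suffix from NDSSERVER before launching dataviewer,
# since this build only accepts a bare hostname.
if [ -n "$NDSSERVER" ]; then
    export NDSSERVER="${NDSSERVER%%:*}"   # e.g. "fb:8088" -> "fb"
fi
exec dataviewer "$@"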

  12781   Tue Jan 31 22:15:02 2017   Johannes   Update   CDS   vme crate backplane adapter boards

I made a crude sketch of how Lydia and I envision solving the connector situation on the back of the vme crates. Essentially, the side panels of each crate extend about 2" (52 mm) beyond the edge of the DIN connectors, which is plenty of space for a simple PCB board. The connector of choice is D-Sub. We can split the 64 used pins into 2x 37-pin D-Subs OR (2x 25-pin + 1x 15-pin). The former has fewer cables, but a few excess unused leads. A quick google search showed me that it is much cheaper to get twisted pair cables for 15 and 25 pin D-Subs. From what I remember, the used pins on the DIN connectors are concentrated at the low-numbers end and the high-numbers end, so we might not need the 'middle' connector in many cases if we decide to break it up into three. I have to check this with Lydia though.

The D-Sub connectors would be panel mounted, for which we need a narrow panel piece with dsub cutouts. We can run horizontal struts across the vme crate from side panel to side panel. This way the force upon cable (dis)connection is mostly on the panel which is attached to the struts which are attached to the crate. This will also prevent gravitational sag or cable strain from pulling on the DIN connection, and we can use twisted pair cables with backshell, screws, and strain reliefs.

I was looking into getting started with the PCB when Altium complained that the license is expired and needs to be renewed. This is a relatively simple board layout, so some free software out there is probably enough.

  12791   Thu Feb 2 18:28:29 2017   rana   Summary   CDS   Minute Trend Koan

and the song remains the same...

The version of SVN on these workstations is ahead of the one on the other workstations, so now we can't do 'svn up' on any of the Ubuntu12 machines. On allegra and optimus I get this error:

controls@allegra|GWsummaries> svn up
Updating '.':
svn: E180001: Unable to connect to a repository at URL 'file:///cvs/cds/caltech/svn/trunk/GWsummaries'
svn: E180001: Unable to open an ra_local session to URL
svn: E180001: Unable to open repository 'file:///cvs/cds/caltech/svn/trunk/GWsummaries'

Quote:
Quote:

Someone installed "Debian" on allegra. Why? Dataviewer doesn't work on there. Is there some advantage to making this thing have a different OS than the others? Any objections to going back to Ubuntu12?

My elog negligence punchcard is getting pretty full... It's pretty much for the same reason as using Debian for optimus; much of the workstation software is getting packaged for Debian, which could offload our need for setting things up in a custom 40m way. Hacking the debian-focused software.ligo.org repos into Ubuntu has caused me headaches in the past. Allegra wasn't being used often, so I figured it was a good test bed for trying things out.

The dataviewer issue was dataviewer's inability to pull the `fb` out of `fb:8088` in the NDSSERVER env variable. I made a quick fix for it in the dataviewer launching script, but there is probably a better way to do it.

I'm not sure if it's possible to downgrade our chans repo back to the old one, but I highly recommend that no one do 'svn upgrade' in any of our repos until we remove all of the Debian installs in the 40m lab or hire a full-time sysadmin.

  12794   Fri Feb 3 11:03:06 2017   jamie   Update   CDS   more testing fb1; DAQ DOWN DURING TEST

More testing of fb1 today.  DAQ DOWN UNTIL FURTHER NOTICE.

Testing Wednesday did not resolve anything, but Jonathan Hanks is helping.

  12796   Fri Feb 3 11:40:34 2017   ericq   Summary   CDS   /cvs/cds/caltech/chans back on svn1.6

I was able to bring back svn 1.6 formatting to /cvs/cds/caltech/chans by doing the following on nodus:

cd /cvs/cds/caltech
mkdir newchans
cd newchans
# fresh checkout, created with the (older) svn client on nodus
svn co https://nodus.ligo.caltech.edu:30889/svn/trunk/chans ./
# discard the upgraded working-copy metadata from the old checkout...
rm -rf ../chans/.svn
# ...and drop in the svn 1.6 format metadata from the fresh checkout
mv ./.svn ../chans/

Note that I used the http address for the repository. The svn repository doesn't live at file:///cvs/cds/caltech/svn anymore; all of our checkouts (e.g. in the scripts directory) use http to get the one true repo location, regardless of where it lives on nodus' filesystem. (I suppose we could also use https://nodus.martian:30889/svn to stick to the local network, but I don't think we're that limited by the caltech network speed)

Presumably, at some point we will want to introduce a newer operating system into the 40m, as ubuntu 12.04 hits end-of-life in April 2017. Ubuntu 16.04 includes svn 1.8, so we'll also hit this issue if we choose that OS. 


Aside from the svn issues, this directory (/cvs/cds/caltech/chans) only contains pre-2010 channels. Filters and DAQ ini files currently live in /opt/rtcds/caltech/c1/chans, which is not under version control. It's also not clear to me why summary page configurations should be kept in this /cvs/cds place.

  12797   Sat Feb 4 12:00:59 2017   rana   Summary   CDS   /cvs/cds/caltech/chans back on svn1.6

True - it's an issue. Koji and I are updating zita to Ubuntu16 LTS. If it looks like it's OK with various tools, we'll swap the others over to it. Until then I figure we're best off turning allegra back into Ubuntu12 to avoid a repeat of this kind of conflict. Once the workstations in the LLO control room are running smoothly on a new OS for a year, we can transfer into that. I don't think any of us wants to be the CDS beta tester for DV or DTT.

  12798   Sat Feb 4 12:20:39 2017   jamie   Summary   CDS   /cvs/cds/caltech/chans back on svn1.6
Quote:

True - its an issue. Koji and I are updating zita into Ubuntu16 LTS. If it looks like its OK with various tools we'll swap over the others into it. Until then I figure we're best off turning allegra back into Ubuntu12 to avoid a repeat of this kind of conflict. Once the workstations in the LLO control room are running smoothly on a new OS for a year, we can transfer into that. I don't think any of us wants to be the CDS beta tester for DV or DTT.

Just to be clear, since there seems to be some confusion, the SVN issue has nothing to do with Debian vs. Ubuntu.  SVN made non-backwards compatible changes to their working copy data format that breaks newer checkouts with older clients.  You will run into the exact same problem with newer Ubuntu versions.

I recommend the 40m start moving towards the reference operating systems (Debian 8 or SL7) as that's where CDS is moving.  By moving to newer Ubuntu versions you're moving away from CDS support, not towards it.

  12799   Sat Feb 4 12:29:20 2017   rana   Summary   CDS   /cvs/cds/caltech/chans back on svn1.6

No, not confused on that point. We just will not be testing OS versions at the 40m or running multiple OS's on our workstations. As I've said before, we will only move to so-called 'reference' systems once they've been in use for a long time.

Quote:
Quote:

True - its an issue. Koji and I are updating zita into Ubuntu16 LTS. If it looks like its OK with various tools we'll swap over the others into it. Until then I figure we're best off turning allegra back into Ubuntu12 to avoid a repeat of this kind of conflict. Once the workstations in the LLO control room are running smoothly on a new OS for a year, we can transfer into that. I don't think any of us wants to be the CDS beta tester for DV or DTT.

Just to be clear, since there seems to be some confusion, the SVN issue has nothing to do with Debian vs. Ubuntu.  SVN made non-backwards compatible changes to their working copy data format that breaks newer checkouts with older clients.  You will run into the exact same problem with newer Ubuntu versions.

I recommend the 40m start moving towards the reference operating systems (Debian 8 or SL7) as that's where CDS is moving.  By moving to newer Ubuntu versions you're moving away from CDS support, not towards it.

 

  12800   Sat Feb 4 12:50:01 2017   jamie   Summary   CDS   /cvs/cds/caltech/chans back on svn1.6
Quote:

No, not confused on that point. We just will not be testing OS versions at the 40m or running multiple OS's on our workstations. As I've said before, we will only move to so-called 'reference' systems once they've been in use for a long time.

Ubuntu16 is not to my knowledge used for any CDS system anywhere.  I'm not sure how you expect to have better support for that.  There are no pre-compiled packages of any kind available for Ubuntu16.

  12803   Mon Feb 6 15:18:08 2017   gautam   Update   CDS   slow machine bootfest

Had to reboot c1psl, c1susaux, c1auxex, c1auxey and c1iscaux today. The PMC has been relocked, and ITMX didn't get stuck. According to this thread, there have been two instances in the last 10 days in which c1psl and c1susaux have failed. Since we seem to be doing this often lately, I've made a little script that uses the netcat utility to check which slow machines respond to telnet; it is located at /opt/rtcds/caltech/c1/scripts/cds/testSlowMachines.bash.

The script can be executed by ./testSlowMachines.bash.
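The real script is the one at the path above; a minimal sketch of the same idea, assuming nc (netcat) is installed and using an illustrative host list, would be:

#!/bin/bash
# Hedged sketch: check which slow (VME) machines accept a TCP connection on the
# telnet port (23). Hosts listed here are examples -- edit to match the lab.
for host in c1psl c1susaux c1auxex c1auxey c1iscaux c1aux c1iool0; do
    if nc -z -w 2 "$host" 23 >/dev/null 2>&1; then
        echo "$host : telnet port open"
    else
        echo "$host : NOT responding -- probably needs to be keyed"
    fi
done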

  12810   Tue Feb 7 19:14:59 2017   Johannes   Update   CDS   vme crate backplane adapter board layout

After fighting with Altium for what seems like an eternity I have finished putting my vision of the vme crate backplane adapter board into an electronic format. It is dimensioned to fill the back space of the crate exactly. The connectors are panel mount and the PCB attaches to the connectors with screws, such that the whole thing will be mechanically much more stable than the current configuration. A mounting bracket will attach to horizontal struts that need to be installed in the crates, mechanical drawings to follow.

  12893   Mon Mar 20 11:18:58 2017   gautam   Update   CDS   No internet connectivity on control room machines

There is no internet connectivity on any of the control room machines. 

I have been trying to debug by tracing the cabling situation in the rack in the office area, and will update if/when this problem has been resolved. I had last come into the lab on Saturday and there was no problem then. The 40m wireless network servicing the office area seems to work fine.

 

  12894   Mon Mar 20 14:39:44 2017   gautam   Update   CDS   No internet connectivity on control room machines

Koji diagnosed that the NAT router was to blame for this problem. I simply power cycled this router, and now the connectivity has been restored. 

It was possible to log into nodus and then into pianosa - and it was also possible to log into the various control room machines once logged into nodus. However, outward packets seemed to not get transmitted. Anyway, power cycling the NAT router unit seems to have done the job.

Quote:

There is no internet connectivity on any of the control room machines. 

I have been trying to debug by tracing the cabling situation in the rack in the office area, and will update if/when this problem has been resolved. I had last come into the lab on Saturday and there was no problem then. There 40m wireless network servicing the office area seems to work fine.

 

 

  12914   Tue Mar 28 21:06:53 2017   rana   Summary   CDS   /cvs/cds/caltech/chans back on svn1.6

Debian doesn't like EPICS. Or our XY plots of beam spots...Sad!

Quote:
Quote:

No, not confused on that point. We just will not be testing OS versions at the 40m or running multiple OS's on our workstations. As I've said before, we will only move to so-called 'reference' systems once they've been in use for a long time.

Ubuntu16 is not to my knowledge used for any CDS system anywhere.  I'm not sure how you expect to have better support for that.  There are no pre-compiled packages of any kind available for Ubuntu16.

  12924   Mon Apr 3 17:09:47 2017   gautam   Update   CDS   C1PSL burt-restored

When I came in this morning, Steve had re-locked the PMC and IMC - but I could see a ~1Hz intensity fluctuation on the PMC REFL video monitor. I unlocked the PMC and tried to re-lock it, but couldn't using the usual prescription of turning the servo gain down and moving the DC bias slider around. I checked the status of the slow machines - all were responding to pings and could be telnet'ed into, so that didn't seem to be the problem. In the past, this sort of behaviour was characteristic of the infamous "sticky slider" problem - so I simply burt-restored c1psl using a snapshot from 29 March, after which I could easily re-lock the PMC. The transmitted light level looked normal on the scope on the PSL table, and the PMC REFL video monitor also looks normal now.

  12980   Wed May 10 12:37:41 2017   gautam   Update   CDS   MCautolocker dead

The MCautolocker had stalled - there were no additional lines to the logfile after 12:17pm (~20mins ago). Normally, it suffices to ssh into megatron and run sudo initctl restart MCautolocker - but it seems that there was no running initctl instance of this, so I had to run sudo initctl start MCautolocker. The FSS Slow control initctl process also seemed to have been terminated, so I ran sudo initctl start FSSslowPy.

It is not clear to me why the initctl instances got killed in the first place, but MC locks fine now.
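A hedged sketch of the check-then-start logic used above, assuming the Upstart job names MCautolocker and FSSslowPy from this entry:

# Hedged sketch: on megatron, restart each Upstart job if it is running,
# otherwise start it fresh.
for job in MCautolocker FSSslowPy; do
    if sudo initctl status "$job" 2>/dev/null | grep -q running; then
        sudo initctl restart "$job"
    else
        sudo initctl start "$job"
    fi
done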

  12982   Wed May 10 16:57:52 2017   rana   Update   CDS   MCautolocker dead

I rebooted megatron around 12:20 today. It had dozens of stalled medm processes (some of them there since February!). I couldn't kill them without them coming back like zombies, so I did sudo reboot.
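Before resorting to a reboot, something along these lines (a hedged sketch, not what was actually run) can help identify and clear the stale medm processes:

# Hedged sketch: list medm processes with their start times...
ps -eo pid,lstart,etime,cmd | grep '[m]edm'
# ...and, if they are safe to kill, remove them forcibly:
pkill -9 -x medm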

  12991   Mon May 15 08:26:43 2017   rana   Update   CDS   SVN up in userapps/cds

I did an 'svn update' in userapps/cds/ which pulled in some changes from the sites as well as various CDS utilities in common/ and utilities/

This was to get Keith Thorne's get_data.m and get_data2.m scripts which I tested and they seem to be able to get data. No success with getting minute trend yet, but that may be a user error.

Update Monday 15-May: Our version of the NDS client is 0.10 and we need 0.14 for this new method to work. The Ubuntu12 lscsoft repo doesn't have a newer NDS client, so we'll have to upgrade some OS.

  13012   Thu May 25 12:22:59 2017   gautam   Update   CDS   slow machine bootfest

After ~3months without any problems on the slow machine front, I had to reboot c1psl, c1susaux and c1iscaux today. The control room StripTool traces were not being displayed for all the PSL channels so I ran testSlowMachines.bash to check the status of the slow machines, which indicated that these three slow machines were dead. After rebooting the slow machines, I had to burt-restore the c1psl snapshot as usual to get the PMC to lock. Now, both PMC and IMC are locked. I also had to restart the StripTool traces (using scripts/general/startStrip.sh) to get the unresponsive traces back online.

Steve tells me that we probably have to reboot the vacuum slow machines sometime soon too, as the MEDM screens for the vacuum indicator channels are unresponsive.

Quote:

Had to reboot c1psl, c1susaux, c1auxex, c1auxey and c1iscaux today. PMC has been relocked. ITMX didn't get stuck. According to this thread, there have been two instances in the last 10 days in which c1psl and c1susaux have failed. Since we seem to be doing this often lately, I've made a little script that uses the netcat utility to check which slow machines respond to telnet, it is located at /opt/rtcds/caltech/c1/scripts/cds/testSlowMachines.bash.

 

 

  13028   Thu Jun 1 15:37:01 2017   gautam   Update   CDS   slow machine bootfest

Steve alerted me that the IMC wouldn't lock. Reboots for c1susaux, c1iool0 today. I tried using the reset button instead of keying the crates. This worked for c1iool0, but not for c1susaux. So I had to key the latter crate. The machine took a good 5-10 minutes before coming back up, but eventually it did. Now IMC locks fine.

  13059   Mon Jun 12 10:34:10 2017   gautam   Update   CDS   slow machine bootfest

Reboots for c1susaux, c1iscaux and c1auxex today. I took this opportunity to squish the Sat. Box cabling for MC2 (both on the Sat. Box end and at the vacuum feedthrough), as some work has been going on there recently - maybe something got accidentally jiggled in the process and was causing the MC2 alignment to jump around.

Relocked PMC to offload some of the DC offset, and re-aligned IMC after c1susaux reboot. PMC and IMC transmission back to nominal levels now. Let's see if MC2 is better behaved after this sat. box. voodoo.

Interestingly, since Feb 6, there were no slow machine reboots for almost 3 months, while there have been three reboots in the last three weeks. Not sure what (if anything) to make of that.

  13069   Fri Jun 16 13:53:11 2017   gautam   Update   CDS   slow machine bootfest

Reboots for c1psl, c1iool0, c1iscaux today. MC autolocker log was complaining that the C1:IOO-MC_AUTOLOCK_BEAT EPICS channel did not exist, and running the usual slow machine check script revealed that these three machines required reboots. PMC was relocked, IMC Autolocker was restarted on Megatron and everything seems fine now.

 

  13096   Wed Jul 5 16:09:34 2017   gautam   Update   CDS   slow machine bootfest

Reboots for c1susaux, c1iscaux today.

 

  13125   Wed Jul 19 08:37:21 2017   Jamie   Update   CDS   Update on front-end/DAQ rebuild

After the catastrophic fb disk failure last week we lost essentially the entire front end system (not any of the userapp code, but the front end boot server, operating system, and DAQ).  The fb disk was entirely unrecoverable, so we've been trying to rebuild everything from the bits and pieces lying around, and some disks that Keith Thorne sent from LLO.  We're trying to get the front ends working first, and will work on recovering daqd after.

Luckily, fb1, which was being configured as an fb replacement, is mostly fully configured, including having a copy of the front end diskless root image.  We set up fb1 as the new boot server, and were able to get front ends booting again.  Unfortunately, we've been having trouble running and building models, so something is still amiss.  We've been taking a three-pronged approach to getting the front ends running:

  • /diskless/root.fb: This involves booting the front ends from the backup of the diskless root from fb.  Runs gentoo kernel 2.6.34.1.  This should correspond to the environment that all models were built and running against.  But something is missing in the configuration.  The front ends were also mounting /opt from fb, which included the dolphin drivers, and we don't have a copy of that, so models aren't loading or recompiling.
  • /diskless/root.x1boot: Keith sent a disk image of the entire x1boot server from LLO.  It uses gentoo kernel 3.0.8.  This ostensibly includes everything we should need to run the front ends, but it's unfortunately configured with newer versions of some of the software and also isn't loading our existing models or building new ones.  This also seems to be having issues with the dolphin drivers.
  • /diskless/root.jessie: This is an entirely new boot image build from scratch with Debian jessie, using an RTS-patched 3.2 kernel.  This would use the latest versions of everything.  It's mostly working, we just need to rebuild the dolphin driver and source.

It seems that in all cases we need to rebuild the dolphin drivers from source.

  13127   Wed Jul 19 14:26:50 2017   Jamie   Update   CDS   Update on front-end/DAQ rebuild

 

Quote:

After the catastrophic fb disk failure last week we lost essentially the entire front end system (not any of the userapp code, but the front end boot server, operating system, and DAQ).  The fb disk was entirely unrecoverable, so we've been trying to rebuild everything from the bits and pieces lying around, and some disks that Keith Thorne sent from LLO.  We're trying to get the front ends working first, and will work on recovering daqd after.

Luckily, fb1, which was being configured as an fb replacement, is mostly fully configured, including having a copy of the front end diskless root image.  We setup fb1 as the new boot server, and were able to get front ends booting again.  Unfortunately, we've been having trouble running and building models, so something is still amis.  We've been taking a three-pronged approach to getting the front ends running:

  • /diskless/root.fb: This involves booting the front ends from the backup of the diskless root from fb.  Runs gentoo kernel 2.6.34.1.  This should correspond to the environment that all models were built and running against.  But something is missing in the configuration.  The front ends were also mounting /opt from fb, which included the dolphin drivers, and we don't have a copy of that, so models aren't loading or recompiling.
  • /diskless/root.x1boot: Keith sent a disk image of the entire x1boot server from LLO.  It uses gentoo kernel 3.0.8.  This ostensibly includes everything we should need to run the front ends, but it's unfortunately configured with newer versions of some of the software and also isn't loading our existing models or building new ones.  This also seems to be having issues with the dolphin drivers.
  • /diskless/root.jessie: This is an entirely new boot image build from scratch with Debian jessie, using an RTS-patched 3.2 kernel.  This would use the latest versions of everything.  It's mostly working, we just need to rebuild the dolphin driver and source.

It seems that in all cases we need to rebuild the dolphin drivers from source.

To clarify, we're able to boot the x1boot image with the existing 2.6.25 kernel that we have from fb.  The issue with the root.x1boot image is not the kernel version but some of the other support libraries, such as dolphin.

  13130   Fri Jul 21 18:03:17 2017   Jamie   Update   CDS   Update on front-end/DAQ rebuild

Update:

  • front ends booting with the new Debian jessie diskless root image and a linux 3.2 version of the RTS-patched kernel
  • dolphin is configured correctly and running on c1lsc and c1sus
  • models building and running with RCG 3.0.3

Up next:

  • add c1ioo to the dolphin network
  • recompile/restart all front end models
  • daqd

I'll try to get the first two of those done tomorrow, although it's unclear what model updates we'll have to do to get things working with the newer RCG.

 

  13133   Sun Jul 23 22:16:55 2017   Jamie, gautam   Update   CDS   front-end now running with new OS, RCG

All front ends and models are (mostly) running now.

All suspensions are damped.

It should be possible at this point to do more recovery, like locking the MC.

Some details on the restore process:

  • all models were recompiled with the new RCG version 3.0.3
  • the new RCG does stricter simulink drawing checks, and was complaining about unterminated outputs in some of the SUS models.  Terminated all outputs it was concerned about and saved.
  • RCG 3.0 requires a new directory for doing better filter module diagnostics: /opt/rtcds/caltech/c1/chans/tmp
  • had to reset the slow machines c1susaux, c1auxex, c1auxey

The daqd is not yet running.  This is the next task.

I have been taking copious notes and will fully document the restore process once complete.

c1ioo issues

c1ioo has been giving us a little bit of trouble.  The c1ioo model kept crashing and taking down the whole c1ioo host.  We found a red light on one of the ADCs (ADC1).  We pulled the card and replaced it with a spare from the CDS cabinet.  That seemed to fix the problem and c1ioo became more stable.

We've still been seeing a lot of glitching in c1ioo, though, with CPU cycle times frequently (every couple of seconds) running above threshold for all models, up to 200 us.  I tried unloading every kernel module I could and shutting down every non-critical process, but nothing seemed to help.

We eventually tried stopping the c1ioo model altogether and that seemed to help quite a bit, dropping the long cycle rate down to something like one every 30 seconds or so.  Not sure what that means.  We should look into the BIOS again, to see if there could be something interacting with the newer kernel.

So currently the c1ioo model is not running (which is why it's all white in the CDS overview snapshot above). The fact that c1ioo is not running, and that the remaining models are still occasionally glitching, is also causing various IPC errors on auxiliary models (see c1mcs, c1rfm, c1ass, c1asx).

RCG compile warnings

the new RCG tries to do more checks on custom c code, but it seems to be having trouble finding our custom "ccodeio.h" files that live with the c definitions in USERAPPS/*/common/src/.  Unclear why yet.  This is causing the RCG to spit out warnings like the following:

Cannot verify the number of ins/outs for C function BLRMS.
    File is /opt/rtcds/userapps/release/cds/c1/src/BLRMSFILTER.c
    Please add file and function to CDS_SRC or CDS_IFO_SRC ccodeio.h file.

These are just warnings and will not prevent the model from compiling or running.  We'll figure out what the problem is to make these go away, but they can be ignored for the time being.

model unload instability

Probably the worst problem we're facing right now is an instability that will occasionally, but not always, cause the entire front end host to freeze up upon unloading an RTS kernel module.  This is a known issue with the newer linux kernels (we're using kernel version 3.2.35), and is being looked into.

This is particularly annoying with the machines on the dolphin network, since if one of the dolphin hosts goes down it manages to crash all the models reading from the dolphin network.  Since half the time they can't be cleanly restarted, this tends to cause a boot fest with c1sus, c1lsc, and c1ioo.  If this happens, just restart those machines, wait till they've all fully booted, then restart all the models on all hosts with "rtcds start all" (see the sketch below).
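A hedged sketch of that recovery loop, using the host names above (the exact rtcds invocation may differ per host):

# Hedged sketch: once the dolphin hosts have fully rebooted, bring the
# realtime models back up on each front end in turn.
for fe in c1sus c1lsc c1ioo; do
    ssh controls@$fe "rtcds start all"
done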

  13135   Mon Jul 24 10:45:23 2017   gautam   Update   CDS   c1iscex models died

This morning, all the c1iscex models were dead. Attachment #1 shows the state of the cds overview screen when I came in. The machine itself was ssh-able, so I just restarted all the models and they came back online without fuss.

Quote:

All front ends and model are (mostly) running now

  13136   Mon Jul 24 10:59:08 2017   Jamie   Update   CDS   c1iscex models died
Quote:

This morning, all the c1iscex models were dead. Attachment #1 shows the state of the cds overview screen when I came in. The machine itself was ssh-able, so I just restarted all the models and they came back online without fuss.

This was me.  I had rebooted that machine and hadn't restarted the models.  Sorry for the confusion.

  13138   Mon Jul 24 19:28:55 2017   Jamie   Update   CDS   front end MX stream network working, glitches in c1ioo fixed

MX/OpenMX network running

Today I got the mx/open-mx networking working for the front ends.  This required some tweaking to the network interface configuration for the diskless front ends, and recompiling mx and open-mx for the newer kernel.  Again, this will all be documented.

controls@fb1:~ 0$ /opt/mx/bin/mx_info
MX Version: 1.2.16
MX Build: root@fb1:/opt/src/mx-1.2.16 Mon Jul 24 11:33:57 PDT 2017
1 Myrinet board installed.
The MX driver is configured to support a maximum of:
    8 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
===================================================================
Instance #0:  364.4 MHz LANai, PCI-E x8, 2 MB SRAM, on NUMA node 0
    Status:        Running, P0: Link Up
    Network:    Ethernet 10G

    MAC Address:    00:60:dd:43:74:62
    Product code:    10G-PCIE-8B-S
    Part number:    09-04228
    Serial number:    485052
    Mapper:        00:60:dd:43:74:62, version = 0x00000000, configured
    Mapped hosts:    6

                                                        ROUTE COUNT
INDEX    MAC ADDRESS     HOST NAME                        P0
-----    -----------     ---------                        ---
   0) 00:60:dd:43:74:62 fb1:0                             1,0
   1) 00:30:48:be:11:5d c1iscex:0                         1,0
   2) 00:30:48:bf:69:4f c1lsc:0                           1,0
   3) 00:25:90:0d:75:bb c1sus:0                           1,0
   4) 00:30:48:d6:11:17 c1iscey:0                         1,0
   5) 00:14:4f:40:64:25 c1ioo:0                           1,0
controls@fb1:~ 0$

c1ioo timing glitches fixed

I also checked the BIOS on c1ioo and found that the serial port was enabled, which is known to cause timing glitches.  I turned off the serial port (and some power management stuff), and rebooted, and all the c1ioo timing glitches seem to have gone away.

It's unclear why this is a problem that's just showing up now.  Serial ports have always been a problem, so it seems unlikely this is just a problem with the newer kernel.  Could the BIOS have somehow been reset during the power glitch?

In any event, all the front ends are now booting cleanly, with all dolphin and mx networking coming up automatically, and all models running stably:

Now for daqd...

  13139   Mon Jul 24 19:57:54 2017   gautam   Update   CDS   IMC locked, Autolocker re-enabled

Now that all the front end models are running, I re-aligned the IMC, locked it manually, and then tweaked the alignment some more. The IMC transmission now is hovering around 15300 counts. I re-enabled the Autolocker and FSS Slow loops on Megatron as well.

Quote:

MX/OpenMX network running

Today I got the mx/open-mx networking working for the front ends.  This required some tweaking to the network interface configuration for the diskless front ends, and recompiling mx and open-mx for the newer kernel.  Again, this will all be documented.

 

  13145   Wed Jul 26 19:13:07 2017   Jamie   Update   CDS   daqd showing same instability as before

I recompiled daqd on the updated fb1, similar to how I had before, and we're seeing the same instability: the process crashes when it tries to write out the second trend (technically, it looks like it crashes while trying to write out the full frame while the second trend is also being written out).  Jonathan Hanks and I are actively looking into it and I'll provide a further report soon.

  13149   Fri Jul 28 20:22:41 2017   Jamie   Update   CDS   possible stable daqd configuration with separate DC and FW

This week Jonathan Hanks and I have been trying to diagnose why the daqd has been unstable in the configuration used by the 40m, with data concentrator (dc) and frame writer (fw) in the same process (referred to generically as 'fb').  Jonathan has been digging into the core dumps and source to try to figure out what's going on, but he hasn't come up with anything concrete yet.

As an alternative, we've started experimenting with a daqd configuration with the dc and fw components running in separate processes, with communication over the local loopback interface.  The separate dc/fw process model more closely matches the configuration at the sites, although the sites put the dc and fw processes on different physical machines.  Our experimentation thus far seems to indicate that this configuration is stable, although we haven't yet tested it with the full configuration, which is what I'm attempting to do now.

Unfortunately I'm having trouble with the mx_stream communication between the front ends and the dc process.  The dc does not appear to be receiving the streams from the front ends and is producing a '0xbad' status message for each.  I'm investigating.

  13152   Mon Jul 31 15:13:24 2017   gautam   Update   CDS   FB ---> FB1

[jamie, gautam]

In order to test the new daqd config that Jamie has been working on, we felt it would be most convenient for the host name "fb" (martian network IP 192.168.113.202) to point to the physical machine "fb1" (martian network IP 192.168.113.201).

I made this change in /var/lib/bind/martian.hosts on chiara, and then ran sudo service bind9 restart. It seems to have done the job. So as things stand, both hostnames "fb" and "fb1" point to 192.168.113.201.
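To verify the change from a control room machine, something like this (a hedged sketch using the stock host utility) should show both names resolving to the fb1 address:

# Hedged sketch: confirm the DNS change on the martian network.
host fb     # expect: fb has address 192.168.113.201
host fb1    # expect: fb1 has address 192.168.113.201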

Now, when starting up DTT or dataviewer, the NDS server is automatically found.

More details to follow.

  13153   Mon Jul 31 18:44:40 2017   Jamie   Update   CDS   CDS system essentially fully recovered

The CDS system is mostly fully recovered at this point.  The mx_streams are all flowing from all front ends, and from all models, and the daqd processes are receiving them and writing the data to frames:

Remaining unresolved issues:

  • IFO needs to be fully locked to make sure ALL components of all models are working.
  • The remaining red status lights are from the "FB NET" diagnostics, which are reflecting a missing status bit from the front end processes due to the fact that they were compiled with an earlier RCG version (3.0.3) than the mx_streams were (3.3+/trunk).  There will be a new release of the RTS soon, at which point we'll compile everything from the same version, which should get us all green again.
  • The entire system has been fully modernized, to the target CDS reference OS (Debian jessie) and more recent RCG versions.  The management of the various RTS components, both on the front ends and on fb, have as much as possible been updated to use the modern management tools (e.g. systemd, udev, etc.).  These changes need to be documented.  In particular...
  • The fb daqd process has been split into three separate components, a configuration that mirrors what is done at the sites and appears to be more stable:
    • daqd_dc: data concentrator (receives data from front ends)
    • daqd_fw: receives frames from dc and writes out full frames and second/minute trends
    • daqd_rcv: NDS1 server (raises test points and receives archive data from frames from the 'nds' process)
    The "target" directory for all of these new components is:
    • /opt/rtcds/caltech/c1/target/daqd
    All of these processes are now managed under systemd supervision on fb, meaning the daqd restart procedure has changed (see the sketch after this list).  This needs to be simplified and clarified.
  • Second trend frames are being written, but for some reason they're not accessible over NDS.
  • Have not had a chance to verify minute trend and raw minute trend writing yet.  Needs to be confirmed.
  • Get wiper script working on new fb.
  • Front end RTS kernel will occasionally crash when the RTS modules are unloaded.  Keith Thorne apparently has a kernel version with a different set of patches from Gerrit Kuhn that does not have this problem.  Keith's kernel needs to be packaged and installed in the front end diskless root.
  • The models accessing the dolphin shared memory will ALL crash when one of the front end hosts on the dolphin network goes away.  This results in a boot fest of all the dolphin-enabled hosts.  Need to figure out what's going on there.
  • The RCG settings snapshotting has changed significantly in later RCG versions.  We need to make sure that all burt backup type stuff is still working correctly.
  • Restoration of /frames from old fb SCSI RAID?
  • Backup of entirety of fb1, including fb1 root (/) and front end diskless root (/diskless)
  • Full documentation of rebuild procedure from Jamie's notes.
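For the record, a hedged sketch of what the new daqd restart procedure presumably looks like under systemd; the unit names are assumed from the component names above and should be checked first:

# Hedged sketch: restart the split daqd components under systemd supervision.
# Verify the actual unit names before relying on this:
sudo systemctl list-units 'daqd*'
# then restart, data concentrator first:
sudo systemctl restart daqd_dc daqd_fw daqd_rcv
sudo systemctl status daqd_dc daqd_fw daqd_rcv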