40m QIL Cryo_Lab CTN SUS_Lab TCS_Lab OMC_Lab CRIME_Lab FEA ENG_Labs OptContFac Mariner WBEEShop
  40m Log, Page 48 of 330  Not logged in ELOG logo
ID Date Author Type Categoryup Subject
  12613   Mon Nov 14 14:21:06 2016 gautamSummaryCDSReplacing DIMM on Optimus

I replaced the suspected faulty DIMM earlier today (actually I replaced a pair of them as per the Sun Fire X4600 manual). I did things in the following sequence, which was the recommended set of steps according to the maintenance manual and also the set of graphics on the top panel of the unit:

  1. Checked that Optimus was shut down
  2. Removed the power cables from the back to cut the standby power. Two of the fan units near the front of the chassis were displaying fault lights, perhaps this has been the case since the most recent power outage after which I did not reboot Optimus
  3. Took off the top cover, removed CPU 6 (labelled "G" in the unit). The manual recommends finding faulty DIMMs by looking for an LED that is supposed to indicate the location of the bad card, but I couldn't find any such LEDs in the unit we have, perhaps this is an addition to the newer modules?
  4. Replaced the topmost (w.r.t the orientation the CPU normally sits inside the chassis) DIMM card with one of the new ones Steve ordered
  5. Put everything back together, powered Optimus up again. Reboot went smoothly, fan unit fault lights which I mentioned earlier did not light up on the reboot so that doesn't look like an issue.

I then checked for memory errors using edac-utils, and over the last couple of hours, found no errors (corrected or otherwise, see Praful's earlier elog for the error messages that we were getting prior to the DIMM swap)- I guess we will need to monitor this for a while more before we can say that the issue has been resolved.

Looking at dmesg after the reboot, I noticed the following error messages (not related to the memory issue I think):

[   19.375865] k10temp 0000:00:18.3: unreliable CPU thermal sensor; monitoring disabled
[   19.375996] k10temp 0000:00:19.3: unreliable CPU thermal sensor; monitoring disabled
[   19.376234] k10temp 0000:00:1a.3: unreliable CPU thermal sensor; monitoring disabled
[   19.376362] k10temp 0000:00:1b.3: unreliable CPU thermal sensor; monitoring disabled
[   19.376673] k10temp 0000:00:1c.3: unreliable CPU thermal sensor; monitoring disabled
[   19.376816] k10temp 0000:00:1d.3: unreliable CPU thermal sensor; monitoring disabled
[   19.376960] k10temp 0000:00:1e.3: unreliable CPU thermal sensor; monitoring disabled
[   19.377152] k10temp 0000:00:1f.3: unreliable CPU thermal sensor; monitoring disabled

I wonder if this could explain why the fans on Optimus often go into overdrive and make a racket? For the moment, the fan volume seems normal, comparable to the other SunFire X4600s we have running like megatron and FB...

  12615   Mon Nov 14 19:32:51 2016 ranaSummaryCDSReplacing DIMM on Optimus

I did apt-get update and then apt-get upgrade on optimus. All systems are nominal.

  12632   Mon Nov 21 19:54:13 2016 JohannesUpdateCDSacromag chassis hooked up to PSL

[Lydia, Johannes]

We connected and powered up the Acromag chassis today. It lives in 1X4 and is powered by the Sorensen +20V power supply in 1X5 via the fuse rail on the side of 1X4. For this we had to branch off the 20V path to the dewhitening and anti-image filter crate of the c1:susaux driven SOS optics. After confirming that none of the daughter modules in the crate draw from the 20V line, we added a wire leading to a new fuse we added for this unit and ran a power cable from there.

The diagnostic connector of the PSL laser is now connected to the unit and a tmux session was created on megatron that interfaces with the chassis and broadcasts the EPICS channels. We need to watch out in the coming days for epics freezes/outages, as in the past these seemed to occur around the same times we were toying with the Acromags.

Quote:

We set up the chassis in 1X7 today. Steve is ordering a longer 25 pin cable to reach. Until then the PSL diagnostic channels will not be usable.

 

Attachment 1: acromag_chassis.jpg
acromag_chassis.jpg
  12649   Wed Nov 30 11:56:56 2016 LydiaUpdateCDSWiring for Acromag auxex replacement

I've attached a schematic for how we will connect the Acromag mosules to the slow channel I/O curently going to c1auxex. The following changes are made:

  • We are getting rid of the slow readbacks from the Anti-Image and Oplev boards, as Rana says they are unnnecessary.
  • The whitening switching for the QPD is currently done by a Contec "fast" binary I/O module, but can be managed by acromag instead. This alllows CAB_1Y9_34 to  be fed directly into the Acromag box since all of its connections can now be managed slow. 
  • There's no need to change the PD whitening scheme around (since the signals never get huge), so we can set those to always be on and then lose those Contec channels. This means all of the necessary pins on CAB_1Y9_10 can go to Acromag. 
  • All the other backplane cables go the the fast machines only. 

 

Attachment 1: auxex_acromag.pdf
auxex_acromag.pdf
  12651   Wed Nov 30 14:54:01 2016 JohannesUpdateCDSSlow machine replacement

I was talking with Larry yesterday, and he suggested the rack-mounted supermicro machines SYS-5017A-EP (~$400) or SYS-5018A-FTN4 (~$600) that he uses for moving data around in LIGO. They have 2 gigabit ethernet ports and can thus function as modbus gateways, conveniently placed in the rack close to the slow DAQ/DIO chassis and running some local ubuntu or other distro (I think Aidan uses CentOS in the PSL lab). These only have atom processors, which would be sufficient for the slow machine replacement, but there are many more powerful models with sometimes subtle differences. If we motion towards a more complete GigECam coverage in the lab it could be better to kill two birds with one stone and get something a little faster that can do the video capture/processing, since these machines will be distributed more or less strategically around the lab. Just a thought, as I have currently no clear idea what resources are required for this or how much we're throwing at this GigECam upgrade.

 

Quote:

I've attached a schematic for how we will connect the Acromag mosules to the slow channel I/O curently going to c1auxex. The following changes are made:

  • We are getting rid of the slow readbacks from the Anti-Image and Oplev boards, as Rana says they are unnnecessary.
  • The whitening switching for the QPD is currently done by a Contec "fast" binary I/O module, but can be managed by acromag instead. This alllows CAB_1Y9_34 to  be fed directly into the Acromag box since all of its connections can now be managed slow. 
  • There's no need to change the PD whitening scheme around (since the signals never get huge), so we can set those to always be on and then lose those Contec channels. This means all of the necessary pins on CAB_1Y9_10 can go to Acromag. 
  • All the other backplane cables go the the fast machines only. 

 

 

  12677   Wed Dec 14 19:16:57 2016 LydiaUpdateCDSAcromag Binary I/O testing

I looked into converting the QPD whitening switches for the X end to Acromag.

  • To test this out and be able to freely toggle filters without messing anything up, I added a temporary dummy cdsFiltCtrl module (ACROMAG_BIO_TEST) to the c1scx model.
  • The filters can be toggled from the automatically generated medm screen medm/c1scx/C1SCX_ACROMAG_BIO_TEST.adl
  • The control output of the dummy filter bank is sent to a channel named C1:SCX-ACROMAG_SWCTRL.
  • I was able to read in the appropriate bits from there and send them to the appropriate acromag channel using a calcout channel.
    • I couldn't get individual bo channels to work. This Acromag module is configured to write to 4 channels at a time, so I set that up with an analog output channel. The calcout channel shifts each relevant bit from C1:SCX-ACROMAG_SWCTRL to the right place for writing to the Acromag. 
  • I connected the Acromag XT1111 Binary I/O unit to a temporary power supply and verified that toggling the filters on and off changed the output appropriately. This is a sinking output model so the output pin is connected to the return if the switch is on. 

The plan from here:

  • Determine how to configure these outputs to be compatible with the QPD whitening board.
  • Modify the SUS PD whitening board to always use the analog filter and remove digital option in models.
  • Test DACs 
  • Verify that the QPD whitening gain switches aren't doing anything
  • Assemble new Acromag box for X end and connect to QPD whitening, SUS PD whitening and SOS driver boards
  12699   Tue Jan 10 16:20:11 2017 SteveUpdateCDSpower glitch......Raid is rebuilding

Jamie started the fm40m Raid rebuilding. It has been beeping since the power outage.

Summary pages have no reading since power glitch.

 

Attachment 1: rebuilding_in_progress.png
rebuilding_in_progress.png
  12700   Tue Jan 10 21:47:00 2017 ranaUpdateCDSpower glitch

Does "done" mean they are OK or they are somehow damaged? Do you mean the workstations or the front end machines?

The computers are all done.

megatron and optimus are not responding to ping commands or ssh -- please power them up if they are off; we need them to get data remotely

  12701   Tue Jan 10 22:55:43 2017 gautamUpdateCDSpower glitch - recovery steps

Here is a link to an elog with the steps I had to follow the last time there was a similar power glitch.

The RAID array restart was also done not too long ago, we should also do a data consistency check as detailed here, if not already..

If someone hasn't found the time to do this, I can take care of it tomorrow afternoon after I am back.

Quote:

Does "done" mean they are OK or they are somehow damaged? Do you mean the workstations or the front end machines?

The computers are all done.

megatron and optimus are not responding to ping commands or ssh -- please power them up if they are off; we need them to get data remotely

 

  12702   Wed Jan 11 16:35:03 2017 gautamUpdateCDSpower glitch - recovery progress

[lydia, ericq, gautam]

We set about following the instructions linked in the previous elog. A few notes/remarks:

  1. It is important to run the ntpdate commands before restarting the models. Sometimes, multiple restarts of the models were required to turn all the indicator blocks on the MEDM screen green.
  2. There was also an issue of multiple ntpd processes running on the same machine, which obviously caused all sorts of timing havoc. EricQ helped us diagnose and fix these. At the moment, all the lights are green on the CDS status MEDM screen
  3. On the hardware side, apart from the usual suspects of frontends/megatron/optimus/fb needing to be rebooted, I noticed that the ETMX OSEM lights were off on the control room monitors. Investigation pointed to the 2 20V sorensens at the X end outputting 0V, 0A after the power glitch. We turned down both dials, and then gradually ramped them up again. Both Sorensens now read +/-20V, 0.3A, which is in agreement with the label stuck onto them.
  4. Restarted MC autolocker and FSS Slow scripts on megatron. I have not yet looked at the status of the nds2 server on megatron.
  5. 11 MHz Marconi has yet to be restarted - but I am unable to get even the IMC locked at the moment. For some reason, the RMS of the MC1 and MC3 coils are way higher than what I am used to seeing (~5mV rms as compared to the <1mV rms I am used to seeing for a damped optic). I will investigate further. Leaving MC autolocker disabled for now.
  12708   Thu Jan 12 17:31:51 2017 gautamUpdateCDSDC errors

The IFO is more or less back to an operational state. Some details:

  1. The IMC mirror excess motion alluded to in the previous elog was due to some timing issues on c1sus. The "DAC" and "DK" blocks in the c1x02 diag word were red instead of green. Restarting all the models on c1sus fixed the problem
  2. When c1ioo was restarted, all of Koji's changes (digital) to the MC WFS servo where lost as they were not committed to the SDF. Eric suggested that I could just restore them from burt snapshots, which is what I did. I used the c1iooepics.snap file from 12:19PM PST on 26 December 2016, which was a time when the WFS servo was working well as per this elog by Koji. I have also committed all the changes to the SDF. IMC alignment has been stable for the last 4 hours.
  3. Johannes aligned and locked the arms today. There was a large DC offset on POX11, which was zeroed out by closing the PSL shutter and running LSC offsets. Both arms lock and stay aligned now.
  4. The doubling oven controller at the Y end was switched off. Johannes turned it on.
  5. Eric and I started a data consistency check on the RAID array yesterday, it has completed today and indicated no issues
  6. NDS2 is now running again on megatron so channel access from outside should(???) be possible again.

One error persists - the "DC" indicator (data concentrator?) on the CDS medm screen for the various models spontaneously go red and return to green often. Is this a known issue with an easy fix?

  12715   Fri Jan 13 21:41:23 2017 KojiUpdateCDSDC errors

I think I fixed the DC error issue

1. I added the leap second (leapsecond ?) entry for 2016/12/31, 23:60:00 UTC to daqdrc


[OLD]
set gps_leaps = 820108813 914803214 1119744016;
[NEW]
set gps_leaps = 820108813 914803214 1119744016 1167264018;

2. Restarted FB and all realtime models

Now I don't see any RED light.

  12727   Tue Jan 17 20:47:23 2017 ranaUpdateCDSSimulink Webview updated

Seems like this stops working every ~2 years. Its been busted since early 2016 according to cron, so I fixed up the paths and restored some missing files and committed things to the SVN (with comments!) and now its working and grabbing the Web viewable versions of the front end models. Just need to restore its viewability and then the world can watch our models any time.

Quote:

Back in 2011, JoeB wrote some entries on how to automatically update the Simulink webview stuff.

Somehow, the cron broke down over the years. I reran the matlab file by hand today and it worked fine, so now you can see the up to date models using the internet.

https://nodus.ligo.caltech.edu:30889/FE/

 

  12754   Wed Jan 25 14:30:20 2017 gautamUpdateCDSslow machine bootfest

[gautam, lydia]

We rebooted c1psl, c1iscaux and c1aux which were all showing the typical symptom of responding to ping but not to telnet (and also blanked out epics fields on the MEDM screens). Keyed all these crates.

Restored burt snapshots for c1psl, PMC locked fine, and IMC is also locked now.

Johannes forgot to elog this yesterday, but he rebooted c1susaux following the usual procedure to avoid getting ITMX stuck. 

 

  12762   Fri Jan 27 17:07:52 2017 LydiaUpdateCDSslow machine bootfest

Rebooted c1iscaux, c1auxex and c1auxey which were all not reponding to telnet. The watchdogs for the ETMs were turned off and then I keyed all 3 crates. All slow machines are reponding to telnet now. Both green lasers locked to the arms so I didn't do any burt restore.

  12763   Fri Jan 27 17:49:41 2017 jamieUpdateCDStest of new daqd code on fb1

Just FYI I'm running a test of updated daqd code on fb1. 

fb1 has it's own fiber to the daq network switch, so nothing had to be modified to do this test. This *should* not affect anything in the rest of the system, but as we all know these are famous last words....  If something is going haywire, and you can't get in touch with me and can't figure what else to do, you can just log on to fb1 and shut it down.  It's not writing any data to any of the network filesystems.

The daqd code under test is from the latest advLigoRTS 3.2.1 tag, which has daqd stability fixes that will hopefully address the problems we were seeing last time I tried this upgrade.  We'll see...

I'm going to let it run over the weekend, and will check in periodically.

  12765   Fri Jan 27 20:52:36 2017 gautamUpdateCDStest of new daqd code on fb1

I'm not sure if this is related, but since today morning, I've noticed that the data concentrator errors have returned. Looking at daqd.log, there is a 1 second timing mismatch error that is being generated. Usually, manually running ntpdate on the front ends fixes this problem, but it did not work today.

Attachment 1: DCerrors.png
DCerrors.png
  12766   Fri Jan 27 21:21:35 2017 gautamUpdateCDSc1pem revamped

The coil and PD BLRMS are useful tools in identifying when glitches occur in the PD  readout, I thought it would be good to install them for ITMY, ETMX and SRM (since I plan to switch the MC3 satellite box, which we suspect to be problematic, with the SRM one). For this purpose, I had to install some IPC SHMEM blocks in C1SUS and recompile. 24 IPC channels were added to pipe the coil, PD and Oplev signals from C1SUS to C1PEM - the recompilation went smoothly, and it doesn't look like the model computation time has increased significantly or that the model is any closer to timing out.

However, I was unable to install the BLRMS blocks in C1PEM, as when I tried to compile the model with BLRMS for these extra 24 channels, I got a compilation error saying that I have exceeded the maximum allowed 499 testpoints per channel. Is there any workaround to this? It would be possible to create a custom BLRMS block that doesn't have all those testpoints, maybe this is the way to go? Especially if we want to install these channels for all our SOS optics, and also replace the current Seismic BLRMS with this scheme for consistency?

GV edit: I have implemented this scheme - after backing up the original BLRMS_2k part, I made a new one with no testpoints and only EPICS readouts. Doing so allowed me to recompile c1pem without any issues, the CPU time seems to have gone up by 3us from ~55us to 58us. So the BLRMS data record is only available at 16Hz, since there are no DQ channels in the BRLMS block - do we want these in any case? Let's see how this does over the weekend...

  12769   Sat Jan 28 12:05:57 2017 jamieUpdateCDStest of new daqd code on fb1
Quote:

I'm not sure if this is related, but since today morning, I've noticed that the data concentrator errors have returned. Looking at daqd.log, there is a 1 second timing mismatch error that is being generated. Usually, manually running ntpdate on the front ends fixes this problem, but it did not work today.

If this problem started before ~4pm on Friday then it's probably unrelated, since I didn't start any of these tests until after that.  If unexplained problem persist then we can try shutting of the fb1 daqd and see if that helps.

  12770   Mon Jan 30 18:41:41 2017 jamieUpdateCDSTEST ABORTED of new daqd code on fb1

I just aborted the fb1 test and reverted everything to the nominal configuration.  Everything looks to be operating nominally.  Front ends are mostly green except for c1rfm and c1asx which are currently not being acquired by the DAQ, and an unknown IPC error with c1daf.  Please let me know if any unusual problems are encountered.

The behavior of daqd on fb1 with the latest release (3.2.1) was not improved.  After turning on the full pipe it was back to crashing every 10 minutes or so when the full and second trend frames were being written out.  lame.  back to the drawing board...

  12776   Tue Jan 31 15:08:13 2017 ericqMetaphysicsCDSMinute Trend Koan

A novice was learning at the feet of Master Daqd. At the end of the lesson he looked through his notes and said, “Master, I have a few questions. May I ask them?”

Master Daqd nodded.

"Do we record minute trends of our data?"

"Yes, we record raw minute trends in /frames/trend/minute_raw"

"I see. Do we back up minute trends?"

"Yes, we back up all frames present in /frames/trend/minute"

"Wait, this means we are not recording our current trends! What is the reason for the existence of seperate minute and minute_raw trends?

“The knowledge you seek can be answered only by the gods.”

"Can we resume recording the minute trends?"

Master Daqd nodded, turned, and threw himself off the railing, falling to his death on the rocks below.

Upon seeing this, the novice was enlightened. He proceeded to investigate how to convert raw minute trends to minute trends so that historical records could be preserved, and precisely when Master Daqd started throwing himself off the mountain when asked to record minute trends.

  12777   Tue Jan 31 17:28:36 2017 ranaSummaryCDSMinute Trend Koan

Someone installed "Debian" on allegra. Why? Dataviewer doesn't work on there. Is there some advantage to making this thing have a different OS than the others? Any objections to going back to Ubuntu12?

  12779   Tue Jan 31 20:25:26 2017 ericqSummaryCDSMinute Trend Koan
Quote:

Someone installed "Debian" on allegra. Why? Dataviewer doesn't work on there. Is there some advantage to making this thing have a different OS than the others? Any objections to going back to Ubuntu12?

My elog negligence punchcard is getting pretty full... It's pretty much for the same reason as using Debian for optimus; much of the workstation software is getting packaged for Debian, which could offload our need for setting things up in a custom 40m way. Hacking the debian-focused software.ligo.org repos into Ubuntu has caused me headaches in the past. Allegra wasn't being used often, so I figured it was a good test bed for trying things out.

The dataviewer issue was dataviewer's inability to pull the `fb` out of `fb:8088` in the NDSSERVER env variable. I made a quick fix for it in the dataviewer launching script, but there is probably a better way to do it.

  12781   Tue Jan 31 22:15:02 2017 JohannesUpdateCDSvme crate backplane adapter boards

I made a crude sketch for how Lydia and I envision the connector situation on the back of the vme crates to be solved. Essentially the side panels of each crate extend about 2" (52 mm) beyond the edge of the DIN connectors. This is plenty of space for a simple PCB board. The connector of choice is D-Sub. We can split the 64 used pins into 2x 37 D-Sub OR (2x25 pin + 1x15pin). The former has fewer cables, but a few excess unused leads. A quick google search showed me that it is much cheaper to get twisted pair cables for 15 and 25 pin D-Subs. From what I remember, the used pins on the DIN connectors are concentrated on the low numbers end and the high numbers end, so might not need the 'middle' connector in many cases if we decide to break it up into three. I have to check this with Lydia though.

The D-Sub connectors would be panel mounted, for which we need a narrow panel piece with dsub cutouts. We can run horizontal struts across the vme crate from side panel to side panel. This way the force upon cable (dis)connection is mostly on the panel which is attached to the struts which are attached to the crate. This will also prevent gravitational sag or cable strain from pulling on the DIN connection, and we can use twisted pair cables with backshell, screws, and strain reliefs.

I was lookng into getting started with the PCB when Altium complained that the license is expired and to renew it. This is a relatively simple board layout so some free software out there is probably enough.

Attachment 1: vme_backplane_conn_sketch.jpg
vme_backplane_conn_sketch.jpg
  12791   Thu Feb 2 18:28:29 2017 ranaSummaryCDSMinute Trend Koan

and the song remains the same...

the version of SVN on these workstations is ahead of the one on the other workstations so now we can't do 'svn up' on any of the Ubuntu12 machines. One allegra and optimus I get this error:

controls@allegra|GWsummaries> svn up
Updating '.':
svn: E180001: Unable to connect to a repository at URL 'file:///cvs/cds/caltech/svn/trunk/GWsummaries'
svn: E180001: Unable to open an ra_local session to URL
svn: E180001: Unable to open repository 'file:///cvs/cds/caltech/svn/trunk/GWsummaries'

Quote:
Quote:

Someone installed "Debian" on allegra. Why? Dataviewer doesn't work on there. Is there some advantage to making this thing have a different OS than the others? Any objections to going back to Ubuntu12?

My elog negligence punchcard is getting pretty full... It's pretty much for the same reason as using Debian for optimus; much of the workstation software is getting packaged for Debian, which could offload our need for setting things up in a custom 40m way. Hacking the debian-focused software.ligo.org repos into Ubuntu has caused me headaches in the past. Allegra wasn't being used often, so I figured it was a good test bed for trying things out.

The dataviewer issue was dataviewer's inability to pull the `fb` out of `fb:8088` in the NDSSERVER env variable. I made a quick fix for it in the dataviewer launching script, but there is probably a better way to do it.

I'm not sure if its possible to downgrade our chans repo back to the old one, but I highly recommend that no one do 'svn upgrade' in any of our repos until we remove all of the Debian installs in the 40m lab or hire a full-time sysadmin.

  12794   Fri Feb 3 11:03:06 2017 jamieUpdateCDSmore testing fb1; DAQ DOWN DURING TEST

More testing of fb1 today.  DAQ DOWN UNTIL FURTHER NOTICE.

Testing Wednesday did not resolve anything, but Jonathan Hanks is helping.

  12796   Fri Feb 3 11:40:34 2017 ericqSummaryCDS/cvs/cds/caltech/chans back on svn1.6

I was able to bring back svn 1.6 formatting to /cvs/cds/caltech/chans by doing the following on nodus:

cd /cvs/cds/caltech
mkdir newchans
cd newchans
svn co https://nodus.ligo.caltech.edu:30889/svn/trunk/chans ./
rm -rf ../chans/.svn
mv ./.svn ../chans/

Note that I used the http address for the repository. The svn repository doesn't live at file:///cvs/cds/caltech/svn anymore; all of our checkouts (e.g. in the scripts directory) use http to get the one true repo location, regardless of where it lives on nodus' filesystem. (I suppose we could also use https://nodus.martian:30889/svn to stick to the local network, but I don't think we're that limited by the caltech network speed)

Presumably, at some point we will want to introduce a newer operating system into the 40m, as ubuntu 12.04 hits end-of-life in April 2017. Ubuntu 16.04 includes svn 1.8, so we'll also hit this issue if we choose that OS. 


Aside from the svn issues, this directory (/cvs/cds/caltech/chans) only contains pre-2010 channels. Filters and DAQ ini files currently live in /opt/rtcds/caltech/c1/chans, which is not under version control. It's also not clear to me why summary page configurations should be kept in this /cvs/cds place.

  12797   Sat Feb 4 12:00:59 2017 ranaSummaryCDS/cvs/cds/caltech/chans back on svn1.6

True - its an issue. Koji and I are updating zita into Ubuntu16 LTS. If it looks like its OK with various tools we'll swap over the others into it. Until then I figure we're best off turning allegra back into Ubuntu12 to avoid a repeat of this kind of conflict. Once the workstations in the LLO control room are running smoothly on a new OS for a year, we can transfer into that. I don't think any of us wants to be the CDS beta tester for DV or DTT.

  12798   Sat Feb 4 12:20:39 2017 jamieSummaryCDS/cvs/cds/caltech/chans back on svn1.6
Quote:

True - its an issue. Koji and I are updating zita into Ubuntu16 LTS. If it looks like its OK with various tools we'll swap over the others into it. Until then I figure we're best off turning allegra back into Ubuntu12 to avoid a repeat of this kind of conflict. Once the workstations in the LLO control room are running smoothly on a new OS for a year, we can transfer into that. I don't think any of us wants to be the CDS beta tester for DV or DTT.

Just to be clear, since there seems to be some confusion, the SVN issue has nothing to do with Debian vs. Ubuntu.  SVN made non-backwards compatible changes to their working copy data format that breaks newer checkouts with older clients.  You will run into the exact same problem with newer Ubuntu versions.

I recommend the 40m start moving towards the reference operating systems (Debian 8 or SL7) as that's where CDS is moving.  By moving to newer Ubuntu versions you're moving away from CDS support, not towards it.

  12799   Sat Feb 4 12:29:20 2017 jamieSummaryCDS/cvs/cds/caltech/chans back on svn1.6

No, not confused on that point. We just will not be testing OS versions at the 40m or running multiple OS's on our workstations. As I've said before, we will only move to so-called 'reference' systems once they've been in use for a long time.

Quote:
Quote:

True - its an issue. Koji and I are updating zita into Ubuntu16 LTS. If it looks like its OK with various tools we'll swap over the others into it. Until then I figure we're best off turning allegra back into Ubuntu12 to avoid a repeat of this kind of conflict. Once the workstations in the LLO control room are running smoothly on a new OS for a year, we can transfer into that. I don't think any of us wants to be the CDS beta tester for DV or DTT.

Just to be clear, since there seems to be some confusion, the SVN issue has nothing to do with Debian vs. Ubuntu.  SVN made non-backwards compatible changes to their working copy data format that breaks newer checkouts with older clients.  You will run into the exact same problem with newer Ubuntu versions.

I recommend the 40m start moving towards the reference operating systems (Debian 8 or SL7) as that's where CDS is moving.  By moving to newer Ubuntu versions you're moving away from CDS support, not towards it.

 

  12800   Sat Feb 4 12:50:01 2017 jamieSummaryCDS/cvs/cds/caltech/chans back on svn1.6
Quote:

No, not confused on that point. We just will not be testing OS versions at the 40m or running multiple OS's on our workstations. As I've said before, we will only move to so-called 'reference' systems once they've been in use for a long time.

Ubuntu16 is not to my knowledge used for any CDS system anywhere.  I'm not sure how you expect to have better support for that.  There are no pre-compiled packages of any kind available for Ubuntu16.  Good luck, you big smelly doofuses. Nyah, nyah, nyah.

  12803   Mon Feb 6 15:18:08 2017 gautamUpdateCDSslow machine bootfest

Had to reboot c1psl, c1susaux, c1auxex, c1auxey and c1iscaux today. PMC has been relocked. ITMX didn't get stuck. According to this thread, there have been two instances in the last 10 days in which c1psl and c1susaux have failed. Since we seem to be doing this often lately, I've made a little script that uses the netcat utility to check which slow machines respond to telnet, it is located at /opt/rtcds/caltech/c1/scripts/cds/testSlowMachines.bash.

The script can be executed by ./testSlowMachines.bash.

  12810   Tue Feb 7 19:14:59 2017 JohannesUpdateCDSvme crate backplane adapter board layout

After fighting with Altium for what seems like an eternity I have finished putting my vision of the vme crate backplane adapter board into an electronic format. It is dimensioned to fill the back space of the crate exactly. The connectors are panel mount and the PCB attaches to the connectors with screws, such that the whole thing will be mechanically much more stable than the current configuration. A mounting bracket will attach to horizontal struts that need to be installed in the crates, mechanical drawings to follow.

Attachment 1: vme_backplane.pdf
vme_backplane.pdf
  12893   Mon Mar 20 11:18:58 2017 gautamUpdateCDSNo internet connectivity on control room machines

There is no internet connectivity on any of the control room machines. 

I have been trying to debug by tracing the cabling situation in the rack in the office area, and will update if/when this problem has been resolved. I had last come into the lab on Saturday and there was no problem then. There 40m wireless network servicing the office area seems to work fine.

 

  12894   Mon Mar 20 14:39:44 2017 gautamUpdateCDSNo internet connectivity on control room machines

Koji diagnosed that the NAT router was to blame for this problem. I simply power cycled this router, and now the connectivity has been restored. 

It was possible to log into nodus and then to pianosa - and it was also possible to log into the various control room machines once logged into nodus. However, the outward packets seemed to not get transmitted. Anyways, power cycling the NAT Router unit seems to have done the job.

Quote:

There is no internet connectivity on any of the control room machines. 

I have been trying to debug by tracing the cabling situation in the rack in the office area, and will update if/when this problem has been resolved. I had last come into the lab on Saturday and there was no problem then. There 40m wireless network servicing the office area seems to work fine.

 

 

  12914   Tue Mar 28 21:06:53 2017 ranaSummaryCDS/cvs/cds/caltech/chans back on svn1.6

Debian doesn't like EPICS. Or our XY plots of beam spots...Sad!

Quote:
Quote:

No, not confused on that point. We just will not be testing OS versions at the 40m or running multiple OS's on our workstations. As I've said before, we will only move to so-called 'reference' systems once they've been in use for a long time.

Ubuntu16 is not to my knowledge used for any CDS system anywhere.  I'm not sure how you expect to have better support for that.  There are no pre-compiled packages of any kind available for Ubuntu16.  Good luck, you big smelly doofuses. Nyah, nyah, nyah.

  12924   Mon Apr 3 17:09:47 2017 gautamUpdateCDSC1PSL burt-restored

When I came in this morning, Steve had re-locked the PMC and IMC - but I could see a ~1Hz intensity fluctuation on the PMC REFL video monitor. I unlocked the PMC and tried to re-lock it, but couldn't using the usual prescription of turning the servo gain down and moving the DC bias slider around. I checked the status of the slow machines - all were responding to pings and could be telnet'ed into, so that didn't seem to be the problem. In the past, this sort of behaviour was characteristic of the infamous "sticky slider" problem - so I simply burt-restored c1psl using a snapshot from 29 March, after which I could easily re-lock the PMC. The transmitted light level looked normal on the scope on the PSL table, and the PMC REFL video monitor also look normal now.

  12980   Wed May 10 12:37:41 2017 gautamUpdateCDSMCautolocker dead

The MCautolocker had stalled - there were no additional lines to the logfile after 12:17pm (~20mins ago). Normally, it suffices to ssh into megatron and run sudo initctl restart MCautolocker - but it seems that there was no running initctl instance of this, so I had to run sudo initctl start MCautolocker. The FSS Slow control initctl process also seemed to have been terminated, so I ran sudo initctl start FSSslowPy.

It is not clear to me why the initctl instances got killed in the first place, but MC locks fine now.

  12982   Wed May 10 16:57:52 2017 ranaUpdateCDSMCautolocker dead

I rebooted megatron around 12:20 today. It had dozens of stalled medm process (some of them there since February!). I couldn't kill them without them coming back like zombies, so I did sudo reboot.

  12991   Mon May 15 08:26:43 2017 ranaUpdateCDSSVN up in userapps/cds

I did an 'svn update' in userapps/cds/ which pulled in some changes from the sites as well as various CDS utilities in common/ and utilities/

This was to get Keith Thorne's get_data.m and get_data2.m scripts which I tested and they seem to be able to get data. No success with getting minute trend yet, but that may be a user error.

Update Monday 15-May: Our version of NDS client is 0.10 and we need to have 0.14 for this new method to work. Ubuntu12 lscsoft repo doesn't have newer nds client so we'll have to upgrade some OS.

  13012   Thu May 25 12:22:59 2017 gautamUpdateCDSslow machine bootfest

After ~3months without any problems on the slow machine front, I had to reboot c1psl, c1susaux and c1iscaux today. The control room StripTool traces were not being displayed for all the PSL channels so I ran testSlowMachines.bash to check the status of the slow machines, which indicated that these three slow machines were dead. After rebooting the slow machines, I had to burt-restore the c1psl snapshot as usual to get the PMC to lock. Now, both PMC and IMC are locked. I also had to restart the StripTool traces (using scripts/general/startStrip.sh) to get the unresponsive traces back online.

Steve tells me that we probably have to do a reboot of the vacuum slow machines sometime soon too, as the MEDM screen for the Vacuum indicator channels are unresponsive.

Quote:

Had to reboot c1psl, c1susaux, c1auxex, c1auxey and c1iscaux today. PMC has been relocked. ITMX didn't get stuck. According to this thread, there have been two instances in the last 10 days in which c1psl and c1susaux have failed. Since we seem to be doing this often lately, I've made a little script that uses the netcat utility to check which slow machines respond to telnet, it is located at /opt/rtcds/caltech/c1/scripts/cds/testSlowMachines.bash.

 

 

  13028   Thu Jun 1 15:37:01 2017 gautamUpdateCDSslow machine bootfest

Steve alerted me that the IMC wouldn't lock. Reboots for c1susaux, c1iool0 today. I tried using the reset button instead of keying the crates. This worked for c1iool0, but not for c1susaux. So I had to key the latter crate. The machine took a good 5-10 minutes before coming back up, but eventually it did. Now IMC locks fine.

  13059   Mon Jun 12 10:34:10 2017 gautamUpdateCDSslow machine bootfest

Reboots for c1susaux, c1iscaux, c1auxex today. I took this opportunity to squish the Sat. Box. Cabling for MC2 (both on the Sat box end and also the vacuum feedthrough) as some work has been recently ongoing there, maybe something got accidently jiggled during the process and was causing MC2 alignment to jump around.

Relocked PMC to offload some of the DC offset, and re-aligned IMC after c1susaux reboot. PMC and IMC transmission back to nominal levels now. Let's see if MC2 is better behaved after this sat. box. voodoo.

Interestingly, since Feb 6, there were no slow machine reboots for almost 3 months, while there have been three reboots in the last three weeks. Not sure what (if anything) to make of that.

  13069   Fri Jun 16 13:53:11 2017 gautamUpdateCDSslow machine bootfest

Reboots for c1psl, c1iool0, c1iscaux today. MC autolocker log was complaining that the C1:IOO-MC_AUTOLOCK_BEAT EPICS channel did not exist, and running the usual slow machine check script revealed that these three machines required reboots. PMC was relocked, IMC Autolocker was restarted on Megatron and everything seems fine now.

 

  13096   Wed Jul 5 16:09:34 2017 gautamUpdateCDSslow machine bootfest

Reboots for c1susaux, c1iscaux today.

 

  13125   Wed Jul 19 08:37:21 2017 JamieUpdateCDSUpdate on front-end/DAQ rebuild

After the catastrophic fb disk failure last week we lost essentially the entire front end system (not any of the userapp code, but the front end boot server, operating system, and DAQ).  The fb disk was entirely unrecoverable, so we've been trying to rebuild everything from the bits and pieces lying around, and some disks that Keith Thorne sent from LLO.  We're trying to get the front ends working first, and will work on recovering daqd after.

Luckily, fb1, which was being configured as an fb replacement, is mostly fully configured, including having a copy of the front end diskless root image.  We setup fb1 as the new boot server, and were able to get front ends booting again.  Unfortunately, we've been having trouble running and building models, so something is still amis.  We've been taking a three-pronged approach to getting the front ends running:

  • /diskless/root.fb: This involves booting the front ends from the backup of the diskless root from fb.  Runs gentoo kernel 2.6.34.1.  This should correspond to the environment that all models were built and running against.  But something is missing in the configuration.  The front ends were also mounting /opt from fb, which included the dolphin drivers, and we don't have a copy of that, so models aren't loading or recompiling.
  • /diskless/root.x1boot: Keith sent a disk image of the entire x1boot server from LLO.  It uses gentoo kernel 3.0.8.  This ostensibly includes everything we should need to run the front ends, but it's unfortunately configured with newer versions of some of the software and also isn't loading our existing models or building new ones.  This also seems to be having issues with the dolphin drivers.
  • /diskless/root.jessie: This is an entirely new boot image build from scratch with Debian jessie, using an RTS-patched 3.2 kernel.  This would use the latest versions of everything.  It's mostly working, we just need to rebuild the dolphin driver and source.

It seems that in all cases we need to rebuild the dolphin drivers from source.

  13127   Wed Jul 19 14:26:50 2017 JamieUpdateCDSUpdate on front-end/DAQ rebuild

 

Quote:

After the catastrophic fb disk failure last week we lost essentially the entire front end system (not any of the userapp code, but the front end boot server, operating system, and DAQ).  The fb disk was entirely unrecoverable, so we've been trying to rebuild everything from the bits and pieces lying around, and some disks that Keith Thorne sent from LLO.  We're trying to get the front ends working first, and will work on recovering daqd after.

Luckily, fb1, which was being configured as an fb replacement, is mostly fully configured, including having a copy of the front end diskless root image.  We setup fb1 as the new boot server, and were able to get front ends booting again.  Unfortunately, we've been having trouble running and building models, so something is still amis.  We've been taking a three-pronged approach to getting the front ends running:

  • /diskless/root.fb: This involves booting the front ends from the backup of the diskless root from fb.  Runs gentoo kernel 2.6.34.1.  This should correspond to the environment that all models were built and running against.  But something is missing in the configuration.  The front ends were also mounting /opt from fb, which included the dolphin drivers, and we don't have a copy of that, so models aren't loading or recompiling.
  • /diskless/root.x1boot: Keith sent a disk image of the entire x1boot server from LLO.  It uses gentoo kernel 3.0.8.  This ostensibly includes everything we should need to run the front ends, but it's unfortunately configured with newer versions of some of the software and also isn't loading our existing models or building new ones.  This also seems to be having issues with the dolphin drivers.
  • /diskless/root.jessie: This is an entirely new boot image build from scratch with Debian jessie, using an RTS-patched 3.2 kernel.  This would use the latest versions of everything.  It's mostly working, we just need to rebuild the dolphin driver and source.

It seems that in all cases we need to rebuild the dolphin drivers from source.

To clarify, we're able to boot the x1boot image with the existing 2.6.25 kernel that we have from fb.  The issue with the root.x1boot image is not the kernel version but some of the other support libraries, such as dolphin.

  13130   Fri Jul 21 18:03:17 2017 JamieUpdateCDSUpdate on front-end/DAQ rebuild

Update:

  • front ends booting with the new Debian jessie diskless root image and a linux 3.2 version of the RTS-patched kernel
  • dolphin is configured correctly and running on c1lsc and c1sus
  • models building and running with RCG 3.0.3

Up next:

  • add c1ioo to the dolphin network
  • recompile/restart all front end models
  • daqd

I'll try to get the first two of those done tomorrow, although it's unclear what model updates we'll have to do to get things working with the newer RCG.

 

  13133   Sun Jul 23 22:16:55 2017 Jamie, gautamUpdateCDSfront-end now running with new OS, RCG

All front ends and model are (mostly) running now

All suspensions are damped:

It should be possible at this point to do more recovery, like locking the MC.

Some details on the restore process:

  • all models were recompiled with the new RCG version 3.0.3
  • the new RCG does stricter simulink drawing checks, and was complaining about unterminated outputs in some of the SUS models.  Terminated all outputs it was concerned about and saved.
  • RCG 3.0 requires a new directory for doing better filter module diagnostics: /opt/rtcds/caltech/c1/chans/tmp
  • had to reset the slow machines c1susaux, c1auxex, c1auxey

The daqd is not yet running.  This is the next task.

I have been taking copious notes and will fully document the restore process once complete.

c1ioo issues

c1ioo has been giving us a little bit of trouble.  The c1ioo model kept crashing and taking down the whole c1ioo host.  We found a red light on one of the ADCs (ADC1).  We pulled the card and replaced it with a spare from the CDS cabinet.  That seemed to fix the problem and c1ioo became more stable.

We've still been seeing a lot of glitching in c1ioo, though, with CPU cycle times frequently (every couple of seconds) running above threshold for all models, up to 200 us.  I tried unloading every kernel module I could and shutting down every non-critical process, but nothing seemed to help.

We eventually tried stopping the c1ioo model altogether and that seemed to help quite a bit, dropping the long cycle rate down to something like one every 30 seconds or so.  Not sure what that means.  We should look into the BIOS again, to see if there could be something interacting with the newer kernel.

So currently the c1ioo model is not running (which is why it's all white in the CDS overview snapshot above).  The fact that c1ioo is not running and the remaining models are still occaissionly glitching is also causing various IPC errors on auxilliary models (see c1mcs, c1rfm, c1ass, c1asx). 

RCG compile warnings

the new RCG tries to do more checks on custom c code, but it seems to be having trouble finding our custom "ccodeio.h" files that live with the c definitions in USERAPPS/*/common/src/.  Unclear why yet.  This is causing the RCG to spit out warnings like the following:

Cannot verify the number of ins/outs for C function BLRMS.
    File is /opt/rtcds/userapps/release/cds/c1/src/BLRMSFILTER.c
    Please add file and function to CDS_SRC or CDS_IFO_SRC ccodeio.h file.

This are just warnings and will not prevent the model form compiling or warning.  We'll figure out what the problem is to make these go away, but they can be ignored for the time being.

model unload instability

Probably the worst problem we're facing right now is an instability that will occaissionally, but not always, cause the entire front end host to freeze up upon unloading an RTS kernel module.  This is a known issue with the newer linux kernels (we're using kernel version 3.2.35), and is being looked into.

This is particularly annoying with the machines on the dolphin network, since if one of the dolphin hosts goes down it manages to crash all the models reading from the dolphin network.  Since half the time they can't be cleanly restarted, this tends to cause a boot fest with c1sus, c1lsc, and c1ioo.  If this happens, just restart those machines, wait till they've all fully booted, then restart all the models on all hosts with "rtcds start all".

  13135   Mon Jul 24 10:45:23 2017 gautamUpdateCDSc1iscex models died

This morning, all the c1iscex models were dead. Attachment #1 shows the state of the cds overview screen when I came in. The machine itself was ssh-able, so I just restarted all the models and they came back online without fuss.

Quote:

All front ends and model are (mostly) running now

Attachment 1: c1iscexFailure.png
c1iscexFailure.png
ELOG V3.1.3-