I replaced the suspected faulty DIMM earlier today (actually I replaced a pair of them as per the Sun Fire X4600 manual). I did things in the following sequence, which was the recommended set of steps according to the maintenance manual and also the set of graphics on the top panel of the unit:
I then checked for memory errors using edac-utils and, over the last couple of hours, found no errors (corrected or otherwise; see Praful's earlier elog for the error messages that we were getting prior to the DIMM swap). I guess we will need to monitor this for a while longer before we can say that the issue has been resolved.
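For longer-term monitoring, the same counters that edac-utils reports can be polled directly. The following is a minimal sketch, assuming the standard Linux EDAC sysfs layout (`/sys/devices/system/edac/mc/mc*/ce_count` and `ue_count`) is present on optimus; the function names are hypothetical and this is not the tool that was actually used.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller_name: (corrected, uncorrected)} from the EDAC sysfs tree.

    An empty dict is returned if the tree is missing (EDAC driver not loaded).
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters cannot be read or parsed.
            continue
        counts[mc.name] = (ce, ue)
    return counts


def has_memory_errors(counts):
    """True if any memory controller reports a corrected or uncorrected error."""
    return any(ce > 0 or ue > 0 for ce, ue in counts.values())


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

Run periodically (for example from cron), this would show whether the corrected-error count starts climbing again after the DIMM swap.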
Looking at dmesg after the reboot, I noticed the following error messages (not related to the memory issue I think):
[ 19.375865] k10temp 0000:00:18.3: unreliable CPU thermal sensor; monitoring disabled
[ 19.375996] k10temp 0000:00:19.3: unreliable CPU thermal sensor; monitoring disabled
[ 19.376234] k10temp 0000:00:1a.3: unreliable CPU thermal sensor; monitoring disabled
[ 19.376362] k10temp 0000:00:1b.3: unreliable CPU thermal sensor; monitoring disabled
[ 19.376673] k10temp 0000:00:1c.3: unreliable CPU thermal sensor; monitoring disabled
[ 19.376816] k10temp 0000:00:1d.3: unreliable CPU thermal sensor; monitoring disabled
[ 19.376960] k10temp 0000:00:1e.3: unreliable CPU thermal sensor; monitoring disabled
[ 19.377152] k10temp 0000:00:1f.3: unreliable CPU thermal sensor; monitoring disabled
I wonder if this could explain why the fans on Optimus often go into overdrive and make a racket? For the moment, the fan volume seems normal, comparable to the other SunFire X4600s we have running like megatron and FB...
I did apt-get update and then apt-get upgrade on optimus. All systems are nominal.
We connected and powered up the Acromag chassis today. It lives in 1X4 and is powered by the Sorensen +20V power supply in 1X5 via the fuse rail on the side of 1X4. For this we had to branch off the 20V path to the dewhitening and anti-image filter crate of the c1:susaux driven SOS optics. After confirming that none of the daughter modules in the crate draw from the 20V line, we added a wire leading to a new fuse we added for this unit and ran a power cable from there.
The diagnostic connector of the PSL laser is now connected to the unit and a tmux session was created on megatron that interfaces with the chassis and broadcasts the EPICS channels. We need to watch out in the coming days for epics freezes/outages, as in the past these seemed to occur around the same times we were toying with the Acromags.
We set up the chassis in 1X7 today. Steve is ordering a longer 25 pin cable to reach. Until then the PSL diagnostic channels will not be usable.
I've attached a schematic for how we will connect the Acromag modules to the slow channel I/O currently going to c1auxex. The following changes are made:
I was talking with Larry yesterday, and he suggested the rack-mounted supermicro machines SYS-5017A-EP (~$400) or SYS-5018A-FTN4 (~$600) that he uses for moving data around in LIGO. They have ≥2 gigabit ethernet ports and can thus function as modbus gateways, conveniently placed in the rack close to the slow DAQ/DIO chassis and running some local ubuntu or other distro (I think Aidan uses CentOS in the PSL lab). These only have atom processors, which would be sufficient for the slow machine replacement, but there are many more powerful models with sometimes subtle differences. If we motion towards a more complete GigECam coverage in the lab it could be better to kill two birds with one stone and get something a little faster that can do the video capture/processing, since these machines will be distributed more or less strategically around the lab. Just a thought, as I have currently no clear idea what resources are required for this or how much we're throwing at this GigECam upgrade.
I looked into converting the QPD whitening switches for the X end to Acromag.
The plan from here:
Jamie started the fm40m Raid rebuilding. It has been beeping since the power outage.
Summary pages have no reading since power glitch.
Does "done" mean they are OK or they are somehow damaged? Do you mean the workstations or the front end machines?
The computers are all done.
megatron and optimus are not responding to ping commands or ssh -- please power them up if they are off; we need them to get data remotely
Here is a link to an elog with the steps I had to follow the last time there was a similar power glitch.
The RAID array restart was also done not too long ago, we should also do a data consistency check as detailed here, if not already..
If someone hasn't found the time to do this, I can take care of it tomorrow afternoon after I am back.
[lydia, ericq, gautam]
We set about following the instructions linked in the previous elog. A few notes/remarks:
The IFO is more or less back to an operational state. Some details:
One error persists - the "DC" indicators (data concentrator?) on the CDS medm screen for the various models often spontaneously go red and then return to green. Is this a known issue with an easy fix?
I think I fixed the DC error issue
1. I added the leap second (leapsecond ?) entry for 2016/12/31, 23:59:60 UTC to daqdrc
set gps_leaps = 820108813 914803214 1119744016;
set gps_leaps = 820108813 914803214 1119744016 1167264018;
2. Restarted FB and all realtime models
Now I don't see any RED light.
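As a cross-check on the new entry, the value appended to gps_leaps above equals the GPS time of 2017-01-01 00:00:00 UTC, the first second after the inserted leap second. The sketch below reproduces that number. The leap-second table in it is an assumption typed in for illustration and should be checked against the IERS bulletins before being relied on; `utc_to_gps` is a hypothetical helper, not part of daqd.

```python
from datetime import datetime, timezone

# GPS time counts seconds since 1980-01-06 00:00:00 UTC without leap seconds.
GPS_EPOCH = datetime(1980, 1, 6, tzinfo=timezone.utc)

# UTC dates (year, month, day) at 00:00:00 on which the GPS-UTC offset grew by 1 s,
# i.e. the instant right after each leap second since the GPS epoch.
LEAP_EFFECTIVE_DATES = [
    (1981, 7, 1), (1982, 7, 1), (1983, 7, 1), (1985, 7, 1), (1988, 1, 1),
    (1990, 1, 1), (1991, 1, 1), (1992, 7, 1), (1993, 7, 1), (1994, 7, 1),
    (1996, 1, 1), (1997, 7, 1), (1999, 1, 1), (2006, 1, 1), (2009, 1, 1),
    (2012, 7, 1), (2015, 7, 1), (2017, 1, 1),
]


def utc_to_gps(t):
    """Convert a UTC datetime to integer GPS seconds (naive datetimes are taken as UTC)."""
    if t.tzinfo is None:
        t = t.replace(tzinfo=timezone.utc)
    elapsed = int((t - GPS_EPOCH).total_seconds())
    leaps = sum(
        1
        for (y, m, d) in LEAP_EFFECTIVE_DATES
        if t >= datetime(y, m, d, tzinfo=timezone.utc)
    )
    return elapsed + leaps


print(utc_to_gps(datetime(2017, 1, 1)))  # 1167264018
```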
Seems like this stops working every ~2 years. It's been busted since early 2016 according to cron, so I fixed up the paths, restored some missing files, and committed things to the SVN (with comments!), and now it's working and grabbing the Web viewable versions of the front end models. Just need to restore its viewability and then the world can watch our models any time.
Back in 2011, JoeB wrote some entries on how to automatically update the Simulink webview stuff.
Somehow, the cron broke down over the years. I reran the matlab file by hand today and it worked fine, so now you can see the up to date models using the internet.
We rebooted c1psl, c1iscaux and c1aux which were all showing the typical symptom of responding to ping but not to telnet (and also blanked out epics fields on the MEDM screens). Keyed all these crates.
Restored burt snapshots for c1psl, PMC locked fine, and IMC is also locked now.
Johannes forgot to elog this yesterday, but he rebooted c1susaux following the usual procedure to avoid getting ITMX stuck.
Rebooted c1iscaux, c1auxex and c1auxey which were all not responding to telnet. The watchdogs for the ETMs were turned off and then I keyed all 3 crates. All slow machines are responding to telnet now. Both green lasers locked to the arms so I didn't do any burt restore.
Just FYI I'm running a test of updated daqd code on fb1.
fb1 has its own fiber to the daq network switch, so nothing had to be modified to do this test. This *should* not affect anything in the rest of the system, but as we all know these are famous last words.... If something is going haywire, and you can't get in touch with me and can't figure what else to do, you can just log on to fb1 and shut it down. It's not writing any data to any of the network filesystems.
The daqd code under test is from the latest advLigoRTS 3.2.1 tag, which has daqd stability fixes that will hopefully address the problems we were seeing last time I tried this upgrade. We'll see...
I'm going to let it run over the weekend, and will check in periodically.
I'm not sure if this is related, but since today morning, I've noticed that the data concentrator errors have returned. Looking at daqd.log, there is a 1 second timing mismatch error that is being generated. Usually, manually running ntpdate on the front ends fixes this problem, but it did not work today.
The coil and PD BLRMS are useful tools in identifying when glitches occur in the PD readout, I thought it would be good to install them for ITMY, ETMX and SRM (since I plan to switch the MC3 satellite box, which we suspect to be problematic, with the SRM one). For this purpose, I had to install some IPC SHMEM blocks in C1SUS and recompile. 24 IPC channels were added to pipe the coil, PD and Oplev signals from C1SUS to C1PEM - the recompilation went smoothly, and it doesn't look like the model computation time has increased significantly or that the model is any closer to timing out.
However, I was unable to install the BLRMS blocks in C1PEM, as when I tried to compile the model with BLRMS for these extra 24 channels, I got a compilation error saying that I have exceeded the maximum allowed 499 testpoints per channel. Is there any workaround to this? It would be possible to create a custom BLRMS block that doesn't have all those testpoints, maybe this is the way to go? Especially if we want to install these channels for all our SOS optics, and also replace the current Seismic BLRMS with this scheme for consistency?
GV edit: I have implemented this scheme - after backing up the original BLRMS_2k part, I made a new one with no testpoints and only EPICS readouts. Doing so allowed me to recompile c1pem without any issues; the CPU time seems to have gone up by 3us, from ~55us to 58us. So the BLRMS data record is only available at 16Hz, since there are no DQ channels in the BLRMS block - do we want these in any case? Let's see how this does over the weekend...
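For reference, the quantity a BLRMS channel tracks is the RMS of the signal restricted to a frequency band. The sketch below computes it offline for one block of samples by summing DFT power over the band. This is a swapped-in technique chosen for clarity: the real-time BLRMS part uses filters running sample by sample, not a DFT, and `band_limited_rms` is a hypothetical name.

```python
import cmath
import math


def band_limited_rms(samples, fs, f_lo, f_hi):
    """RMS of `samples` (sampled at `fs` Hz) restricted to the band [f_lo, f_hi] Hz.

    Uses a direct DFT over the positive-frequency bins that fall in the band.
    DC and the Nyquist bin are excluded for simplicity.
    """
    n = len(samples)
    power = 0.0
    for k in range(1, n // 2):
        f = k * fs / n
        if f_lo <= f <= f_hi:
            xk = sum(
                x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(samples)
            )
            # Factor of 2 accounts for the matching negative-frequency bin.
            power += 2.0 * abs(xk) ** 2 / n ** 2
    return math.sqrt(power)


if __name__ == "__main__":
    fs = 256
    # 1 s of a 10 Hz sine with amplitude 2; its RMS is 2/sqrt(2).
    x = [2.0 * math.sin(2 * math.pi * 10 * i / fs) for i in range(fs)]
    print(band_limited_rms(x, fs, 5, 15))
    print(band_limited_rms(x, fs, 20, 30))
```

A glitch in a PD readout shows up as a jump in the band that contains its energy, which is why these channels are handy for spotting when glitches occur.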
If this problem started before ~4pm on Friday then it's probably unrelated, since I didn't start any of these tests until after that. If unexplained problems persist then we can try shutting off the fb1 daqd and see if that helps.
I just aborted the fb1 test and reverted everything to the nominal configuration. Everything looks to be operating nominally. Front ends are mostly green except for c1rfm and c1asx which are currently not being acquired by the DAQ, and an unknown IPC error with c1daf. Please let me know if any unusual problems are encountered.
The behavior of daqd on fb1 with the latest release (3.2.1) was not improved. After turning on the full pipe it was back to crashing every 10 minutes or so when the full and second trend frames were being written out. lame. back to the drawing board...
A novice was learning at the feet of Master Daqd. At the end of the lesson he looked through his notes and said, “Master, I have a few questions. May I ask them?”
Master Daqd nodded.
"Do we record minute trends of our data?"
"Yes, we record raw minute trends in /frames/trend/minute_raw"
"I see. Do we back up minute trends?"
"Yes, we back up all frames present in /frames/trend/minute"
"Wait, this means we are not recording our current trends! What is the reason for the existence of separate minute and minute_raw trends?"
“The knowledge you seek can be answered only by the gods.”
"Can we resume recording the minute trends?"
Master Daqd nodded, turned, and threw himself off the railing, falling to his death on the rocks below.
Upon seeing this, the novice was enlightened. He proceeded to investigate how to convert raw minute trends to minute trends so that historical records could be preserved, and precisely when Master Daqd started throwing himself off the mountain when asked to record minute trends.
Someone installed "Debian" on allegra. Why? Dataviewer doesn't work on there. Is there some advantage to making this thing have a different OS than the others? Any objections to going back to Ubuntu12?
My elog negligence punchcard is getting pretty full... It's pretty much for the same reason as using Debian for optimus; much of the workstation software is getting packaged for Debian, which could offload our need for setting things up in a custom 40m way. Hacking the debian-focused software.ligo.org repos into Ubuntu has caused me headaches in the past. Allegra wasn't being used often, so I figured it was a good test bed for trying things out.
The dataviewer issue was dataviewer's inability to pull the `fb` out of `fb:8088` in the NDSSERVER env variable. I made a quick fix for it in the dataviewer launching script, but there is probably a better way to do it.
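The host/port split that dataviewer needs can be sketched as below. This is only an illustration of the parsing step, not the fix that went into the launching script. Two things here are assumptions: that the variable may hold a comma-separated list of servers (in which case the first is used), and that 8088 is a sensible default port when none is given. `parse_ndsserver` is a hypothetical name.

```python
import os


def parse_ndsserver(value, default_port=8088):
    """Split an NDSSERVER-style string such as 'fb:8088' into (host, port).

    If several comma-separated servers are listed, only the first is used.
    """
    first = value.split(",")[0].strip()
    host, sep, port = first.partition(":")
    if not host:
        raise ValueError(f"no host found in NDSSERVER value {value!r}")
    return host, (int(port) if sep and port else default_port)


if __name__ == "__main__":
    # Falls back to the value quoted in this entry if the variable is unset.
    host, port = parse_ndsserver(os.environ.get("NDSSERVER", "fb:8088"))
    print(host, port)
```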
I made a crude sketch for how Lydia and I envision the connector situation on the back of the vme crates to be solved. Essentially the side panels of each crate extend about 2" (52 mm) beyond the edge of the DIN connectors. This is plenty of space for a simple PCB board. The connector of choice is D-Sub. We can split the 64 used pins into 2x 37 D-Sub OR (2x25 pin + 1x15pin). The former has fewer cables, but a few excess unused leads. A quick google search showed me that it is much cheaper to get twisted pair cables for 15 and 25 pin D-Subs. From what I remember, the used pins on the DIN connectors are concentrated on the low numbers end and the high numbers end, so might not need the 'middle' connector in many cases if we decide to break it up into three. I have to check this with Lydia though.
The D-Sub connectors would be panel mounted, for which we need a narrow panel piece with dsub cutouts. We can run horizontal struts across the vme crate from side panel to side panel. This way the force upon cable (dis)connection is mostly on the panel which is attached to the struts which are attached to the crate. This will also prevent gravitational sag or cable strain from pulling on the DIN connection, and we can use twisted pair cables with backshell, screws, and strain reliefs.
I was looking into getting started with the PCB when Altium complained that the license has expired and needs to be renewed. This is a relatively simple board layout so some free software out there is probably enough.
and the song remains the same...
The version of SVN on these workstations is ahead of the one on the other workstations so now we can't do 'svn up' on any of the Ubuntu12 machines. On allegra and optimus I get this error:
controls@allegra|GWsummaries> svn up
svn: E180001: Unable to connect to a repository at URL 'file:///cvs/cds/caltech/svn/trunk/GWsummaries'
svn: E180001: Unable to open an ra_local session to URL
svn: E180001: Unable to open repository 'file:///cvs/cds/caltech/svn/trunk/GWsummaries'
I'm not sure if it's possible to downgrade our chans repo back to the old one, but I highly recommend that no one do 'svn upgrade' in any of our repos until we remove all of the Debian installs in the 40m lab or hire a full-time sysadmin.
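One way to tell which working copy format a checkout is in, without running any svn command on it, is to look inside its `.svn` directory. The sketch below rests on one assumption about Subversion's on-disk layout: clients from 1.7 onward keep a `wc.db` file in the top-level `.svn` directory, while older working copies have only an `entries` file there. `svn_wc_generation` is a hypothetical helper.

```python
from pathlib import Path


def svn_wc_generation(path):
    """Classify the working copy rooted at `path` as 'none', 'pre-1.7', '1.7+' or 'unknown'.

    Call this on the top-level checkout directory: 1.7+ working copies keep a
    single .svn directory at the root rather than one per subdirectory.
    """
    svn_dir = Path(path) / ".svn"
    if not svn_dir.is_dir():
        return "none"
    if (svn_dir / "wc.db").is_file():
        return "1.7+"
    if (svn_dir / "entries").is_file():
        return "pre-1.7"
    return "unknown"


if __name__ == "__main__":
    print(svn_wc_generation("."))
```

A result of "1.7+" on a shared directory would mean the older clients on the Ubuntu12 machines can no longer use that checkout.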
More testing of fb1 today. DAQ DOWN UNTIL FURTHER NOTICE.
Testing Wednesday did not resolve anything, but Jonathan Hanks is helping.
I was able to bring back svn 1.6 formatting to /cvs/cds/caltech/chans by doing the following on nodus:
svn co https://nodus.ligo.caltech.edu:30889/svn/trunk/chans ./
rm -rf ../chans/.svn
mv ./.svn ../chans/
Note that I used the http address for the repository. The svn repository doesn't live at file:///cvs/cds/caltech/svn anymore; all of our checkouts (e.g. in the scripts directory) use http to get the one true repo location, regardless of where it lives on nodus' filesystem. (I suppose we could also use https://nodus.martian:30889/svn to stick to the local network, but I don't think we're that limited by the caltech network speed)
Presumably, at some point we will want to introduce a newer operating system into the 40m, as ubuntu 12.04 hits end-of-life in April 2017. Ubuntu 16.04 includes svn 1.8, so we'll also hit this issue if we choose that OS.
Aside from the svn issues, this directory (/cvs/cds/caltech/chans) only contains pre-2010 channels. Filters and DAQ ini files currently live in /opt/rtcds/caltech/c1/chans, which is not under version control. It's also not clear to me why summary page configurations should be kept in this /cvs/cds place.
True - it's an issue. Koji and I are updating zita into Ubuntu16 LTS. If it looks like it's OK with various tools we'll swap over the others into it. Until then I figure we're best off turning allegra back into Ubuntu12 to avoid a repeat of this kind of conflict. Once the workstations in the LLO control room are running smoothly on a new OS for a year, we can transfer into that. I don't think any of us wants to be the CDS beta tester for DV or DTT.
Just to be clear, since there seems to be some confusion, the SVN issue has nothing to do with Debian vs. Ubuntu. SVN made non-backwards compatible changes to their working copy data format that breaks newer checkouts with older clients. You will run into the exact same problem with newer Ubuntu versions.
I recommend the 40m start moving towards the reference operating systems (Debian 8 or SL7) as that's where CDS is moving. By moving to newer Ubuntu versions you're moving away from CDS support, not towards it.
No, not confused on that point. We just will not be testing OS versions at the 40m or running multiple OS's on our workstations. As I've said before, we will only move to so-called 'reference' systems once they've been in use for a long time.
Ubuntu16 is not to my knowledge used for any CDS system anywhere. I'm not sure how you expect to have better support for that. There are no pre-compiled packages of any kind available for Ubuntu16. Good luck, you big smelly doofuses. Nyah, nyah, nyah.
Had to reboot c1psl, c1susaux, c1auxex, c1auxey and c1iscaux today. PMC has been relocked. ITMX didn't get stuck. According to this thread, there have been two instances in the last 10 days in which c1psl and c1susaux have failed. Since we seem to be doing this often lately, I've made a little script that uses the netcat utility to check which slow machines respond to telnet, it is located at /opt/rtcds/caltech/c1/scripts/cds/testSlowMachines.bash.
The script can be executed by ./testSlowMachines.bash.
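The check itself amounts to asking whether each host accepts a TCP connection on the telnet port. The sketch below shows the same idea in Python. It is not the contents of testSlowMachines.bash (which uses netcat); the host list is simply the slow machines named in these entries, and the function names and the 2 s timeout are hypothetical choices.

```python
import socket

# Slow machines mentioned in these entries; adjust as needed.
SLOW_MACHINES = ["c1psl", "c1susaux", "c1auxex", "c1auxey", "c1iscaux", "c1iool0", "c1aux"]
TELNET_PORT = 23


def accepts_tcp(host, port=TELNET_PORT, timeout=2.0):
    """Return True if `host` accepts a TCP connection on `port` within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and failed name lookups.
        return False


def check_all(hosts, port=TELNET_PORT, timeout=2.0):
    """Return {host: True/False} for every host in `hosts`."""
    return {host: accepts_tcp(host, port, timeout) for host in hosts}
```

Calling `check_all(SLOW_MACHINES)` from a control room machine would return a dictionary in which any host marked False is a candidate for a reboot. A host can still answer ping while failing this check, which is the typical symptom described in these entries.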
After fighting with Altium for what seems like an eternity I have finished putting my vision of the vme crate backplane adapter board into an electronic format. It is dimensioned to fill the back space of the crate exactly. The connectors are panel mount and the PCB attaches to the connectors with screws, such that the whole thing will be mechanically much more stable than the current configuration. A mounting bracket will attach to horizontal struts that need to be installed in the crates, mechanical drawings to follow.
There is no internet connectivity on any of the control room machines.
I have been trying to debug by tracing the cabling situation in the rack in the office area, and will update if/when this problem has been resolved. I had last come into the lab on Saturday and there was no problem then. The 40m wireless network servicing the office area seems to work fine.
Koji diagnosed that the NAT router was to blame for this problem. I simply power cycled this router, and now the connectivity has been restored.
It was possible to log into nodus and then to pianosa - and it was also possible to log into the various control room machines once logged into nodus. However, the outward packets seemed to not get transmitted. Anyways, power cycling the NAT Router unit seems to have done the job.
Debian doesn't like EPICS. Or our XY plots of beam spots...Sad!
When I came in this morning, Steve had re-locked the PMC and IMC - but I could see a ~1Hz intensity fluctuation on the PMC REFL video monitor. I unlocked the PMC and tried to re-lock it, but couldn't using the usual prescription of turning the servo gain down and moving the DC bias slider around. I checked the status of the slow machines - all were responding to pings and could be telnet'ed into, so that didn't seem to be the problem. In the past, this sort of behaviour was characteristic of the infamous "sticky slider" problem - so I simply burt-restored c1psl using a snapshot from 29 March, after which I could easily re-lock the PMC. The transmitted light level looked normal on the scope on the PSL table, and the PMC REFL video monitor also looks normal now.
The MCautolocker had stalled - there were no additional lines to the logfile after 12:17pm (~20mins ago). Normally, it suffices to ssh into megatron and run sudo initctl restart MCautolocker - but it seems that there was no running initctl instance of this, so I had to run sudo initctl start MCautolocker. The FSS Slow control initctl process also seemed to have been terminated, so I ran sudo initctl start FSSslowPy.
I rebooted megatron around 12:20 today. It had dozens of stalled medm processes (some of them there since February!). I couldn't kill them without them coming back like zombies, so I did sudo reboot.
I did an 'svn update' in userapps/cds/ which pulled in some changes from the sites as well as various CDS utilities in common/ and utilities/
This was to get Keith Thorne's get_data.m and get_data2.m scripts which I tested and they seem to be able to get data. No success with getting minute trend yet, but that may be a user error.
Update Monday 15-May: Our version of NDS client is 0.10 and we need to have 0.14 for this new method to work. Ubuntu12 lscsoft repo doesn't have newer nds client so we'll have to upgrade some OS.
After ~3months without any problems on the slow machine front, I had to reboot c1psl, c1susaux and c1iscaux today. The control room StripTool traces were not being displayed for all the PSL channels so I ran testSlowMachines.bash to check the status of the slow machines, which indicated that these three slow machines were dead. After rebooting the slow machines, I had to burt-restore the c1psl snapshot as usual to get the PMC to lock. Now, both PMC and IMC are locked. I also had to restart the StripTool traces (using scripts/general/startStrip.sh) to get the unresponsive traces back online.
Steve tells me that we probably have to do a reboot of the vacuum slow machines sometime soon too, as the MEDM screen for the Vacuum indicator channels are unresponsive.
Steve alerted me that the IMC wouldn't lock. Reboots for c1susaux, c1iool0 today. I tried using the reset button instead of keying the crates. This worked for c1iool0, but not for c1susaux. So I had to key the latter crate. The machine took a good 5-10 minutes before coming back up, but eventually it did. Now IMC locks fine.
Reboots for c1susaux, c1iscaux, c1auxex today. I took this opportunity to squish the Sat. Box. Cabling for MC2 (both on the Sat box end and also the vacuum feedthrough) as some work has been recently ongoing there; maybe something got accidentally jiggled during the process and was causing MC2 alignment to jump around.
Relocked PMC to offload some of the DC offset, and re-aligned IMC after c1susaux reboot. PMC and IMC transmission back to nominal levels now. Let's see if MC2 is better behaved after this sat. box. voodoo.
Interestingly, since Feb 6, there were no slow machine reboots for almost 3 months, while there have been three reboots in the last three weeks. Not sure what (if anything) to make of that.
Reboots for c1psl, c1iool0, c1iscaux today. MC autolocker log was complaining that the C1:IOO-MC_AUTOLOCK_BEAT EPICS channel did not exist, and running the usual slow machine check script revealed that these three machines required reboots. PMC was relocked, IMC Autolocker was restarted on Megatron and everything seems fine now.
Reboots for c1susaux, c1iscaux today.
After the catastrophic fb disk failure last week we lost essentially the entire front end system (not any of the userapp code, but the front end boot server, operating system, and DAQ). The fb disk was entirely unrecoverable, so we've been trying to rebuild everything from the bits and pieces lying around, and some disks that Keith Thorne sent from LLO. We're trying to get the front ends working first, and will work on recovering daqd after.
Luckily, fb1, which was being configured as an fb replacement, is mostly fully configured, including having a copy of the front end diskless root image. We set up fb1 as the new boot server, and were able to get front ends booting again. Unfortunately, we've been having trouble running and building models, so something is still amiss. We've been taking a three-pronged approach to getting the front ends running:
It seems that in all cases we need to rebuild the dolphin drivers from source.
To clarify, we're able to boot the x1boot image with the existing 2.6.25 kernel that we have from fb. The issue with the root.x1boot image is not the kernel version but some of the other support libraries, such as dolphin.
I'll try to get the first two of those done tomorrow, although it's unclear what model updates we'll have to do to get things working with the newer RCG.
All suspensions are damped:
It should be possible at this point to do more recovery, like locking the MC.
Some details on the restore process:
The daqd is not yet running. This is the next task.
I have been taking copious notes and will fully document the restore process once complete.
c1ioo has been giving us a little bit of trouble. The c1ioo model kept crashing and taking down the whole c1ioo host. We found a red light on one of the ADCs (ADC1). We pulled the card and replaced it with a spare from the CDS cabinet. That seemed to fix the problem and c1ioo became more stable.
We've still been seeing a lot of glitching in c1ioo, though, with CPU cycle times frequently (every couple of seconds) running above threshold for all models, up to 200 us. I tried unloading every kernel module I could and shutting down every non-critical process, but nothing seemed to help.
We eventually tried stopping the c1ioo model altogether and that seemed to help quite a bit, dropping the long cycle rate down to something like one every 30 seconds or so. Not sure what that means. We should look into the BIOS again, to see if there could be something interacting with the newer kernel.
So currently the c1ioo model is not running (which is why it's all white in the CDS overview snapshot above). The fact that c1ioo is not running and the remaining models are still occasionally glitching is also causing various IPC errors on auxiliary models (see c1mcs, c1rfm, c1ass, c1asx).
the new RCG tries to do more checks on custom c code, but it seems to be having trouble finding our custom "ccodeio.h" files that live with the c definitions in USERAPPS/*/common/src/. Unclear why yet. This is causing the RCG to spit out warnings like the following:
Cannot verify the number of ins/outs for C function BLRMS.
File is /opt/rtcds/userapps/release/cds/c1/src/BLRMSFILTER.c
Please add file and function to CDS_SRC or CDS_IFO_SRC ccodeio.h file.
These are just warnings and will not prevent the model from compiling or running. We'll figure out what the problem is to make these go away, but they can be ignored for the time being.
Probably the worst problem we're facing right now is an instability that will occasionally, but not always, cause the entire front end host to freeze up upon unloading an RTS kernel module. This is a known issue with the newer linux kernels (we're using kernel version 3.2.35), and is being looked into.
This is particularly annoying with the machines on the dolphin network, since if one of the dolphin hosts goes down it manages to crash all the models reading from the dolphin network. Since half the time they can't be cleanly restarted, this tends to cause a boot fest with c1sus, c1lsc, and c1ioo. If this happens, just restart those machines, wait till they've all fully booted, then restart all the models on all hosts with "rtcds start all".
"rtcds start all"
This morning, all the c1iscex models were dead. Attachment #1 shows the state of the cds overview screen when I came in. The machine itself was ssh-able, so I just restarted all the models and they came back online without fuss.