I installed awgstream-2.16.14 in /ligo/apps/ubuntu12. As with all the ubuntu12 "packages", you need to source the ubuntu12 ligoapps environment script:
controls@pianosa|~ > . /ligo/apps/ubuntu12/ligoapps-user-env.sh
controls@pianosa|~ > which awgstream
I tested it on the SRM LSC filter bank. In one terminal I opened a camonitor on C1:SUS-SRM_LSC_OUTMON. In another terminal I ran:
controls@pianosa|~ > seq 0 .1 16384 | awgstream C1:SUS-SRM_LSC_EXC 16384 -
Channel = C1:SUS-SRM_LSC_EXC
File = -
Scale = 1.000000
Start = 1092790384.000000
The camonitor output was:
controls@pianosa|~ > camonitor C1:SUS-SRM_LSC_OUTMON
C1:SUS-SRM_LSC_OUTMON 2014-08-22 17:44:50.997418 0
C1:SUS-SRM_LSC_OUTMON 2014-08-22 17:52:49.155525 218.8
C1:SUS-SRM_LSC_OUTMON 2014-08-22 17:52:49.393404 628.4
C1:SUS-SRM_LSC_OUTMON 2014-08-22 17:52:49.629822 935.6
C1:SUS-SRM_LSC_OUTMON 2014-08-22 17:52:58.210810 15066.8
C1:SUS-SRM_LSC_OUTMON 2014-08-22 17:52:58.489501 15476.4
C1:SUS-SRM_LSC_OUTMON 2014-08-22 17:52:58.747095 15886
C1:SUS-SRM_LSC_OUTMON 2014-08-22 17:52:59.011415 0
In other words, it seems to work.
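For future reference, awgstream takes a channel name, a sample rate, and a file of samples (one per line; "-" means stdin). A sketch of streaming a short sine instead of a ramp; the frequency, amplitude, and duration here are made-up examples:
# 5 seconds of a 7.3 Hz sine at 100 counts peak, at the 16384 Hz model rate:
seq 0 81919 | awk '{print 100*sin(2*3.14159265*7.3*$1/16384)}' | awgstream C1:SUS-SRM_LSC_EXC 16384 -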
Today OTTAVIA's network connection was intermittent.
Then in the evening OTTAVIA lost it completely. I tried jiggling the cables to recover it, but in vain.
We wonder if the network card (on-board one) has an issue.
I would also suspect IP conflicts; I had temporarily put the iMac on the Ottavia IP wire a few weeks ago. Hopefully it's not back on there.
I checked chiara's tables, and all seemed fine. I swapped the black ethernet cable labelled "allegra," which seemed a bit fragile, for the teal one that may have been chiara's old ethernet cable. It's back on the network now; hopefully it lasts.
After the Great Computer Meltdown of 2014, we forgot about poor c0rga, which is why the RGA hasn't been recording scans for the past several months (as Steve noted in elog 10548).
Q helped me remember how to fix it. We added 3 lines to its /etc/fstab file, so that it knows to mount from Chiara and not Linux1. We changed the resolv.conf file, and Q made some symlinks.
Steve and I ran ..../scripts/RGA/RGAset.py on c0rga to set up the RGA's settings after the power outage. We're checking to make sure that the RGA will run right now, then we'll set it back to the usual daily 4am run via cron.
EDIT, JCD: Ran ..../scripts/RGA/RGAlogger.py, saw that it works and logs data again. Also, c0rga had a slightly off time, so I ran sudo ntpdate -b -s -u pool.ntp.org, and that fixed it.
In all of the fstabs, we're using chiara's IP address instead of its hostname, so that if the nameserver isn't working, we can still get the NFS mounts.
On control room computers, we mount the NFS through /etc/fstab having lines like:
192.168.113.104:/home/cds /cvs/cds nfs rw,bg 0 0
fb:/frames /frames nfs ro,bg 0 0
Then, things like /cvs/cds/foo are locally symlinked to /opt/foo.
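(A sketch of doing this by hand on a workstation; /opt/foo is just a placeholder name:)
sudo mount -a                       # mount everything listed in /etc/fstab that isn't mounted yet
sudo ln -s /cvs/cds/foo /opt/foo    # make /opt/foo point into the NFS-mounted /cvs/cds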
For the diskless machines, we edited the files in /diskless/root. On FB, /diskless/root/etc/fstab becomes
master:/diskless/root / nfs sync,hard,intr,rw,nolock,rsize=8192,wsize=8192 0 0
master:/usr /usr nfs sync,hard,intr,ro,nolock,rsize=8192,wsize=8192 0 0
master:/home /home nfs sync,hard,intr,rw,nolock,rsize=8192,wsize=8192 0 0
none /proc proc defaults 0 0
none /var/log tmpfs size=100m,rw 0 0
none /var/lib/init.d tmpfs size=100m,rw 0 0
none /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620 0 0
none /sys sysfs defaults 0 0
master:/opt /opt nfs async,hard,intr,rw,nolock 0 0
192.168.113.104:/home/cds/rtcds /opt/rtcds nfs nolock 0 0
192.168.113.104:/home/cds/rtapps /opt/rtapps nfs nolock 0 0
("master" is defined in /diskless/root/etc/hosts to be 192.168.113.202, which is fb's IP)
and /diskless/root/etc/resolv.conf becomes:
nameserver 192.168.113.104 #Chiara
I have brought back c1auxex and c1auxey. Hopefully this elog will have some more details to add to Rana's elog 10015, so that in the end, we have the whole process documented.
The old Dell computer was already in a Minicom session, so I didn't have to start that up - hopefully it's just as easy as opening the program.
I plugged the DB9-RJ45 cable into the top of the RJ45 jacks on the computers. Since the aux end station computers hadn't had their bootChanges done yet, the prompt was "VxWorks Boot" (or something like that). For a computer that was already configured, for example the psl machine, the prompt was "c1psl", the name of the machine. So, the indication that work needs to be done is either that you get the Boot prompt, or that the computer hangs while trying to load the operating system (since the OS isn't where the computer expects it to be). If the computer is hanging, key the crate again to power cycle it. When it gets to the countdown that says something like "press any key to enter manual boot", press a key. This will get you to the "VxWorks Boot" prompt.
Once you have this prompt, press "?" to get the boot help menu. Press "p" to print the current boot parameters (the same list of things that you see with the bootChange command when you telnet in). Press "c" to go line-by-line through the parameters, with the option to change each one. I discovered that you can just type what you want the parameter to be next to the old value, and that will change it (e.g., at "host name : linux1" typing "chiara" will change the host name from the old value linux1 to the new value chiara).
After changing the appropriate parameters (as with all the other slow computers, just the [host name] and the [host inet] parameters needed changing), key the crate one more time and let it boot. It should boot successfully, and when it has finished and given you the name for the prompt (ex. c1auxex), you can just pull out the RJ45 end of the cable from the computer, and move on to the next one.
Koji, Jenne and Steve
Preparation to reboot:
1. Closed VA6 and V5, and disconnected the cables to the valves (closed all annuli).
2. Closed V1, disconnected it, and stopped the Maglev rotation.
3. Closed V4 and disconnected its cable.
See Atm1. This setup ensured that there could be no accidental valve switching to vent the vacuum envelope if reboot chaos struck. [moving = disconnected]
4. RESET c1Vac1 and c1Vac2, one by one and together. They both went at once. We did NOT power cycle them.
Jenne entered the new "carma" words on the old Dell laptop and checked that the answers were good. The reboot was done.
Note: the c1Vac1 green RUN indicator LED is yellow. It is fine as yellow.
5. Checked and TOGGLED the valve positions to their correct values. (We did not correct the small turbo pumps' monitor positions, but they were alive.)
6. V4 was reconnected and opened. The Maglev was started.
7. The V1 cable was reconnected, and the valve was opened at the full rotation speed of 560 Hz.
8. The V5 cable was reconnected and the valve opened; the VA6 cable was connected and the valve opened.
9. The Vacuum Normal valve configuration was reached.
Yesterday's reboot was prepared as stated above with one difference.
c1Vac1 and c1Vac2 were DOWN before the reset. The disconnected valves stayed closed (plus VC1). This saved us, so the main volume was not vented.
All others OPENED. The PR1 and PR2 roughing pumps turned ON. The ion pumps' gate valve opened too. The ion pumps did not matter either, because they were pumped down recently.
We'll have to rewrite the procedure for rebooting the vacuum system.
I think the daqd process isn't running on the frame builder.
I tried telnetting to fb's port 8087 (telnet fb 8087) and typing "shutdown", but so far that is hanging and hasn't returned a prompt to me in the last few minutes. Also, if I do a "ps -ef | grep daqd" in another terminal, it hangs.
I wasn't sure if this was an ntp problem (although that has been indicated in the past by 1 red block, not 2 red blocks and a white one), so I did "sudo /etc/init.d/ntp-client restart", but that didn't make any change. I also did an mxstream restart just in case, but that didn't help either.
I can ssh to the frame builder, but I can't do another telnet (the first one is still hung). I get an error "telnet: Unable to connect to remote host: Invalid argument"
Thoughts and suggestions are welcome!
CPU load seems extremely high. You need to reboot it, I think:
controls@fb /proc 0$ cat loadavg
36.85 30.52 22.66 1/163 19295
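(The first three fields of /proc/loadavg are the 1-, 5-, and 15-minute load averages, so the load was ~37 over the last minute. A trivial way to keep an eye on it:)
watch -n 5 cat /proc/loadavg    # reprint the load averages every 5 seconds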
This CPU load may have been me deleting some old frame files, to see if that would allow daqd to come back to life.
Daqd was segfaulting, and behaving in a manner similar to what is described here: (stack exchange link). However, I couldn't kill or revive daqd, so I rebooted the FB.
Things seem ok for now...
The daqd process on the frame builder looks like it is segfaulting again. It restarts itself every few minutes.
The symptoms remind me of elog 9530, but /frames is only 93% full, so the cause must be different.
Did anyone do anything to the fb today? If you did, please post an elog to help point us in a direction for diagnostics.
Q!!!! Can you please help? I looked at the log files, but they are kind of mysterious to me - I can't really tell the difference between a current (bad) log file and an old (presumably fine) log file. (I looked at 3 or 4 random, old log files, and they're all different in some ways, so I don't know which errors and warnings are real, and which are to be ignored).
I've been trying to figure out why daqd keeps crashing, but nothing is fixed yet.
I commented out the line in /etc/inittab that runs daqd automatically, so I could run it manually. Each time I run it (with ./daqd -c ./daqdrc while in c1/target/fb), it churns along fine for a little while, but eventually spits out something like:
[Thu Oct 16 11:43:54 2014] main profiler warning: 1 empty blocks in the buffer
[Thu Oct 16 11:43:55 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:43:56 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:43:57 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:43:58 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:43:59 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:44:00 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:44:01 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:44:02 2014] main profiler warning: 0 empty blocks in the buffer
GPS time jumped from 1097520250 to 1097520257
FATAL: exception not rethrown
I looked for time disagreements between the FB and the frontends, but they all seem fine. Running ntpdate only corrected things by 5ms. However, looking through /var/log/messages on FB, I found that ntp claims to have corrected the FB's time by ~111600 seconds (~31 hours) when I rebooted it on Monday.
Maybe this has something to do with the timing that the FB is getting? The FE IOPs seem happy with their sync status, but I'm not personally currently aware of how the FB timing is set up.
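(A sketch of one way to check for such disagreements across the frontends; the host list here is illustrative:)
for host in c1lsc c1sus c1ioo c1iscex c1iscey; do
    # -q queries the clock offset without actually setting the clock
    echo -n "$host: "; ssh controls@$host "ntpdate -q pool.ntp.org | tail -1"
done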
On Monday, Jamie suggested checking out the situation with FB's RAID. Searching the elog for "empty blocks in the buffer" also brought up posts that mentioned problems with the RAID.
I went to the JetStor RAID web interface at http://192.168.113.119, and it reports everything as healthy; no major errors in the log. Looking at the SMART status of a few of the drives shows nothing out of the ordinary. The RAID is not mounted in read-only mode either, as was the problem mentioned in previous elogs.
I very tentatively declare that this particular daqd crapfest is "resolved" after Jenne rebooted fb and daqd has been running for about 40 minutes now without crapping itself. Wee hoo.
I spent a while yesterday trying to figure out what could have been going on, but I couldn't find anything. I found an elog saying that a previous daqd crapfest was finally resolved only by rebooting fb after a similar situation: there had been an issue that was fixed, daqd was still crapping itself, we couldn't figure out why, so we just rebooted, and daqd started working again.
So, in summary, totally unclear what the issue was, or why a reboot solved it, but there you go.
Looks like I spoke too soon. daqd seems to be crapping itself again:
controls@fb /opt/rtcds/caltech/c1/target/fb 0$ ls -ltr logs/old/ | tail -n 20
-rw-r--r-- 1 4294967294 4294967294 11244 Oct 17 11:34 daqd.log.1413570846
-rw-r--r-- 1 4294967294 4294967294 11086 Oct 17 11:36 daqd.log.1413570988
-rw-r--r-- 1 4294967294 4294967294 11244 Oct 17 11:38 daqd.log.1413571087
-rw-r--r-- 1 4294967294 4294967294 13377 Oct 17 11:43 daqd.log.1413571386
-rw-r--r-- 1 4294967294 4294967294 11481 Oct 17 11:45 daqd.log.1413571519
-rw-r--r-- 1 4294967294 4294967294 11985 Oct 17 11:47 daqd.log.1413571655
-rw-r--r-- 1 4294967294 4294967294 13219 Oct 17 13:00 daqd.log.1413576037
-rw-r--r-- 1 4294967294 4294967294 11150 Oct 17 14:00 daqd.log.1413579614
-rw-r--r-- 1 4294967294 4294967294 5127 Oct 17 14:07 daqd.log.1413580231
-rw-r--r-- 1 4294967294 4294967294 11165 Oct 17 14:13 daqd.log.1413580397
-rw-r--r-- 1 4294967294 4294967294 5440 Oct 17 14:20 daqd.log.1413580845
-rw-r--r-- 1 4294967294 4294967294 11352 Oct 17 14:25 daqd.log.1413581103
-rw-r--r-- 1 4294967294 4294967294 11359 Oct 17 14:28 daqd.log.1413581311
-rw-r--r-- 1 4294967294 4294967294 11195 Oct 17 14:31 daqd.log.1413581470
-rw-r--r-- 1 4294967294 4294967294 10852 Oct 17 15:45 daqd.log.1413585932
-rw-r--r-- 1 4294967294 4294967294 12696 Oct 17 16:00 daqd.log.1413586831
-rw-r--r-- 1 4294967294 4294967294 11086 Oct 17 16:02 daqd.log.1413586924
-rw-r--r-- 1 4294967294 4294967294 11165 Oct 17 16:05 daqd.log.1413587101
-rw-r--r-- 1 4294967294 4294967294 11086 Oct 17 16:21 daqd.log.1413588108
-rw-r--r-- 1 4294967294 4294967294 11097 Oct 17 16:25 daqd.log.1413588301
controls@fb /opt/rtcds/caltech/c1/target/fb 0$
The times all indicate when the daqd log was rotated, which happens every time the process restarts. It doesn't seem to be happening as consistently now, though; it's been 30 minutes since the last one. I wonder if it's somehow correlated with actual interaction with the NDS process. Does some sort of data request cause it to crash?
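(Since each log's suffix is the GPS time at rotation, the seconds between restarts can be pulled straight out of the filenames; a quick sketch:)
ls logs/old/daqd.log.* | sed 's/.*\.//' | sort -n | awk 'NR>1 {print $1 - prev} {prev = $1}'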
Merging of threads.
ChrisW figured out that the problem with the frame builder seems to be that it's having to wait for disk access. He has tweaked some things, and life has been soooo much better for Q and me this evening! See Chris' elog at elog 10632.
In the last few hours, I've only had to reconnect Dataviewer to the framebuilder 2 or maybe 3 times, which is a significant improvement over having to do it every few minutes.
Also, Rossa is having trouble with DTT today, starting sometime around dinnertime. Ottavia and Pianosa can do DTT things, but Rossa keeps getting "test timed out".
I'm not sure why, but c1iscex did not want to do an mxstream restart. It would complain at me that "* ERROR: mx_stream is already stopping."
Koji suggested that I reboot the machine, so I did. I turned off the ETMX watchdog, and then did a remote reboot. Everything came back nicely, and the mx_stream process seems to be running.
PRM, SRM and the ENDs are kicking up. Computers are down. PMC slider is stuck at low voltage.
Still not able to resolve the issue.
Except for c1lsc, the models are not running on any of the FE machines. I can ssh into all the machines, but I could not restart the models on the FEs using the usual rtcds restart <modelname>.
Something happened around 4AM (inferring from the Striptool on the wall) and the models have not been running since then.
Everything seems to be back up and running.
The computers weren't such a big problem (or at least didn't seem to be). I turned off the watchdogs and remotely rebooted all of the computers (except for c1lsc, which Manasa had already gotten working). After this, I also ssh-ed to c1lsc and restarted all of the models, since half of them froze or something while the other computers were being power cycled.
However, this power cycling somehow completely screwed up the vertex suspensions. The MC suspensions were fine, and SRM was fine, but the ITMs, BS, and PRM were not damping. To get them to kind of damp rather than ring up, we had to flip the signs on the pos and pit gains. Also, we were a little suspicious of potential channel-hopping, since touching one optic was occasionally time-coincident with another optic ringing up. So, no hard evidence of channel hopping, but suspicions.
Anyhow, at some point I was concerned about the suspension slow computer, since the watchdogs weren't tripping even though the OSEM sensor RMSes were well over the thresholds, so I keyed that crate. After this, the watchdogs tripped as expected when we enabled damping while the RMS was higher than the threshold.
I eventually remotely rebooted c1sus again. This totally fixed everything. We put all of the local damping gains back to the values at which we found them (in particular, undoing our sign flips), and everything seems good again. I don't know what happened, but we're back online now.
Q notes that the bounce mode for at least ITMX (haven't checked the others) is rung up. We should check if it is starting to go down in a few hours.
Also, the FSS slow servo was not running; we restarted it on op340m.
[Jenne, Q, Diego]
I don't know why, but everything in EPICS-land froze for a few minutes just now. It also happened yesterday while I was watching, but I was bad and didn't elog it.
Anyhow, the arms stayed locked (on IR) the whole time it was frozen, so the fast things must have still been working. We didn't see anything funny going on on the frame builder, although that shouldn't have much to do with the EPICS service. The seismic rainbow on the wall went to zeros during the freeze, although the MC and PSL strip charts were still fine.
After a few minutes, while we were still trying to think of things to check, things went back to normal. We're going to just keep locking for now....
[Jamie, EricQ, Jenne, Diego]
This is something that we discussed late Friday afternoon, but none of us remembered to elog.
We have been noticing that EPICS seems to run pretty slowly, and in fact it froze twice last week for ~2 minutes or so (elog 10756).
On Friday, we plotted several traces in StripTool, such as C1:SUS-ETMY_QPD_SUM_OUTPUT, C1:SUS-ETMY_TRY_OUTPUT, and C1:LSC-TRY_OUTPUT, to see if channels carrying (supposedly) the same signal were seeing the same sample-holding. They were. The issue seems to be pretty widespread, over all of the fast front ends. However, the EPICS channels provided by the old analog computers do not seem to have this issue.
So, Jamie posits that perhaps we have a network switch somewhere that is connected to all of the fast front end computers, but not the old slow machines, and that this switch is starting to fail.
My understanding is that the boys are on top of this, and are going to figure it out and fix it.
The EPICS freeze that we had noticed a few weeks ago (and several times since) has happened again, but this time it has not come back on its own. It has been down for almost an hour so far.
So far, we have reset the Martian network's switch that is in the rack by the printer. We have also power cycled the NAT router. We have moved the NAT router from the old GC network switch to the new faster switch, and reset the Martian network's switch again after that.
We have reset the network switch that is in 1X6.
We have reset what we think is the DAQ network switch at the very top of 1X7.
So far, nothing is working. EPICS is still frozen, we can't ping any computers from the control room, and new terminal windows won't give you a prompt (so perhaps we aren't able to mount the NFS, which is required for the bashrc).
We need help please!
EricQ suggested it may be some NFS-related issue: if something, maybe some computer in the control room, is asking too much of chiara, then all the other machines accessing chiara will slow down, and this could escalate and lead to the Big Bad Freeze. As a matter of fact, chiara's dmesg showed its eth0 interface being brought up constantly, as if something were making it go down repeatedly. Anyhow, after the shutdown of all the computers in the control room, a reboot of chiara, megatron, and the fb was performed.
Then I rebooted pianosa, and most of the issues seem gone so far; I had to "mxstream restart" all the frontends from medm, and every one of them but c1scy seems to behave properly. I will now bring the other machines back to life and see what happens next.
Everything seems reasonably back to normal:
Steve and I switched chiara over to the UPS we bought for it, after ensuring the vacuum system was in a safe state. Everything went without a hitch.
Also, Diego and I have been working on getting some of the new computers up and running. Zita (the striptool-projecting machine) has been replaced. One ThinkPad laptop is missing an HD and battery, but the other one is fine. Diego has been working on a Dell laptop, too. I was having problems editing the MAC address rules on the martian wifi router, but the working ThinkPad's MAC was already listed.
It turns out that, as the martian wifi router is quite old, it doesn't like Chrome; using Firefox worked like a charm, and now giada (the Dell laptop) is also on 40MARS.
Rana noted last week that TRX's value was stuck, not getting to the LSC from iscex. I tried restarting the individual models scx, lsc, and even scy (since scy had an extra red RFM light), to no avail. I then did sudo shutdown -r now on iscex, and when it came back, the problem was gone. I also then did a diag reset, which cleared all of the unusual red RFM lights.
Things seem fine now, ready to lock all the things.
I upgraded the GDS and ROOT installations in /ligo/apps/ubuntu12 on the control room workstations:
My cursory tests indicate that they seem to be working:
Now that the control room environment has become somewhat uniform at Ubuntu 12, I modified the /ligo/cdscfg/workstationrc.sh file to source the ubuntu12 configuration:
controls@nodus|apps > cat /ligo/cdscfg/workstationrc.sh
# CDS WORKSTATION ENVIRONMENT
This should make all the newer versions available everywhere on login.
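(The rest of the file's contents aren't shown above; presumably the added line sources the ubuntu12 setup script from earlier, something like the following. This is an assumption, not a copy of the actual file:)
. /ligo/apps/ubuntu12/ligoapps-user-env.sh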
We are working on trying out the UGF servos, and wanted to take loop measurements with and without the servo to prove that it is working as expected. However, it seems like new DTT is not following the envelopes that we are giving it.
If we uncheck the "user" box, then it uses the amplitude that is given on the excitation tab. But if we check "user" and select an envelope, the amplitude will always be whatever number is the first amplitude requested in the envelope. If we change the first amplitude in the envelope, DTT will use that number as the new amplitude, so it is reading the file, but it is not stepping through the whole envelope correctly.
Thoughts? Is this a bug in new DTT, or a pebkac issue?
Chiara threw another network hissy fit. dmesg was spammed with a bunch of rapidly appearing messages like:
eth0: link up
Some googling indicated that this error message, in conjunction with the very ethernet board and driver that Chiara had in use, could be solved by updating to an appropriate driver from the manufacturer.
In essence, I followed steps 1-7 from here: http://ubuntuforums.org/showthread.php?t=1661489
So far, so good. We'll keep an eye out to see how it works...
I was looking into the status of IPC communications in our realtime network, as Chris suggested that there may be more phase missing than I thought. However, the recent continual red indicators on a few of the models made it hard to tell whether the problems were real or not. Thus, I set out to fix what I could, and have achieved full green lights on the CDS screen.
The frontend models have been svn'd. The BLRMS block has not, since it's in a common cds space, and I am not sure what the status of its use at the sites is...
EDIT: Sleepy Eric doesn't understand loops. The conditions for this observation included active oplev loops. Thus, obviously, looking at the in-loop signal after the ASC signal joins the oplev signal will produce this kind of behavior.
After some talking with Rana, I set out on making an even better-er QPD loop. I made some progress on this, but a new mystery halted my progress.
I sought a more physical understanding of the plant TF I had measured. Earlier, I had assumed that the 4 Hz plant features I had measured for the QPD loops were coming from the oplev-modified pendulum response, but this isn't actually consistent with the loop algebra of the oplev servos. I had seen this feature in both the oplev and QPD error signals when pushing an excitation from the ASC-XARM_PIT (and so forth) FMs.
However, when exciting via the SUS-ETMX-OLPIT FMs (and so forth), this feature would not appear in either the QPD or oplev error signals. That's weird. The outputs of these two FMs should just be summed, right before the coil matrix.
I started looking at the TF from ASC-YARM_PIT_OUT to SUS-ETMY_TO_COIL_1_2, which should be a purely digital signal routing of unity, and saw it exhibit the phase shape at 4Hz that I had seen in earlier measurements. Here it is:
I am very puzzled by all of this. Needs more investigation.
Does netgpibdata/TFSR785 work at the 40m currently? I rsynced the netgpibdata directory to LHO this morning to do some measurements, but I had to modify a few lines in order to get it to call the SR785 functions properly. My version is attached.
Does netgpibdata/TFSR785 work at the 40m currently?
It does appear to work here. However, I've since supplanted TFSR785 and SPSR785 with SRmeasure, which has some simpler command line options for directly downloading a manually configured measurement. I've also set up a git repository for the GPIB scripts I've written, at https://github.com/e-q/netgpibdata, which could be easier than grabbing the whole 40m directory.
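(So, for example:)
git clone https://github.com/e-q/netgpibdata.git    # grab just the GPIB scripts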
So, I neglected to elog this yesterday, but we had another one of those EPICS freezes that only affects the slow channels coming from the fast computers. It lasted for about 5 minutes.
Right now, we're in the middle of another - it's been about 7 minutes so far.
Why are these happening again? I thought we had fixed whatever was the issue.
EDIT: And, it's over after a total of about 8 minutes.
Just now we had another EPICS freeze. The network was still up; i.e. I could ssh to chiara and fb, who both seemed to be working fine.
I could ping c1lsc successfully, but ssh just hung. fb's dmesg had some daqd segfault messages, so I telnet'ed to daqd and shut it down. Soon after, EPICS came back, but this is not necessarily because of the daqd restart...
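(For reference, that daqd shutdown can be done non-interactively too, assuming netcat is installed on the workstation:)
echo shutdown | nc fb 8087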
At about 10AM, the C1LSC frontend stopped reporting any EPICS information. The arms were locked at the time, and remained so for some hours, until I noticed the totally whited-out MEDM screens. The machine would respond to pings, but did not respond to ssh, so we had to manually reboot.
Soon thereafter, we had a global 15-minute EPICS freeze, and have been in a weird state ever since. EPICS has come back (and frozen again), but the fast frontends are still wonky, even when EPICS is not frozen. Intermittently, the status blinkers and GPS time EPICS values will freeze for multiple seconds at a time, sporadically updating. Looking at a StripTool trace of an IOP's GPS time value shows a line with smooth portions for about 30 seconds, about 2 minutes apart. Between these is totally jagged step-function behavior. C1LSC needed to be power cycled again; trying to restart the models is tough, because the EPICS slowdown makes it hard to hit the BURT button, as is needed for a model to start without crashing.
The DAQ network switch, and the martian switch inside, were power cycled, to little effect. I'm not sure how to diagnose network issues with the frontends. Using iperf, I am able to show hundreds of Mbit/s of bandwidth between the control room machines and the frontends, but their EPICS is still totally wonky.
What can we do???
The frontends have some paths NFS-mounted from fb, and fb is on the ragged edge of being I/O bound. I'd suggest moving those mounts to chiara. I tried increasing the number of NFS threads on fb (undoing the configuration change I'd previously made here), and it seems to help with EPICS smoothness -- although there are still occasional temporal anomalies in the time channels. The daqd flakiness (which was what led me to throttle NFS on fb in the first place) may now recur as well.
I've been able to get all models running. Most optics are damped, but I'm having trouble with the ITMs, BS and PRM.
I noticed some diagnostic bits in the c1sus IOP complaining about user application timing and FIFO under/overflow (the second and fourth squares next to the DACs on the GDS screen). Over in crackle-cymac land, I've seen this correlate with large excess DAC noise. After restarting all models, all but one of these is green again, and the optics are now all damped.
It seems there were some fishy BURT restores, as I found the TT control filters had their inputs and outputs switched off. Some ASS filters were found this way too. More undesired settings may still lurk in the mists...
The interferometer is now realigned, arms locked.
In an effort to ease the IO load on the framebuilder, I've cleaned up the DQ channels being written to disk. The biggest impact was seven 64kHz channels being written to disk, on ADC channels corresponding to microphones.
The frame files have gone from 75MB to 57MB.
I changed the suspension library block to acquire the SUS_[optic]_LSC_OUT channels at 16k for sensing matrix investigations. We could save the FB some load by disabling these and oplev channels in the mode cleaner optic suspensions.
I removed nonexistent PDs from c1cal, to try to speed it up from its constantly overflowing state. It's still pretty bad, but under 60 us most of the time.
I also cleaned out the unused IPCs for simulated plant stuff from c1scx and c1sus, to get rid of red blinkeys.
I have successfully backported the cdsRampMuxMatrix part for use in our RCG v2.5 system. This involved grabbing new files, merging changes, and hacking around missing features from RCG 2.9.
The added/changed files, with relative paths referred to /opt/rtcds/rtscore/release/src/, are:
[A means the file was added, M means the file was modified]
Most of the trouble came from the EPICS reporting of the live ramping value and ramping state, since this depended on some future RCG value masking function. I had to rewrite the C-code writing perl script to define and update these EPICS variables in a more old-school way.
This leaves us vulnerable to the fact that a user/program can directly write to the live matrix elements and ramping state, which would cause bad and unexpected behavior of the matrix.
So, when using a ramping matrix: NEVER WRITE to [MAT]_[N]_[M] as you would for a normal matrix. Use [MAT]_SETTING_[N]_[M] and trigger [MAT]_LOAD_MATRIX.
Similarly, [MAT]_RAMPING_[N]_[M] is off limits.
I tested the new part in the c1tst model. There are two EPICS input (TST-RAMP1_IN and TST-RAMP2_IN) that are the inputs to a 2x2 ramp matrix called TST-RAMP, and the outputs go to two testpoints (TST-RAMP1_OUT and TST-RAMP2_OUT) and two epics outputs (TST-RAMP1_OUTMON and TST-RAMP2_OUTMON). You can write something to the inputs from ezca or whatever, and use the C1TST_RAMP.adl medm screen in the c1tst directory to try it out. The buttons turn red when you've input a new matrix, yellow when a ramp is ongoing and green when the live value agrees with the setting.
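A sketch of exercising it from the command line with the stock EPICS caput/caget tools (channel names follow the c1tst example above; the exact ramp timing is whatever the part defaults to):
caput C1:TST-RAMP1_IN 1.0
caput C1:TST-RAMP2_IN 0.5
# Write the new matrix into the SETTING elements, never the live ones:
caput C1:TST-RAMP_SETTING_1_1 0.0
caput C1:TST-RAMP_SETTING_1_2 1.0
caput C1:TST-RAMP_SETTING_2_1 1.0
caput C1:TST-RAMP_SETTING_2_2 0.0
# Then trigger the ramp and watch the outputs settle:
caput C1:TST-RAMP_LOAD_MATRIX 1
caget C1:TST-RAMP1_OUTMON C1:TST-RAMP2_OUTMON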
At this time, I have not rebuilt any of our operational models in search of potential issues.
I have created backups of the files I modified, such that a file like feCodeGen.pl was renamed feCodeGen.40m.pl and left next to the modified file. I am open to more robust ways of doing the backup; since our RCG source is an svn checkout of the v2.5 branch (with local modifications, to boot), I suppose we don't want to commit there. Maybe we make a 40m branch? A separate repo?
We were getting too annoyed by the frequent stalls of mxstream. We'll update the RCG when the time comes (in the not-too-distant future).
But for now, we need automatic mxstream resetting.
I found there is such a script already.
So this script was registered in the crontab on megatron.
It is invoked every 5 minutes.
# Auto MXstream reset when it fails
0,5,10,15,20,25,30,35,40,45,50,55 * * * * /opt/rtcds/caltech/c1/scripts/cds/autoMX >> /opt/rtcds/caltech/c1/scripts/cds/autoMX.log
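(The same schedule could be written more compactly; the long form above is equivalent to:)
*/5 * * * * /opt/rtcds/caltech/c1/scripts/cds/autoMX >> /opt/rtcds/caltech/c1/scripts/cds/autoMX.log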
Upgraded python on megatron. Added lines to the crontab to run autoMX.py, and edited the crontab to set a PYTHONPATH so that it can run .py stuff.
But autoMX.py is still not working from inside cron, only from the command line.
Since python from crontab seemed intractable, I replaced autoMX.py with a soft link that points at autoMX.sh.
This is a simple bash script that looks at the LSC FB status (C1:DAQ-DC0_C1LSC_STATUS) and runs the mxstream restart script if it's non-zero.
So far it's run 5 times successfully. I guess this is good enough for now. Later on, someone ought to make it loop over the other FEs, but this ought to catch 99% of the FB issues.
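(A hedged sketch of what such a script could look like; the actual autoMX.sh may differ, and the restart script's path/name here is a guess:)
#!/bin/bash
# If the C1LSC frame builder status is anything but 0, restart mxstream.
STATUS=$(caget -t C1:DAQ-DC0_C1LSC_STATUS)
if [ "$STATUS" != "0" ]; then
    echo "$(date): C1LSC FB status = $STATUS; restarting mxstream"
    /opt/rtcds/caltech/c1/scripts/cds/restart_mxstreams   # hypothetical script name
fi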
I found the c1lsp and c1sup models no longer running on c1lsc (white blocks for their status lights on medm).
To fix this, I ssh'd into c1lsc. The c1lsc status did not show the c1lsp and c1sup models running on it.
I tried the usual rtcds restart <model name> for both, and that returned the error "Cannot start/stop model 'c1XXX' on host c1lsc".
I also tried rtcds restart all on c1lsc, but that has NOT brought the models back alive.
Does anyone know how I can fix this??
c1sup runs some of the suspension controls, so I am afraid that the drift and frequent unlocking of the arms we see might be related to this.
P.S. We might also want to add the FE status channels to the summary pages.
I just found out that the c1lsp and c1sup models no longer exist on the FE status medm screens. I am assuming some changes were done to the models as well.
Earlier today, I was looking at some of the old medm screens running on Donatella, which did not reflect this modification.
Did I miss any elogs about this or was this change not elogged??
was this change not elogged??
This is my sin.
Back in February (around the 25th) I modified c1sus.mdl, removing the simulated plant connections we weren't using from c1lsp and c1sup. This was included in the model's svn log, but not elogged.
The models don't start with the rtcds restart shortcut because I removed them from the c1lsc line in fb:/diskless/root/etc/rtsystab (i.e., c1lsc:/etc/rtsystab). There is a commented-out line in there that can be uncommented to restore them to the list of models c1lsc is allowed to run; see the sketch below.
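(For illustration, a hypothetical sketch of what that part of rtsystab might look like; each line lists a host followed by the models it is allowed to run, and the model list here is made up:)
c1lsc c1x04 c1lsc c1ass c1oaf c1cal
#c1lsc c1lsp c1sup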
However, I wouldn't suspect that the models not running should affect the suspension drift, since the connections from them to c1sus have been removed. If we still have trends from early February, we could look and see if the drift was happening before I made this change.
The CDSUTILS package has a feature whereby it substitutes in a C1, H1, or L1 prefix depending upon which site you are at. The idea is that this should make code portable between LLO and LHO.
Here at the 40m, we have no need for that, so it's better for us to be able to copy and paste channel names directly from MEDM or wherever without having to remove the "C1:" from all over the place.
The way to do this on the command line (in bash) is to type:
To make this easier on us, I have implemented this in our shared .bashrc so that it's always the case. This might break some scripts that have been adapted to use the weird CDSUTILS convention, so beware and fix appropriately.
This makes things act weird:
controls@pianosa|MC 1> z avg 1 "C1:LSC-TRY_OUT"
IFO environment variable not specified.