Yes - my rampdown.py script correctly ramps down the watchdog thresholds. This replaces the old rampdown.pl Perl script that Rob and Dave Barker wrote.
Unfortunately, cron doesn't correctly inherit the bashrc environment variables, so it's having trouble running.
On a positive note, I've resurrected the MEDM screenshot-taking cron job, so this webpage is now alive (mostly) and you can check screens remotely:
Added this to the megatron crontab and commented out the op340m crontab line. If this works for a while, we can retire our last Solaris machine.
For some reason, my email address is the one that megatron complains to when cron commands fail; since 11:15 PM last night, I've been getting emails that the rampdown.py line is failing, with the super-helpful message:
expr: syntax error
(expr typically throws this when one of its operands comes up empty, consistent with cron's bare environment failing to find the commands the script calls.)
Looking at the summary page trends from today, you can see that the MC transmission is pretty flat after I zeroed the MCWFS offsets. In addition, the transmission from both arms is also flat, indicating that our previous observation of long term drift in the Y arm transmission probably had more to do with bad Y-arm initial alignment than unbalanced ETMY coil-magnets.
Much like checking the N2 pressure, the coffee bean supply, the frame backups, etc., we should put MC WFS offset adjustment into our periodic checklist. It would be good to have a reminder system that pings us to check these items and waits for confirmation that we have done so.
Tried swapping cables at the Guralp interface box side. It seems that all of our seismic signal problems have to do with the GUR2 cable being flaky (not surprising, since it looks like it was patched with orange electrical tape!! rather than given proper mechanical strain relief).
After swapping the cables today, the GUR2 DAQ channels all look fine: i.e., GUR1 (the one at the Y end, now plugged into the GUR2 input) is fine, as is its cable and the GUR2 analog channels inside the interface box.
OTOH, the GUR1 DAQ channels (which now have GUR2 (EX) connected into them) are too small by a factor of ~1000. Seems like that end of the cable will need to be remade. Luckily, Jenne is still around this week and can point us to the pinout / instructions. Looks like there could be some shorting inside the backshell, so I've left it disconnected rather than risk damaging the seismometer. We should get a GUR1-style backshell to remake this cable. It's also possible that the end at the seismometer is bad; Steve was supposed to swap the screws on the granite-aluminum plate on Thursday, so I'll double check.
1) Checked the N2 pressures: the unregulated cylinder pressures are both around 1500 PSI. How long until they get to 1000?
2) The IMC has been flaky for a day or so; I don't know why. I moved the gains in the autolocker so that the input gain slider to the MC board is now 10 dB higher and the output slider 10 dB lower. This is updated in the mcdown and mcup scripts, both committed to SVN. The trend shows that the MC was wandering away after ~15 minutes of lock, so I suspected the WFS offsets. I ran the offsets script (after flipping the z servo signs and adding the 'C1:' prefix). So far powers are good and stable.
3) pianosa was unresponsive and I couldn't ssh to it. I powered it off and then it came back.
4) Noticed that DAQD is restarting once per hour on the hour. Why?
5) Many (but not all) EPICS readbacks are whiting out every few minutes. I remote-booted c1susaux, since it was one of the victims, but it didn't change the behavior.
6) The ETMX and ITMX have very different bounce-mode responses: should add to our vent to-do list. Double-checked that the bounce/roll bandstop is on and at the right frequency for the bounce mode. Increased the stopband attenuation from 40 to 50 dB to see if that helps.
7) op340m is still running! The only reason to keep it alive is its crontab:
07 * * * * /opt/rtcds/caltech/c1/burt/autoburt/burt.cron >> /opt/rtcds/caltech/c1/burt/burtcron.log
#46 * * * * /opt/rtcds/caltech/c1/scripts/general/scripto_cron /opt/rtcds/caltech/c1/scripts/PSL/FSS/FSSSlowServo > /cvs/cds/caltech/logs/scripts/FSSslow.cronlog 2>&1
#14,44 * * * * /cvs/cds/caltech/conlog/bin/check_conlogger_and_restart_if_dead
15,45 * * * * /opt/rtcds/caltech/c1/scripts/SUS/rampdown.pl > /dev/null 2>&1
#10 * * * * /opt/rtcds/caltech/c1/scripts/general/scripto_cron /opt/rtcds/caltech/c1/scripts/MC/autolockMCmain40m >/cvs/cds/caltech/logs/scripts/mclock.cronlog 2>&1
#27 * * * * /opt/rtcds/caltech/c1/scripts/general/scripto_cron /opt/rtcds/caltech/c1/scripts/PSL/FSS/RCthermalPID.pl >/cvs/cds/caltech/logs/scripts/RCthermalPID.cronlog 2>&1
00 0 * * * /var/scripts/ntp.sh > /dev/null 2>&1
#00 4 * * * /opt/rtcds/caltech/c1/scripts/RGA/RGAlogger.cron >> /cvs/cds/caltech/users/rward/RGA/RGAcron.out 2>&1
#00 6 * * * /cvs/cds/scripts/backupScripts.pl
00 7 * * * /opt/rtcds/caltech/c1/scripts/AutoUpdate/update_conlog.cron
00 8 * * * /opt/rtcds/caltech/c1/scripts/crontab/backupCrontab
Added a new script (scripts/SUS/rampdown.py) which decrements the watchdog thresholds every 30 minutes if needed. Added this to the megatron crontab and commented out the op340m crontab line. If this works for a while, we can retire our last Solaris machine.
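For the record, the decrement logic amounts to something like this sketch (assuming pyepics-style channel access; the channel name, target, and step size below are made up, and the real ones live in scripts/SUS/rampdown.py):

from epics import caget, caput  # pyepics; assumed available on megatron

# If a suspension watchdog threshold sits above its nominal value, step it
# down; cron invokes this every 30 minutes, so the ramp-down is gradual.
THRESHOLD_CHANNEL = 'C1:SUS-ETMX_PD_MAX_VAR'  # hypothetical channel name
NOMINAL = 150.0                               # made-up target value
STEP = 20.0                                   # made-up decrement per run

val = caget(THRESHOLD_CHANNEL)
if val is not None and val > NOMINAL:
    caput(THRESHOLD_CHANNEL, max(val - STEP, NOMINAL))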
8) To see if we could get rid of the wandering PCDRIVE noise, I looked into the NPRO temperatures: T_crystal = 30.89 C, T_diode1 = 21 C, T_diode2 = 22 C. I moved the crystal temp up to 33.0 C to see if it could make the noise more stable. Then I used the trimpots on the front of the controller to maximize the laser output at these temperatures; it was basically maximized already. Let's see if there's any qualitative difference after a week. I'm attaching the pinout for the DSUB25 diagnostics connector on the back of the box. Aidan is going to help us record this stuff with AcroMag tech so that we can see if there's any correlation with PCDRIVE. The shifts in FSS_SLOW coincident with the PCDRIVE noise correspond to ~100 MHz, so it seems like it could be NPRO related.
The CDSUTILS package has a feature where it substitutes in a C1 or H1 or L1 prefix depending upon what site you are at. The idea is that this should make code portable between LLO and LHO.
Here at the 40m we have no need for that, so it's better for us to be able to copy and paste channel names directly from MEDM or wherever without having to remove the "C1:" from all over the place.
The way to do this on the command line is (in bash) to type something like:
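(Assuming cdsutils takes its site prefix from the IFO environment variable, which is my reading and worth checking against the cdsutils docs, unsetting it should disable the substitution:)

unset IFO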
To make this easier on us, I have implemented this in our shared .bashrc so that it's always the case. This might break some scripts which have been adapted to use the weird CDSUTILS convention, so beware and fix as appropriate.
The Vacuum Operation Guide has been uploaded to the 40m wiki. This is an old master copy; it is not exact in terms of actual procedure, but it is still a good guide to the logic.
Rana has promised to watch the N2 supply and change the cylinder when it is empty. I will be at Hanford next week.
Today Steve and I tried to recenter the Guralps. The breakout box technique didn't work for us, so we just turned the leveling screws until we got the mass position outputs within +/-50 mV for all DoF as read out by the breakout box.
The attachment shows the 6 channels after our work. You can see that GUR2_X/Y still look deadish. I tried wiggling the cables at the interface box and powering on/off, but no luck. Next, we swap cables.
Tried to bring the weather station back to life, but no luck. The unit on the wall is alive, and so is the EPICS IOC (c1pem1), but there is apparently no communication between them. Telnetting into c1pem1, the error message repeating at the prompt is:
Weather Monitor Output: NO COMM
Might be related to the flaky connector situation that Liz and I found there a couple summers ago, but I tried jiggling and reseating that one with no luck. Looks like it stopped working around 8 PM on March 24, 2014. That's the same time as a ~30s power outage, so perhaps we just need some more power cycling? Tried hitting the reset button on the VME card for c1pem1, but didn't change anything.
Let's try power cycling that crate (which has c1pem1, c0daqawg, and some GPS receiver)...nope - no luck.
Also tried power cycling the weather box which is near the BS chamber on the wall. This didn't change the error message at the c1pem1 telnet prompt.
C.O.D.: a sugar napoleon is due to Steve. Item delivered: model CMG-SCU-0013, SN G9536.
Reward being offered for the safe return of this thing:
Still seems to be running without causing FB issues. One thought: we could look through the FB status channel trends and see if there is an excess of FB problems at 10 minutes after the hour, to check whether it's causing problems.
I also looked into our minute-trend situation. Looks like the files are compressed and have checksums enabled. The size changes sometimes, but it's roughly 35 MB per hour, so 840 MB per day.
According to the wiper.pl script, it's trying to keep the minute-trend directory below some fixed fraction of the total /frames disk. The comment in the script says 0.005%,
but I'm dubious, since that's only 13 TB * 5e-5 = ~650 MB, which would keep us for less than a day. Maybe the comment should read 0.5% instead.
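As a sanity check on those numbers (using the 13 TB disk size and 840 MB/day rate from above):

disk_bytes = 13e12         # /frames disk size
daily_bytes = 840e6        # minute-trend growth per day (35 MB/hr * 24)
for frac in (5e-5, 5e-3):  # 0.005% vs. 0.5%
    quota = disk_bytes * frac
    print('%g%% -> %.0f MB quota, %.1f days' % (frac * 100, quota / 1e6, quota / daily_bytes))
# 0.005% -> 650 MB, ~0.8 days of retention; 0.5% -> 65000 MB, ~77 days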
The rsync job that syncs our frames over to the cluster has been on a 20 MB/s bandwidth limit for a while now.
Dan Kozak has now set up a cronjob to do this at 10 min after the hour, every hour. Let's see how this goes.
You can find the script and its logfile name by doing 'crontab -l' on nodus.
Baking both CC1 gauges at 85 C for 60 hrs did not help.
The temperature has been increased to 125 C and the bake is being repeated.
The CC1 gauges are no longer reading. This is an attempt to clean them over the weekend at 85 C.
These brand-new gauges (model 10421002; SN 11823 vertical, SN 11837 horizontal) replaced the 11-year-old 421 (http://nsei.missouri.edu/manuals/hps-mks/421%20Cold%20Cathode%20Ionization%20Guage.pdf) on 09-06-2012 (http://nodus.ligo.caltech.edu:8080/40m/7441).
We have two cold cathode gauges at the pump spool and one signal cable to the controller: one CC1 in the horizontal position and one in the vertical position.
CC1-horizontal stopped reading, so I moved the cable over to CC1-vertical.
* Relocked the IMC. I guess it was stuck somewhere in the autolocker loop. I disabled the autolocker and locked it manually. The autolocker has been re-enabled and seems to be running just fine.
* The X arm has been having trouble staying locked. There seemed to be some amount of gain peaking. I reduced the gain from 0.007 to 0.006.
* I disabled the triggered BounceRG filter (FM8) in the X arm filter module. We already have a triggered Bounce filter (FM6) that takes care of the noise at the bounce/roll frequencies; FM8 was just adding too much gain at 16.5 Hz. Once this filter was disabled, the X arm lock became much more stable.
Also, the Y arm doesn't use FM8 for locking either.
was this change not elogged??
This is my sin.
Back in February (around the 25th) I modified c1sus.mdl, removing the simulated plant connections to c1lsp and c1sup that we weren't using. This was noted in the model's svn log, but not elogged.
The models don't start with the rtcds restart shortcut, because I removed them from the c1lsc line in FB:/diskless/root/etc/rtsystab (or c1lsc:/etc/rtsystab). There is a commented out line in there that can be uncommented to restore them to the list of models c1lsc is allowed to run.
However, I wouldn't suspect that the models not running should affect the suspension drift, since the connections from them to c1sus have been removed. If we still have trends from early February, we could look and see if the drift was happening before I made this change.
I saw that entry, but it doesn't state what the calibration is in units of Hz/counts. It just gives the final calibrated spectrum.
Arm powers had drifted to ~ 0.5 in transmission.
X and Y arms were locked and ASS'd to bring the arm transmission powers to ~1.
I just found out that the c1lsp and c1sup models no longer exist on the FE status MEDM screens. I am assuming some changes were made to the models as well.
Earlier today, I was looking at some of the old medm screens running on Donatella that did not reflect this modification.
Did I miss any elogs about this or was this change not elogged??
I found the c1lsp and c1sup models not running anymore on c1lsc (white blocks for status lights on medm).
To fix this, I ssh'd into c1lsc. c1lsc status did not show c1lsp and c1sup models running on it.
I tried the usual rtcds restart <model name> for both and that returned error "Cannot start/stop model 'c1XXX' on host c1lsc".
I also tried rtcds restart all on c1lsc, but that has NOT brought the models back.
Does anyone know how I can fix this??
c1sup runs some of the suspension controls, so I am afraid that the drift and frequent unlocking of the arms we see might be related to this.
P.S. We might also want to add the FE status channels to the summary pages.
The last MC_F calibration was done by Ayaka: Elog 7823.
And does anyone know what the MC_F calibration is?
I have created a wiki page with introductory info about the summary page configuration: https://wiki-40m.ligo.caltech.edu/Daily summary help
We can also use that to collect tips for editing the configuration files, etc.
I have set up new summary pages for the 40m: http://www.ligo.caltech.edu/~misi/summary/
This website shows plots (time series, spectra, spectrograms, Rayleigh statistics) of relevant channels and is updated with new data every 30 min.
The content and structure of the pages are determined by configuration files stored in nodus:/users/public_html/gwsumm-ini/ . The code looks at all files in that directory matching c1*.ini; you can look at the c1hoft.ini file to see how this works. A quick guide to the format can be found here: http://www.ligo.caltech.edu/~misi/iniguide.pdf
Please look at the pages and edit the config files to make them useful to you. The files are under version control, so don’t worry about breaking anything.
Do let me know if you have any questions (or leave a comment in the pages).
As Steve pointed out, the summary pages show that the Y arm transmission drifts a lot when locked. The OL summary page shows that this is all due to ITMY yaw.
Could be either that the coil driver / DAC is bad, or that the suspension is poorly built. We need to dig into the ITMY OL trends over the long term to see whether this is new or not.
Also, the weather station needs a reboot. And does anyone know what the MC_F calibration is?
Also, EQ gave us a better (and not pwd protected) URL for the summary pages. Please replace your previous links with this new one:
Why is the Y arm drifting so much?
The " PSL FSS Slow Actuator Adjust " was brought back to range from 1.5 to 0.3 yesterday as ususual. Nothing else was touched.
I'm not sure the time scale is working correctly on these summary plots. What is the definition of "today"?
The Y arm looked much better when I checked at 5 pm.
As requested by the boss.
It would be nice to read this from the EPICS screen C1:Vac-state_mon ... Current State: Vacuum Normal valve configuration.
The maglev is the main pump of our vacuum system below 500 mTorr.
Its long-term pressure has to be <500 mTorr.
Each chamber has its own annulus. These small volumes are independent of the main volume. Their pressures are <5 mTorr in the vac-normal valve configuration.
CC1 = cold cathode gauge (low emission), pressure range 1e-4 to 1e-10 Torr.
In the vac-normal configuration, CC1 = 2e-6 Torr.
The N2 supply is regulated to 60-80 PSI output at the auto cylinder changer.
Each cylinder's pressure will be measured before the regulator, and the two summed; the warning message is to be sent at 1000 PSI.
ln -s /users/public_html/$MYPAGE /export/home/
Koji suggested that I make a cosine fit for the curve instead of a linear fit.
I fit the data to a cosine in fb with argument 2*pi*L*fb/v, where L is the cable length asymmetry (27 cm), fb is the beat frequency, and v is the velocity of light in the cable (2×10^8 m/s).
The plot with the cosine fit is attached.
Fit coefficients (with 95% confidence bounds):
A = 0.4177 (0.3763, 0.4591)
B = 2.941 (2.89, 2.992)
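For completeness, a minimal sketch of redoing the fit in Python (the data file name is a placeholder, and the placement of A and B in the model is my guess, since the fitted form isn't written out above):

import numpy as np
from scipy.optimize import curve_fit

L = 0.27  # cable length asymmetry [m]
v = 2e8   # velocity of light in the cable [m/s]

def dfd_model(fb, A, B):
    # assumed form: amplitude A and phase offset B on the delay-line cosine
    return A * np.cos(2 * np.pi * L * fb / v + B)

fb_data, v_data = np.loadtxt('dfd_scan.txt', unpack=True)  # hypothetical data file
popt, pcov = curve_fit(dfd_model, fb_data, v_data, p0=[0.4, np.pi])
print(popt)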
I left the arms locked last night. Looks like the drift in the Y arm power is related to the Y arm control signal being much bigger than X.
Attached is the schematic of the analog DFD and the plot showing the zero crossing for a delay line length of 27 cm. The bandwidth of the linear output signal roughly matches what is expected from the length difference (370 MHz).
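As a sanity check: the linear region spans half a period of the cosine, i.e. v/(2L) = (2×10^8 m/s) / (2 × 0.27 m) ≈ 370 MHz, in agreement with the number quoted above.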
We could use a shorter cable to further increase our bandwidth. I propose we use this analog DFD to determine the range at which the frequency counter needs to be set, and then use the frequency counter readout as the error signal for FOL.
X arm was far out in yaw, so I reran the ASS for Y and then X. Ran OK; the offload from ASS outputs to SUS bias is still pretty violent - needs smoother ramping.
After this I recentered the ITMX OL; it was off by 50 microradians in pitch. Just like the BS/PRM OLs, this one has a few badly assembled & flimsy mounts. Steve, please prepare to replace the ITMX OL mirror mounts with the proper base/post/Polaris combo; I think we need ~3 of them. Pitch/yaw loop measurements attached.
Based on the PEM-SEIS summary page, it looked like GUR1 was oscillating (and thereby saturating and suppressing the Z channel). So I power cycled both Guralps by turning off the interface box for ~30 seconds and then powering back on. Still not fixed; looks like the oscillations at 110 and 520 Hz have moved, but GUR2_X/Y are suppressed above 1 Hz, and GUR1_Z is suppressed below 1 Hz. We need Jenne or Zach to come use the Gur paddle on these things to make them OK.
From the SUS-WatchDog summary page, it looked like the PRM tripped during the little 3.8 EQ at 4AM, so I un-tripped it.
Caryn's temperature sensors look like they're still plugged in. Does anyone know where they're connected?
Looks like some of our seismometers are oscillating, not mounted well, or something like that. No reason for them to be so different.
Which Guralp is where? And where are our accelerometers mounted?
Same thing again today. So I renamed /etc/init/elog.conf so that it doesn't keep respawning bootlessly. For now, restart the elog using the start script in /cvs/cds/caltech/elog/ as usual.
I'll let EQ debug when he gets back - probably we need to pause the elog respawn so that it waits until nodus is up for a few minutes before starting.
Since the nodus upgrade, Eric/Diego changed the old csh restart procedures to be more UNIX standard. The instructions are in the wiki.
After doing some software updates on nodus today, apache and elogd didn't come back OK. Maybe because of some race condition, elog tried to start but didn't get apache. Apache couldn't start because it found that someone was already binding the ELOGD port. So I killed ELOGD several times (because it kept trying to respawn). Once it stopped trying to come back I could restart Apache using the Wiki instructions. But the instructions didn't work for ELOGD, so I had to restart that using the usual .csh script way that we used to use.
Still processing, but I think it should work fine once we have a day of data. Until then, here are the summary pages so far, including Vac channels:
Rana asked me to add the slow outputs (OUT16) of the seismometer BLRMS channels to the frames.
All of the PEM slow channels are already set up in c1/chans/daq/C1EDCU_PEM.ini, but up to this point daqd had no knowledge of this file, since it wasn't included in c1/target/fb/master, which defines all the places to look for files describing channels to be written to disk. This file already includes lines for C1EDCU_LSC.ini and such, which, from old elogs, looks like it was set up by hand for subsystems we care about.
Hence, since we now care about slow trends for the PEM subsystem, I have added a line to the daqd master file to tell it to save the PEM slow channels. This looks to have increased the size of the individual 16 second frame files from 57MB to 59MB, which isn't so bad.
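Concretely, the change amounts to one more line in target/fb/master alongside the existing EDCU entries; assuming the standard path layout, it would look something like:

/opt/rtcds/caltech/c1/chans/daq/C1EDCU_PEM.ini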
We have two PX303-3KG5V transducers (http://www.omega.com/pressure/pdf/PX303.pdf). They will be installed on the outputs of the N2 cylinders to read the supply pressure.
I will order one PSU-93 DC power supply (http://www.omega.com/pptst/PSU93_FPW15.html).
One full cylinder is ~2400 PSI max, so two of them will give us ~9 VDC.
The email reminder should be sent at 1000 PSI = 1.8 V.
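A sketch of the voltage-to-pressure bookkeeping (assuming the PX303-3KG5V output is linear, 0-5 V over 0-3000 PSI; verify the exact output span against the datasheet):

FULL_SCALE_PSI = 3000.0  # PX303-3KG5V range
FULL_SCALE_V = 5.0       # assumed linear 0-5 V output; check the datasheet

def summed_psi(v_sum):
    """Total pressure of the two cylinders, from the two transducer voltages summed."""
    return v_sum / (2.0 * FULL_SCALE_V) * (2.0 * FULL_SCALE_PSI)

print(summed_psi(1.8))  # ~1080 PSI, roughly the 1000 PSI alarm level quoted above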
Installed libmotif3 and libmotif4 on nodus so that we can run dataviewer on there.
Also, the lscsoft stuff wasn't installed via apt-get, so I installed it following the instructions on the DASWG website:
Then I installed libmetaio1 and libfftw3-3. Now, rather than complaining about missing libraries, diaggui just silently dies.
Then I noticed that the awggui error message tells us to use 'ssh -Y' instead of 'ssh -X'. Using that I could run DTT on nodus from my office.
We want to have a VAC page in the summaries, so Steve - please put a list of important channel names for the vacuum system into the elog so that we can start monitoring for trouble.
Also, anyone that has any ideas can feel free to just add a comment to the summary pages DisQus comment section with the 40m shared account or make your own account.
Based on Jenne's chiara disk usage monitoring script, I made a script that checks the N2 pressure; it will send an email to me, Jenne, Rana, Koji, and Steve should the pressure fall below 60 PSI. I also updated the chiara disk-checking script to work on the new Nodus setup. I tested the two, emailing only myself, and they appear to work as expected.
The scripts are committed to the svn. Nodus' crontab now includes these two scripts, as well as the crontab backup script. (It occurs to me that the crontab backup script could be a little smarter, only backing it up if a change is made, but the archive is only a few MB, so it's probably not so important...)
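For the curious, the idea is roughly this (a sketch only; the channel name and addresses are placeholders, and the real script lives in the svn):

import subprocess
import smtplib
from email.mime.text import MIMEText

THRESHOLD_PSI = 60.0  # alarm level from this entry
# hypothetical channel name for the regulated N2 supply pressure
pressure = float(subprocess.check_output(['caget', '-t', 'C1:Vac-N2_pressure']))
if pressure < THRESHOLD_PSI:
    msg = MIMEText('N2 pressure is %.1f PSI. Change the cylinder!' % pressure)
    msg['Subject'] = '40m N2 pressure low'
    msg['From'] = 'controls@nodus'      # placeholder address
    msg['To'] = 'someone@example.com'   # placeholder; the real script emails the list above
    server = smtplib.SMTP('localhost')
    server.sendmail(msg['From'], [msg['To']], msg.as_string())
    server.quit()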
This watch script gives little time to replace the N2 cylinder: when the regulated supply drops below 60 PSI, the cylinder pressure is 60 PSI too.
It is more a statement that V1 is closed and that one must act accordingly. It's only practical if you are in the lab.
Rana correctly pointed out that we need this message 24 hrs before it happens. This requires monitoring the total supply, not the regulated one.
So we need pressure transducers on each nitrogen cylinder, before the regulator. The sum of the two N2 cylinders when they are full is 4000-4500 PSI.
The first email should be sent out at 1000 PSI, summed over the two cylinders. This means you have ~1 day to replace a nitrogen cylinder.
Most of the time the daily consumption is 750 +/- 50 PSI, though sometimes the variation goes up to ~750 +/- 150 PSI.
Yesterday morning was dusty. I wonder why?
The PRM sus damping was restored this morning.
Yesterday afternoon at 4 the dust count peaked at 70,000 counts.
Manasa's allergy was bad at the X end yesterday. What is going on?
There was no wind and CES neighbors did not do anything.
Air conditioning filters were checked by Chris. The 400-day plot shows 3 bad peaks, at 1-20, 2-5 & 2-19.
I'm sad. And frustrated.
The PRCL angular feed forward is not working, and without it I am having a very difficult time keeping the PRMI locked while the arms are at high power (either buzzing, or the one time I got stable high power partway through the transition). Obviously if the PRMI unlocks once CARM and DARM are mostly relying on the REFL signals, I lose the whole IFO.
Q and I had been noticing over the last few weeks that the angular feed forward wasn't seeming quite as awesome as it did when I first implemented it. We speculated that this was likely because we had started DC coupling the ITM optical levers, which changes the way seismic motion is propagated to cavity axis motion (since the ITMs are reacting differently).
Anyhow, today it does not work at all. It just pushes the PRM until the PRMI loses lock. I am worried that, even though Rana re-tuned the BS and PRM oplev servos to be very similar to how they used to be, there is enough of a difference (especially when compounded with the DC coupled ITMs) that the feed forward transfer functions just aren't valid anymore.
Since this prevents whole IFO locking, I spent some time trying to get it back under control, although it's still not working.
I remeasured the actuator transfer function of how moving PRM affects the sideband spot at the QPD, in the PRMI-only situation. I didn't make a comparison plot for the yaw degree of freedom, but you can see that the pitch transfer function is pretty different below ~20Hz, which is the whole region that we care about. In the plot below, black is from January (PRMI-only, no DC-coupled ITMs) and blue is from today (PRMI-only, with DC-coupled ITMs, and somewhat different BS/PRM oplev setup):
I calculated new Wiener filters, and tried to put them in, but sometimes (and I don't understand what the pattern is yet) I get "error" in the Alternate box, rather than the zpk version of my sos filter. It seems to go away if you use fewer and fewer poles for fitting the Wiener filters, but then the fit is so poor that you're not going to get any subtraction (according to the residual estimation plot that uses the fitted filters rather than the ideal Wiener filters). The pitch filters could only handle 6 poles, although the yaw filters were fine with 20.
The feed forward just keeps pushing the PRM away though. I flipped the signs on the Wiener filters, I tried recalculating without the actuator pre-filtering, I don't know why it's failing. But, I'm not able to lock the interferometer. Which sucks, because I was hoping to finally get most of my noise coupling measurements done today.
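For context, the Wiener filter calculation here boils down to solving for the FIR filter that best predicts the target motion from the witness (seismometer) channels; a toy version of that step (my own sketch, not the actual 40m scripts) looks like:

import numpy as np

def wiener_fir(witness, target, ntaps=256):
    """Least-squares FIR Wiener filter: predict target from past witness samples."""
    N = len(target) - ntaps
    # matrix of lagged witness samples, one column per tap
    X = np.column_stack([witness[i:i + N] for i in range(ntaps)])
    y = target[ntaps:ntaps + N]
    w = np.linalg.lstsq(X, y, rcond=None)[0]
    return w  # FIR taps; the predicted coupling is np.convolve(witness, w)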
After last week's work on the BS/PRM oplev table, I think the PRM oplev got centered while the PRM was misaligned. With the PRM aligned, the oplev spot was not on the QPD. It has been centered.
Thank you both.
I have updated the .snap file, so that it'll use these parameters, as Rana left them. Also, so that the "unfreeze" script works without changes (since it wants to make the overall gain 1), I have changed the Xarm input matrix elements from 1 to 0.1, for all of them. This should be equivalent to the overall gain being 0.1.
I have unplugged POXDC and POYDC from their whitening inputs. They have labels on them indicating which whitening channel they belong to (POY=5, POX=6) on the DCPD whitening board.
TT3_LR's DAC output is Tee-ed, going to the POYDC input and also to an SR560 near the Marconi.
TT4_LR's DAC output is Tee-ed, going to the POXDC input and also to the CM board's ExcB input.
Today I tried some things, but basically, lowering the input gain by 10 made the thing stable. In the attached StripTool screenshot, you can see what happens with the gain at 1: after a few cycles of oscillation, I turned the gain back to 0.1.
There is still an uncontrolled DoF, but that's just the way it is, since we only have one mirror (the BS) to steer into the X arm once the Y arm pointing is fixed.
Along the way, I also changed the phase for POX, just in case that was an issue, from +86 to +101 deg. The attached spectra show how that lowered the POX_Q noise.
I also changed the frequencies for ETM_P/Y dither from ~14/18 Hz to 11.31/14.13 Hz. This seemed to make no difference, but since the TR and PO signals were quieter there I left it like that.
This is probably OK for now and we can tune up the matrix by measuring some sensing matrix stuff again later.
If at all possible, don't use links to your home directory. It's not robust. It would be like if you clicked on your Google Music and it told you to ask me to sing that song to you. Imagine that on auto-repeat next time your fancy-bone itches.
Since running Python from the crontab seemed intractable, I replaced autoMX.py with a soft link that points at autoMX.sh.
This is a simple BASH script that looks at the LSC FB status (C1:DAQ-DC0_C1LSC_STATUS) and runs the restart-mxstream script if it's non-zero.
So far it's run 5 times successfully. I guess this is good enough for now. Later on, someone ought to make it loop over the other front ends, but this ought to catch 99% of the FB issues.
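The logic amounts to something like this (rendered in Python for illustration only; the actual autoMX.sh is bash, and the restart-script path is a placeholder):

import subprocess

# read the LSC frame-builder status word (0 means healthy)
status = int(subprocess.check_output(['caget', '-t', 'C1:DAQ-DC0_C1LSC_STATUS']))
if status != 0:
    # hypothetical path; point this at the actual mxstream restart script
    subprocess.call(['/opt/rtcds/caltech/c1/scripts/cds/restart_mxstream.sh'])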
Ugh, this turns out to be because cron doesn't source the controls bashrc that defines where to find caget and all that jazz that many commands depend on. This is probably why the AutoMX cron job isn't working either.
Also, cron automatically emails everything from stderr to the email address that is configured for the user, which is why the n2 script blew up the foteee account and why the AutoMX script was blowing up my email yesterday. This can be avoided by doing something like this in the crontab:
0 8 * * * /bin/somecommand >> somefile.log 2>&1
(The >> part means that the standard output is appended to some log file, while the 2>&1 means send the standard error stream to the same place as stdout)
I made this change for the n2 script, so the foteee email account should be safe from this script. I haven't yet figured out the right way to set up cron with the right $PATH and the other environment variables that EPICS tools need, so the script is still not working.
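One fix worth trying (untested here, and the bashrc path is a guess at the controls account's file): have cron run the job under bash and source the controls environment explicitly, e.g.

15,45 * * * * /bin/bash -c 'source /home/controls/.bashrc; /opt/rtcds/caltech/c1/scripts/SUS/rampdown.py' >> /tmp/rampdown.log 2>&1

Note that some bashrc files bail out early for non-interactive shells, in which case the PATH and EPICS exports would need to move to a file that cron can source. Alternatively, one can set PATH and the EPICS variables at the top of the crontab itself, since cron only provides a minimal environment.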