For CARM and DARM, the A channels are used for the ALS signals, whereas the B channels are used for blending the RF signals.
For the DRMI, the A channels are used for the 1F signals, whereas the B channels are used for the 3F signals. The settings for transitioning to 1F after locking the DRFPMI have not yet been determined.
These settings are currently saved in the DRMI configurator, but the demod angles are set for DRFPMI lock, so the settings don't reliably work for misaligned arms.
The REFL33 element in SRCL_B is there to reduce the PRCL coupling; it was found empirically by tuning the relative gains with the arms misaligned and looking at excitation line heights. The offsets were found by locking the DRMI on 1F signals with the arms misaligned and taking the average value of the 3F error signals.
The CARM and DARM ALS settings are largely scripted by scripts/ALS/Transition_IR_ALS.py, which takes you from arms POX/POY locked to CARM and DARM ALS locked. The DRMI settings are usually restored from the IFO_CONFIGURE screen.
When arms are POX/POY locked, and the green beatnotes are appropriately configured, calling scripts/DRFPMI/carm_cm_up.sh initiates the following sequence of events:
When CARM and DARM are buzzing around true zero, powers maximized:
This is as far as we've taken the DRFPMI so far, but the CARM bandwidth is still only at a few kHz. Based on PRFPMI locking, the next steps will be:
Last Friday, I installed some RF couplers on the green BBPDs' outputs, and sent them over to Gautam's frequency divider module. At first I tried 20dB couplers, but it seemed like not enough power was reaching the dividers to produce a good output. I could only find one 10dB coupler, and I stuck that on the X BBPD. With that, I could see some real signals come into the digital system.
I don't think it should be a problem to leave the couplers there during other activities.
Gautam was working on his digital frequency counter stuff when the c1als model crashed. I had trouble bringing it back until I realized that, for reasons unknown to me, the safe.snap file that the model looks for at boot had been deleted. (This file lives in /target/c1als/c1alsepics/burt). I copied over this morning's version from Chiara's local backup.
At the sites, these files are under version control in the userapps svn repository, presumably symlinked into the target directory. We should definitely do something along these lines.
Here is a longer lock, about 100 seconds RF only, from later that same night. The in-loop CARM and DARM error signals have the order of magnitude of 1nm per count.
From ~-150 to -103, we were fine tuning the ALS offsets to try and get close to the real CARM/DARM zero points then blending the RF CARM signal.
At -100, the CARM bandwidth increases to a few kHz and stabilizes the arm powers. By -81, the error signals are all RF. At -70, I turned on the transmon QPD servos, which brought the power up a bit.
If I recall correctly, lock was lost because I put waaaay too big of an excitation on DARM with the goal of running its UGF servo for a bit. The number I entered was appropriate for ALS, but most certainly too huge for AS55...
Steve pointed out that in the aftermath of the Nitrogen running out a couple of times last week, the RGA had shut itself off thinking that there was a leak and so it was not performing the scheduled scans once a day. So the data files from the scheduled scans were empty in the /opt/rtcds/caltech/c1/scripts/RGA/logs directory. The wiki page for getting it up and running again is up-to-date, but the script RGAset.py did not exist on the c0rga machine, which the RGA is communicating with via serial port. I copied over the script RGAset.py from rossa to c0rga and ran the script on that machine - but the error flags it returned were not all 0 (indicating some error according to the manual) - so I edited the script to send just the initialize command ('IN0') and commented out the other commands, after which I got error flags which were all 0. After this, I ran a manual scan using 'RGAlogger.py', and it appears that the RGA is now able to take scans again - I'm attaching a plot of the scan results. We've saved this scan as a reference to compare against after a few days.
In addition to Koji's words, I feel like we should also thank those who made small but positive contributions. It's hard not to notice that this locking only happened after the new StripTool PEM colors were implemented...
From the time series plot I guess that the fuzz of the in-loop DARM is 1 pm RMS (based on memory). This means that the ALS was holding the DARM at 10 pm from the RF resonances.
There is no significant shift in the DRMI error signals, so no new weird CARM effect. It would be interesting to see what the 1f signals do in the last 60 seconds before RF lock.
For documentation, perhaps Gautam can post the loop gain measurements of the 5 loops on top of the Bode plots of the loop models.
Vacuum normal configuration with VM1-closed, VM2-open valve positions. Power load normal 24V 0.2A
Maglev rotation 560 Hz at room temp body temp.
"NO COMM" error message on medm screen. Gauge controller pressures are read able.
All vac comp LEDs are green. We have to reboot on Monday to enable communication.
None of the links here seem to work. I forgot what the story is with our special apache redirect
The story is: we currently don't expose the whole /users/public_html folder. Instead, we are symlinking the folders from public_html to /export/home/ on nodus, which is where apache looks for things
So, I fixed the links on the Core Optics page by running:
controls@nodus|~ > ln -sfn /users/public_html/40m_phasemap /export/home/
Fast ALS was still a problem tonight. I don't think high frequency ALS noise saturating the PC drive is the issue; I put two 10k poles before the CM board (shooting for just 2-3kHz bandwidth), and the PC drive levels would be stable and low up until the lockloss, which was always coincident with a step in the AO gain.
After working with that for a few hours, we turned back to our more standard locking attempts. First, we dither aligned the PRMI, and then centered the REFL beam on REFL11. It's hard to say for certain, but we may have been a little close to the edge of the PD. The only other thing that differed from Monday's attempts was using 6dB less AO gain when trying to up the overall gain.
The script now reliably breaks through to stable high powers, we had a handful of pure-RF locks tonight. The digital DARM gain needs tuning, and the CARM bandwidth still isn't at its final state, but these are very tractable. Off the top of my head, the way forward now includes:
Unrelated: I feel that the PRC angular FF may have deteriorated a bit. I'm leaving the PRC locked on carrier to collect data for Wiener filter recalculation.
For real this time.
I carried out some further diagnostics and found some ways in which I could optimize the zero-crossing-counting algorithm, such that the error in the measured frequency is now entirely within the expected range (due to a +-1 clock cycle error in the counting). We can now determine frequencies up to ~60 MHz with less than 1 MHz systematic error and <10 kHz statistical error (fluctuations after the 20 Hz lowpass). This should be sufficient for slow control of the end-laser temperatures.
The conclusion from my earlier tests was that there was possibly an improvement that could be made to setting the thresholds for the Schmitt trigger stage in the model. In order to investigate this, I wanted to have a look at the 64K sampled raw input to the ADCs. Yesterday Eric helped me edit the appropriate .par file for viewing these channels for c1x03, and for an input frequency of 70MHz (after division, ~4.3 kHz square wave), the signal looked as expected (top left plot, Attachment #1). This prompted me to check the counting algorithm again with the help of various test points I had set up within the model. I found that there was a tendency to under-count the number of clock cycles between zero-crossings by more than 1 clock cycle, due to the way my code was organized. I fixed this and found that the performance improved dramatically compared to my previous trials. With the revised counting algorithm, there was at most a +-1 clock cycle error in the counting, and the systematic error between the measured and requested RF frequencies is now completely accounted for once this is taken into account. The origin of this residual error can be understood by looking at the top right plot in Attachment #1 - presumably because of the effects of some downsampling filter, the input signal to the Schmitt trigger isn't a clean square wave (even at 4kHz) - specifically, the time spent in the LOW and HIGH states of the Schmitt trigger can vary between successive zero crossings because of the shape of the input waveform. As a result, there can be a +-1 clock cycle error in the counting process. Attachment #2 shows this - the red and blue lines envelope the measured frequency for the whole range investigated: 10-70MHz. Attachment #3 shows the systematic error as a function of the requested frequency.
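To make the +-1 clock cycle argument concrete, here is a minimal sketch of the frequency envelope such an error produces. It assumes the frequency is estimated as the clock rate divided by the number of clock cycles per period of the divided-down signal, and that "64K" means 65536 Hz. Both are assumptions for illustration, and `count_envelope` is a hypothetical helper, not the code running in the model.

```python
def count_envelope(f_signal_hz, f_clock_hz=65536.0):
    """Bounds on a reciprocal-counting frequency estimate with a +-1 count error.

    f_signal_hz: frequency of the (divided-down) square wave being counted.
    f_clock_hz:  counting clock rate; 65536 Hz is an assumed value for "64K".
    Returns (ideal cycles per period, lower bound in Hz, upper bound in Hz).
    """
    n_true = f_clock_hz / f_signal_hz      # ideal, generally non-integer, count
    if n_true <= 1.0:
        raise ValueError("signal too fast to count with this clock")
    f_low = f_clock_hz / (n_true + 1.0)    # one clock cycle over-counted
    f_high = f_clock_hz / (n_true - 1.0)   # one clock cycle under-counted
    return n_true, f_low, f_high


if __name__ == "__main__":
    # Illustrative divided-down frequencies only, not measured data.
    for f in (1000.0, 2000.0, 4300.0):
        n, lo, hi = count_envelope(f)
        print(f"{f:7.1f} Hz: {n:5.1f} cycles/period, envelope {lo:7.1f} to {hi:7.1f} Hz")
```

The relative width of the envelope is roughly 2/n, where n is the number of clock cycles per period, so it widens as the counted signal gets faster.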
If there was some way to bypass the downsampling filter, perhaps the high-frequency performance could be improved a little.
Eric needed a buffer to drive the low input impedance (~130Ohm) of his pomona box, so I quickly made a non-inverting buffer with G=+10. The DC power is obtained from the back of an SR560. It uses 1.02K and 9.09K
to have the gain of ~10. The chip is OP27. In fact this limits the output swing to be +/-5V for the load resistance of 130Ohm. Eric thinks this is enough. If we need more, we need to swap the chip.
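As a quick check of the numbers above, the sketch below evaluates the ideal non-inverting gain and the peak current that a 5 V swing implies into the 130 Ohm load. Which resistor sits in the feedback path is not stated; this assumes 9.09K in the feedback path and 1.02K to ground, which is the arrangement that gives a gain near 10. The function names are hypothetical.

```python
def noninverting_gain(r_feedback_ohm, r_ground_ohm):
    """Ideal closed-loop gain of a non-inverting op-amp stage: 1 + Rf/Rg."""
    return 1.0 + r_feedback_ohm / r_ground_ohm


def load_current_ma(v_out_volt, r_load_ohm):
    """Current in mA delivered into a resistive load at a given output voltage."""
    return 1e3 * v_out_volt / r_load_ohm


gain = noninverting_gain(9.09e3, 1.02e3)   # assumed resistor placement
i_peak = load_current_ma(5.0, 130.0)       # 5 V swing into the 130 Ohm load
print(f"gain = {gain:.2f}, peak load current = {i_peak:.1f} mA")
```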
As SR560s tend to saturate too quickly, it would be very useful to have this kind of kit in all the labs
once it is packed in a box.
After some discussion at last week's 40m meeting, I changed the interval at which daqd tries to write out minute trends from hourly to every two hours.
This has eliminated the hourly crashes. daqd still crashes sometimes, but only a few times per day.
However, looking at the oplev summary pages that actually use the minute trends, it looks like they're only sporadically getting successfully written out.
Also, I was having a lot of problems with the frontends' EPICS processes dying when I would try to update the SDF table. I rebuilt all of the frontends with RCG 2.9.6, which differs from the 2.9.4 that we had been running by SDF bugfixes and an RMS calculation bugfix. The SDF procedures are much more stable now.
I have not yet discovered anything broken by this change, and the tests I made for the last upgrade were all fine; last week's tiny DRFPMI lock was achieved after this change.
I've made a cascaded passive 2-pole pomona box for fast ALS use, using LISO to check that it'll give the right shape when hooked up to the CM board's input stage.
First stage is a 133Ohm + 10uF cap for ~120Hz LP, second is 1.15kOhm + 47nF cap for ~3.8kHz LP. The DC gain is ~0.75, which is much better than what I was doing before. The second stage would normally make a 2.9kHz LPF on its own, but the loading of the input stage moves the corner up.
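The unloaded corner frequencies quoted above follow from f_c = 1/(2*pi*R*C). A minimal sketch to reproduce them is below. It treats each stage in isolation, so it does not capture the loading by the CM board input stage that moves the second corner up to ~3.8kHz or sets the DC gain; that is what the LISO model is for. `rc_corner_hz` is a hypothetical helper name.

```python
import math


def rc_corner_hz(r_ohm, c_farad):
    """Corner frequency of an isolated single-pole RC lowpass."""
    return 1.0 / (2.0 * math.pi * r_ohm * c_farad)


stage1 = rc_corner_hz(133.0, 10e-6)    # 133 Ohm + 10 uF
stage2 = rc_corner_hz(1.15e3, 47e-9)   # 1.15 kOhm + 47 nF
print(f"stage 1: {stage1:.0f} Hz, stage 2 (unloaded): {stage2:.0f} Hz")
```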
It seems the 133 Ohm resistor is a reasonable load on the output AD829 of the ALS demod board (short-circuit output current of 32mA and a series output resistor of 499Ohm). To be able to use the digitized ALSX I and the lowpassed analog version simultaneously, I had to buffer the signal with an SR560 before the pomona box, otherwise the signals looked distorted. This isn't a good long-term solution. Maybe I can use the further-buffered differential output to drive the LPF+CM board.
The LISO files used to model the filter and CM board input stage, and fit the pole frequencies are attached.
I made some attempts to get the AO path going today, but I suspect this daytime noise is just too much; the PC drive seems too irritable
Despite our best efforts, the grappa remains out of reach: the DRFPMI was not locked tonight.
We spent a fair amount of time with the AUX X laser, as it was glitching madly again.
DRMI was finicky until I found some more reliable triggering settings; namely acquiring with AS110Q, but after that transitioning the trigger to the same POP22+POPDC combo as PRCL and MICH. With this in place, the DRMI lock seems really indefinite no matter what CARM seems to do; or at least, I always lost lock due to CARM shenanigans after this.
The most frustrating part was the fact that I just couldn't cross over the AO path stably. It never "clicked" into high circulating power as it normally does (either in PRFPMI, or how it was last week). Various crossover filters and tweaks were attempted to no avail. Morning traffic starts soon, so we're calling it a night.
I carried out some more tests on the digital frequency counting system today, mainly to see if the actual performance mirrors the expected systematic errors I had calculated here.
Setup and measurement details:
I used the Fluke 6061A RF signal generator to output an RF signal at various frequencies, one at a time, between 10 and 70 MHz. I split the signal (at -15 dBm) into two parts, one for the X-channel and one for the Y-channel using a mini-circuits splitter. I then looked at the input signal using testpoints I had set up within the model, to decide what thresholds to set for the Schmitt trigger. Finally, I averaged the outputs of the X and Y channels using z avg -s 10 C1:ALS-FC_X_FREQUENCY_OUT and also looked at the standard deviation as a measure of the fluctuations in the output (these averages were taken after a low-pass filter stage with two poles at 20Hz, chosen arbitrarily).
The new ZKL-1R5 RF amplifier that Steve ordered arrived yesterday. I installed this in the frequency divider box and did a quick check using the Fluke RF signal generator and an oscilloscope to verify that both the X and Y paths were working.
I've now installed the box in the 1X2 rack where the old "RF amplifiers for ALS and FOL" box used to sit (I swapped that out as I needed the L brackets on that chassis to mount mine, see Attachment #1 for the new layout). The power cable that used to power the old chassis was available, but the connector was of the wrong gender, so I had to switch this out. After verifying that I was getting the correct voltage (+15V), I connected it to the chassis.
I then did a quick check with the Fluke generator to make sure that all was working as expected - Eric had set up some ADC channels for me earlier today in the C1ALS model, and I copied over my frequency counting module from C1TST into C1ALS, and recompiled the model. The RF generator was set to generate a 25MHz signal at -20dBm, which I then split using an RF power splitter between the X and Y arms. I then checked the output using dataviewer - I recovered an output frequency of ~27.64 MHz with a jitter of ~0.02 MHz with a 20Hz low-pass filter in place (see Attachment #2), which looks consistent with the systematic error inherent in the zero-crossing counting algorithm and random fluctuations I had observed in my earlier trials, discussed here. But a more systematic investigation needs to be carried out in this regard. The interfacing between the hardware and software seems to be working alright though. I've left the RF generator near the 1x2 rack for now, though it's powered off.
The mode cleaner unlocked quite a few times while I was working but looks stable now.
A few minutes ago, Gautam and I were poking around the IOO rack, looking at where he should power his frequency divider box, and what ADC inputs to use.
Looking at the mode cleaner signals, it looks like we may have jostled something in a good way. Weird.
Ah, I understand it now! Since the additive offset path keeps the post-cavity frequency TF flat, the pre-cavity frequency must grow above the cavity pole, which is why ALS sees a zero.
Ok, so this means we want to apply two lowpasses to the ALS signal for use as fast CARM control, if we want it to be capable of scalar blending with REFL11: one at ~120Hz to imitate the CARM coupled cavity pole present in REFL11, and one at ~3.8kHz to undo the "IMC cavity zero" present in ALS.
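A minimal sketch of the combined response of the two proposed lowpasses, treating them as ideal, non-interacting real poles at 120 Hz and 3.8 kHz. This is an idealization: a passive implementation will interact with whatever it drives. The function names are hypothetical.

```python
import cmath
import math


def two_pole_lowpass(f_hz, f_p1_hz=120.0, f_p2_hz=3.8e3):
    """Complex response of two cascaded ideal single-pole lowpasses."""
    return 1.0 / ((1.0 + 1j * f_hz / f_p1_hz) * (1.0 + 1j * f_hz / f_p2_hz))


def mag_db(h):
    """Magnitude of a complex response in dB."""
    return 20.0 * math.log10(abs(h))


def phase_deg(h):
    """Phase of a complex response in degrees."""
    return math.degrees(cmath.phase(h))


if __name__ == "__main__":
    for f in (10.0, 120.0, 1e3, 3.8e3, 1e4):
        h = two_pole_lowpass(f)
        print(f"{f:8.0f} Hz: {mag_db(h):7.2f} dB, {phase_deg(h):8.2f} deg")
```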
At this point, I'm starting to prefer an active circuit to do this lowpassing; using LISO to check designs for two cascaded passive LPFs it looks like the ALS signal would have to be attenuated by a factor of ~20 at DC if we don't use resistors smaller than 1k, given the low input impedance of the CM board.
ALS is the comparison of the PSL laser freq vs the end laser freq that is locked to the arm cavity resonant freq
On the other hand, the AS55 PDH is the comparison of the PSL laser freq after the IMC vs the arm cavity resonant freq. Here the PDH signal involves the arm cavity pole.
In total you observe the difference by the IMC cav pole + the arm cav pole.
To get a better look at how to do fast ALS, I took some "Plant TF" measurements of the X arm.
Specifically, in single arm POX lock and with both Y TMs misaligned, I used the SR785 to inject into EXC B of the common mode board with the CM fast output gain and IMC IN2 gain both at 0dB, and looked at the transfer function of that excitation into the analog ALSX I and AS55 Q out-of-loop signals. (ALSX I tuned to a zero crossing via the delay line box as usual.)
My expectation was to see them only differ by the IR single arm cavity pole, which should be around 8-9kHz ( FSR/450 = 3.9MHz/450 ~ 8.6kHz). The green cavity pole at ~18k shouldn't show up since we're not touching the green light, and the IMC pole at ~3.8kHz shouldn't show up since this is well within the IMC loop bandwidth and we're actuating on its error point.
Instead, I see them differ by a double pole at 4.3kHz (or zero, if you look at it the reciprocal way). Vectfit actually fits them as a slightly complex pair, with a Q of 0.53. I imagine that the wiggles are due to the digital control loop.
My question is: why is there a double zero here? Where has my reasoning led me astray?
At 10:02AM, the N2 Pressure fell below 60 PSI. The watch script saw this happen, but I did not receive the email it is supposed to send.
C1:Vac-P1_pressure reads 7e-4, which is the same as it has for the past ~2 days, so the V1 interlock worked fine.
I've put some fresh N2 into the system, and Bob will pop in over the weekend to check it. I'll stay on top of it until Steve gets back.
After consulting ELOGs and the 40m wiki, I reasoned it was ok to open the V1 to reconnect the turbo pump to the main IFO volume and VM1 to reconnect the RGA, and have now done so.
I hope the grappa was already cold, and ready to drink!
Look upon this three second lock, ye Mighty, and rejoice!
Give us a lockloss or other kind of time series plot so we can bask in the glory.
Please clarify: I wonder if you were at the zero offset for CARM and DARM or not.
Yes, this was at the full DRFPMI resonance.
Please clarify: I wonder if you were at the zero offset for CARM and DARM or not. I am 25% excited right now.
Progress was made. CARM was stably locked on RF only. DARM was RF only for a few moments before I typed in a wrong number...
A change was made to the LSC model's triggering section to make the DRMI hold more reliably at zero CARM offset. Namely, the POPDC signal now has its absolute value taken before the trigger matrix. Even unwhitened, it occasionally would somehow go negative enough to break the DRMI trigger.
AUX X laser was acting up again. As before, tweaking laser current is the temporary fix.
We had a look at the IR beat (PSL+Xarm) today using the new FOL fiber box, and compared it to the green beat signal for the same combination. We first switched out the green Y beat input into the RF amplifiers on the PSL table with the PSL+Xarm IR beat input (so in all the plots, the BEATY channels really correspond to the IR beat for PSL+X). The IR and green beat notes were found without much difficulty, and we compared the beat signal PSDs for the green and IR signals (see Attachment #1 - arms were locked to green and the X slow control was turned on). The pink trace (labeled REF1) corresponds to the green beat signal, and was in good agreement with an earlier reference trace Eric had saved for the same signal. The teal trace (labeled REF0) corresponds to the IR beat signal monitored simultaneously.
We then went back to the PSL table to check the amplitude of the signal from the broadband fiber PDs using the Agilent network analyzer. An initial measurement yielded a beat note (@~50MHz) at ~-22dBm (17mV rms). We figured that by bypassing the 90-10 splitter in this path, we could get a stronger signal. But after switching out the fiber connections we found that the signal amplitude had fallen to ~-27dBm (10mV rms). As per my earlier measurements here, we expect ~600uW of light on the PD, and a quick calculation suggested the signal should be more like 60mV, so we used the fiber power meter to check the power levels after each of the couplers again. We then found that the fiber connector on the front panel of the box for the PSL input wasn't ideal (the laser power after the first 50-50 coupler was only ~250 uW, though the input was ~1.2 mW). The power after the first coupler also fluctuated unpredictably (<100 uW to 350 uW) in response to slightly tightening/loosening the fiber connections on the front panel. I then switched the PSL input to one of the two unused fiber connectors on the front panel (meant for the 10% of the beat signal for the DC readout), and found that this input behaved much better, with ~450 uW of power available after the first 50-50 coupler. The power going into the beat PD was also measured to be ~550uW, closer to what was expected. The beat signal peak now was ~-14dBm (~30mV rms).
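For reference, the dBm to rms voltage conversion used for readings like these can be sketched as below. It assumes the analyzer reports power into a 50 Ohm load, which is an assumption since the input impedance is not stated above; `dbm_to_vrms` is a hypothetical helper name.

```python
import math


def dbm_to_vrms(p_dbm, r_load_ohm=50.0):
    """Convert a power level in dBm to rms voltage across a resistive load."""
    p_watt = 1e-3 * 10.0 ** (p_dbm / 10.0)   # dBm is referenced to 1 mW
    return math.sqrt(p_watt * r_load_ohm)


if __name__ == "__main__":
    for p in (-22.0, -27.0):
        print(f"{p:.0f} dBm -> {1e3 * dbm_to_vrms(p):.1f} mV rms")
```

With a 50 Ohm load this gives roughly 18 mV and 10 mV for the first two readings.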
We then once again repeated the comparison between green and IR beat signals - but while in the control room, I noticed that the beat signal amplitude on the network analyzer in the control room was fluctuating by nearly 1.5 divisions on the vertical scale - not sure what the reason for this is. A look at the PSD of the IR beat with higher power incident on the PD was also not encouraging (see blue trace in Attachment #1), it seems to have gotten worse in the 10-30 Hz range. We also looked at the coherence between the beat spectrum and the beat note amplitude in order to look for any linear coupling between the two, but from Attachment #2, we cannot explain the disparity between the green and IR beat spectra. This warrants further investigation.
Everything on the PSL table has now been restored to the configurations before these investigations (i.e. the Y+PSL green beat cable has been reconnected to the RF amplifier, and both green beat PDs have been powered back ON. The fiber PDs are powered OFF)
Highlight of the night: the DRFPMI was held at arm powers > 110 for 20 seconds. ALS feedback was still running though, but so was some nonzero REFL11 AO path action.
In short, time was spent finding the right FM trigger settings to keep the DRMI locked while CARM is fluctuating through resonance, what CARM offset to acquire DRMI lock at, order of operations of turning on AO / turning up overall CARM gain, etc.
Sadly, for the past hour or so, the DRMI has refused to stay locked for more than ~20 seconds, so I haven't been able to push things much further. This is a shame, since I'm very nearly at the equivalent point in the PRFPMI locking script where the ALS control is turned off completely.
This JDSU 1103P laser, sn P892324 lived for 2 years. Its power output is 0.05 mW now.
It was replaced with brand new JDSU 1103P, sn P919645, Mfg date 12/2014 with 2.75 mW output.
There is 0.14 mW light returning to the qpd = 7,250 counts without AR 632 lenses
Gautam alerted me that the Y arm looked like it was being dithered, even though the ASS was turned off. I found that the ETMY OL signals were garbage, leading to the servos flipping back and forth between their rails.
We went out to the ETMY table, and found the HeNe laser to be emitting a paltry <0.5mW; the OL QPD could not register the puny beam incident on it.
Here is the last 30 days of OL_SUM:
Steve will replace the laser this afternoon.
RGA background scan.
The IFO is closed off from RGA with VM1
CC4 (and CC1) is still flaky and its interlock closes VM1
Here's an example of the glitches we've been seeing, as seen in the StripTool trace of the front end oscillator:
You can clearly see the glitch at around T = -18. Obviously during non-glitch times the sine wave is nice and cleanish (there are still the very small discretisation from the EPICS sample times).
I tried to look at fb1 again today, but still haven't made any progress.
The one thing I did notice, though, is that every hour on the hour the fb1 daqd process dies in an identical manner to how the fb daqd dies, with these:
[Sun Oct 4 12:02:56 2015] main profiler warning: 0 empty blocks in the buffer
errors right as/after it tries to write out the minute trend frames.
This makes me think that this new hardware isn't actually going to fix the problem we've been seeing with the fb daqd, even if we do get daqd "working" on fb1 as well as it's currently working on fb.
I've finished, for now, the CDS network tests that I was conducting. Everything should be back to normal.
What I did:
I wanted to see if I could make the EPICS glitches we've been seeing go away if I unplugged everything from the CDS martian switch in 1X6 except for:
What I unplugged were things like megatron, nodus, the slow computers, etc. The control room workstations were still connected, so that I could monitor.
I then used StripTool to plot the output of a front end oscillator that I had set up to generate a 0.1 Hz sine wave (see elog 11662). The slow sine wave makes it easy to see the glitches, which show up as flatlines in the trace.
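The flatline signature described above could also be picked out of recorded samples automatically. The sketch below is one hypothetical way to do it, not something that was run here: it flags runs of consecutive identical samples in a trace, on the assumption that a healthy slow sine never repeats a value several times in a row. The 10 Hz sample rate and the injected freeze in the demo are made up for illustration.

```python
import math


def find_flatlines(samples, min_run=3, tol=0.0):
    """Return (start_index, length) for each run of at least min_run
    consecutive samples that stay within tol of the first sample of the run."""
    runs = []
    start = 0
    for i in range(1, len(samples) + 1):
        # Close the current run at the end of data or when the value moves.
        if i == len(samples) or abs(samples[i] - samples[start]) > tol:
            if i - start >= min_run:
                runs.append((start, i - start))
            start = i
    return runs


def demo_trace(n=100, f_hz=0.1, fs_hz=10.0, freeze=(30, 40)):
    """Synthetic 0.1 Hz sine with one artificial freeze (illustrative values)."""
    trace = [math.sin(2.0 * math.pi * f_hz * i / fs_hz) for i in range(n)]
    for i in range(freeze[0] + 1, freeze[1]):
        trace[i] = trace[freeze[0]]   # hold the last good value
    return trace


if __name__ == "__main__":
    print(find_flatlines(demo_trace()))
```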
More tests are needed, but there was evidence that unplugging all the extra stuff from the switch did make the EPICS glitches go away. During the duration of the test I did not see any EPICS glitches. Once I plugged everything back in, I started to see them again. However, I'm currently not seeing many glitches (with everything plugged back in) so I'm not sure what that means. I think more tests are needed. If unplugging everything did help, we still need to figure out which machine is the culprit.
I've taken over one of the SENSMAT oscillators for a test of the EPICS system.
These are the channels I've modified, with their original and current settings:
controls@donatella|~ > caget C1:LSC-OUTPUT_MTRX_7_13 C1:CAL-SENSMAT_CARM_OSC_FREQ C1:CAL-SENSMAT_CARM_OSC_CLKGAIN
controls@donatella|~ > caget C1:LSC-OUTPUT_MTRX_7_13 C1:CAL-SENSMAT_CARM_OSC_FREQ C1:CAL-SENSMAT_CARM_OSC_CLKGAIN
I'm about to start conducting some tests on the CDS network. Things will probably be offline for a bit. Will post when things are back to normal.
Gautam has received 40m-specific basic safety training today.
Cable numbered #53 from Accelerometer 4 to 1X7 / DAQ input c26 was squashed while removing network card from Sun Fire x4600 today.
This cable has to be tested.
I've been using an SR560 to experiment with different pole frequencies, to try and cancel the mystery zero. It's after the ALS demod board, before the pomona LPF with a gain of five.
A pole frequency of 3kHz seems to recover sensible loop shapes. I've been able to crossover the AO path to make a nice long phase bubble which isn't the prettiest, but seems workable.
Getting to this point is now almost entirely scripted and repeatable; one just has to make sure that the ALS beat has the correct sign and adjust the delay line length. Most frustratingly, due to the dependence of the ALS gain on beat frequency / magnitude / delay, which can all vary on the order of a few dB, the AO gain settings to get to the crossed over point are not always the same, so at the end it's a lot of small steps and frequent loop measurements.
The FSS crossover and overall IMC loop gain have to be pretty actively managed too. It's all too easy to drive the Pockels cell crazy. And if it's going crazy on its own anyways, there's no hope in trying to pile ALS sensing noise on top of it... It would really help in this effort to fix the whole PC situation up.
Unfortunately, lock is lost when increasing the overall gain on the common mode board even by 1dB. We've seen in the single arm tests, that the gain settings have an appreciable difference in offset between them. Maybe this step is more than what the loop can handle? Or maybe it's the voltage glitches... Maybe some gain reallocation can put me on a region of the slider that glitches less.
In terms of the mystery plant features, I figure I'd like to take the analog TF of AO control signal to, say, AS55, and see what may or may not be there. I just haven't done this tonight since it would involve recabling the analyzer, and I still need frequent loop measurements to get to the crossed over state. Having ITMY misaligned and using the digital AS55Q spectrum as an out of loop monitor has been very helpful.
Swapping between fb and fb1 as DAQ is very straightforward, now that they are both on the DAQ network:
Once daqd starts, the front end mx_stream processes will be restarted by their monits, and be pointing to the new location.
Moving back is just reversing those steps.
I just realized that when running fb1, if a single mx_stream dies they all die.
I've not really been able to make additional progress with the new 'fb1' DAQ. It's still flaky as hell. Therefore we're still using old 'fb'.
The mx_stream processes on the front ends initially run fine, connecting to the daqd and transferring data, with both DAQ-..._STATUS and FE-..._FB_NET_STATUS indicators green. Then after about two minutes all the mx_stream processes on all the front ends die. Monit eventually restarts them all, at which point they come up green for a while until they crash again ~2 minutes later. This is essentially the same situation as reported previously.
In the daqd logs when the mx_streams die:
Aborted 2 send requests due to remote peer 00:30:48:be:11:5d (c1iscex:0) disconnected
Aborted 2 send requests due to remote peer 00:14:4f:40:64:25 (c1ioo:0) disconnected
Aborted 2 send requests due to remote peer 00:30:48:d6:11:17 (c1iscey:0) disconnected
Aborted 2 send requests due to remote peer 00:25:90:0d:75:bb (c1sus:0) disconnected
Aborted 1 send requests due to remote peer 00:30:48:bf:69:4f (c1lsc:0) disconnected
mx_wait failed in rcvr eid=000, reqn=176; wait did not complete; status code is Remote endpoint is closed
disconnected from the sender on endpoint 000
mx_wait failed in rcvr eid=000, reqn=177; wait did not complete; status code is Connectivity is broken between the source and the destination
disconnected from the sender on endpoint 000
mx_wait failed in rcvr eid=000, reqn=178; wait did not complete; status code is Connectivity is broken between the source and the destination
disconnected from the sender on endpoint 000
mx_wait failed in rcvr eid=000, reqn=179; wait did not complete; status code is Connectivity is broken between the source and the destination
disconnected from the sender on endpoint 000
mx_wait failed in rcvr eid=000, reqn=180; wait did not complete; status code is Connectivity is broken between the source and the destination
disconnected from the sender on endpoint 000
[Thu Oct 1 19:00:09 2015] GPS MISS dcu 39 (PEM); dcu_gps=1127786407 gps=1127786425
[Thu Oct 1 19:00:09 2015] GPS MISS dcu 39 (PEM); dcu_gps=1127786408 gps=1127786426
In the mx_stream logs:
controls@c1iscey ~ 0$ /opt/rtcds/caltech/c1/target/fb/mx_stream -r 0 -W 0 -w 0 -s 'c1x05 c1scy c1tst' -d fb1:0
mmapped address is 0x7f0df23a6000
mmapped address is 0x7f0dee3a6000
mmapped address is 0x7f0dea3a6000
send len = 263596
isendxxx failed with status Remote Endpoint Unreachable
disconnected from the sender
While the mx_stream processes are running daqd seems to write out data just fine. At least for the full frames. I manually verified that there is indeed data in the frames that are written.
Eventually, though, daqd itself crashes with the same error that we've been seeing:
main profiler warning: 0 empty blocks in the buffer
I'm not exactly sure what the crashes are coincident with, but it looks like they are also coincident with the writing out of the minute and/or second trend files. It's unclear how it's related to the mx_stream crashes, if at all. The mx_stream crashes happen every couple of minutes, whereas the daqd itself crashes much less frequently.
The new daqd can't handle EDCU files. If an EDCU file is specified (e.g. C0EDCU.ini in our case), the daqd will segfault very soon after startup. This was an issue with the current daqd on fb, but was "fixed" by moving where the EDCU file was specified in the master file.
There are a number of differences between the fb1 and fb configurations:
It's possible those differences could account for the problems (/opt/rtapps/epics incompatible with this Debian install, for instance). Somehow I doubt it. I wonder if all the weird network issues we've been seeing are somehow involved. If the NFS mount of chiara is problematic for some reason that would affect everything that mounts it, which includes all the front ends and fb/fb1.
There are two things to try:
Gautam and Steve,
The decommissioned server from LDAS is retired to the 40m with 32 cores and 128GB of memory in rack 1X7 http://docs.oracle.com/cd/E19121-01/sf.x4600/
I got Steve to get us a new Myrinet fiber network adapter card for fb1:
I just finished installing the card in fb1, and it came up fine. We happened to have a spare fiber, and a spare fiber jack in the DAQ switch, so I went ahead and plugged it in in parallel to the old fb:
controls@fb1:~/rtbuild/trunk 130$ /opt/mx/bin/mx_info
MX Version: 1.2.16
MX Build: controls@fb1:/opt/src/mx-1.2.16 Fri Sep 18 18:32:59 PDT 2015
1 Myrinet board installed.
The MX driver is configured to support a maximum of:
8 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
Instance #0: 364.4 MHz LANai, PCI-E x8, 2 MB SRAM, on NUMA node 0
Status: Running, P0: Link Up
Network: Ethernet 10G
MAC Address: 00:60:dd:43:74:62
Product code: 10G-PCIE-8B-S
Part number: 09-04228
Serial number: 485052
Mapper: 00:60:dd:46:ea:ec, version = 0x00000000, configured
Mapped hosts: 7
INDEX MAC ADDRESS HOST NAME P0
----- ----------- --------- ---
0) 00:60:dd:43:74:62 fb1:0 1,0
1) 00:25:90:0d:75:bb c1sus:0 1,0
2) 00:30:48:be:11:5d c1iscex:0 1,0
3) 00:30:48:d6:11:17 c1iscey:0 1,0
4) 00:30:48:bf:69:4f c1lsc:0 1,0
5) 00:14:4f:40:64:25 c1ioo:0 1,0
6) 00:60:dd:46:ea:ec fb:0 1,0
We can now work on fb1 while fb continues to run and collect data from the front ends.
I'm still not getting the mx_stream connections to the new fb1 daq to work. I'm leaving everything running as is on fb for the moment.
Eric pointed out that the 1x2 couplers that were used in the previous arrangement and which I recycled, were in fact NOT appropriate - they are not 50-50 couplers but 90-10 couplers, which explains the measured power levels I quoted here.
I switched out these for a pair of the newly arrived 2x2 couplers, and have also replaced the datasheets on the inside of the top cover. I then redid the power level measurements, and got some sensible values this time (see Attachment #1 for revised layout and measured power levels, numbers in red are powers for PSL light, numbers in green are for AUX laser light, and all numbers are in mW). I did find that the 90-10 splitter in the PSL+Y path was not working (though the one in the PSL+X path seems to be working fine), and hence, have not quoted power levels at the output of these splitters. For now, I guess we can bypass the splitters and take the PSL+AUX light from the 2x2 couplers directly to the PDs.