Reboots for c1susaux, c1iscaux today.
I've been making NBs on my laptop, thought I would get the copy under version control up-to-date since I've been negligent in doing so.
The code resides in /ligo/svncommon/NoiseBudget, which as a whole is a git directory. For neatness, most of Evan's original code has been put into the sub-directory /ligo/svncommon/NoiseBudget/H1NB/, while my 40m NB specific adaptations of them are in the sub-directory /ligo/svncommon/NoiseBudget/NB40. So to make a 40m noise budget, you would have to clone and edit the parameter file accordingly, and run python C1NB.py C1NB_2017_04_30.py for example. I've tested that it works in its current form. I had to install a font package in order to make the code run (with sudo apt-get install tex-gyre ), and also had to comment out calls to GwPy (it kept throwing up an error related to the package "lal", I opted against trying to debug this problem as I am using nds2 instead of GwPy to get the time series data anyways).
There are a few things I'd like to implement in the NB like sub-budgets, I will make a tagged commit once it is in a slightly neater state. But the existing infrastructure should allow making of NBs from the control room workstations now.
We spent some time trying to get the noise-budgeting code running today. I guess eventually we want this to be usable on the workstations so we cloned the git repo into /ligo/svncommon. The main objective was to see if we had all the dependencies for getting this code running already installed. The way Evan has set the code up is with a bunch of dictionaries for each of the noise curves we are interested in - so we just commented out everything that required real IFO data. We also commented out all the gwpy stuff, since (if I remember right) we want to be using nds2 to get the data.
Running the code with just the gwinc curves produces the plots it is supposed to, so it looks like we have all the required dependencies installed. It now remains to integrate actual IFO data. I will try to set up the infrastructure for this using the archived frame data from the 2016 DRFPMI locks.
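The dictionary-per-trace layout described above can be sketched roughly as follows. This is only an illustration and not Evan's code: the key names (`label`, `asd`), the frequency vector, and the toy noise shapes are all assumptions made up for the example. The total is the quadrature sum of the individual traces, which assumes they are uncorrelated.

```python
import math

# Frequency vector in Hz (hypothetical, log-spaced from 10 Hz to 1 kHz)
freqs = [10 ** (1 + 2 * i / 49) for i in range(50)]

# One dictionary per noise curve. Key names and spectral shapes are
# placeholders for illustration, not the ones used in the real code.
traces = [
    {"label": "Seismic (toy model)", "asd": [1e-9 / f ** 2 for f in freqs]},
    {"label": "Shot (toy model)", "asd": [1e-13 for _ in freqs]},
]


def quadrature_sum(trace_list):
    """Total ASD of uncorrelated noise traces: sqrt of the sum of squares."""
    n = len(trace_list[0]["asd"])
    return [
        math.sqrt(sum(t["asd"][i] ** 2 for t in trace_list))
        for i in range(n)
    ]


total = quadrature_sum(traces)
```

Commenting a trace out of the list, as was done here for everything needing real IFO data, simply removes it from the sum.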
I captured a few images of the beam spot on ETMX at 5ms, 10ms, 14ms, 50ms, 100ms, 500ms, 1000ms exposure and ran them through my python script for HDR images. Here's what I obtained.
The resulting image is an improvement over the highly saturated images at say, 500ms and 1 second exposures.
I also included a colormapped version of the image.
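For reference, one common way to combine an exposure stack is sketched below. It is not necessarily what the script used here does: each pixel is converted to counts per millisecond, and saturated or very dark samples are skipped before averaging. The 8-bit saturation level, the `low`/`high` thresholds, and the tiny 1x3 "images" are assumptions for illustration.

```python
def merge_exposures(images, exposures_ms, low=5, high=250):
    """Merge same-sized grayscale images into a radiance-like map.

    images: list of 2D lists of pixel counts (assumed 8-bit, 0-255).
    exposures_ms: exposure time of each image in milliseconds.
    Samples outside [low, high] are treated as unreliable and skipped.
    Returns a 2D list of floats in counts per millisecond.
    """
    rows, cols = len(images[0]), len(images[0][0])
    merged = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            rates = [
                img[r][c] / t
                for img, t in zip(images, exposures_ms)
                if low <= img[r][c] <= high
            ]
            if rates:
                merged[r][c] = sum(rates) / len(rates)
            else:
                # Every sample was clipped: fall back to the shortest exposure
                k = exposures_ms.index(min(exposures_ms))
                merged[r][c] = images[k][r][c] / exposures_ms[k]
    return merged


# Toy stack: a bright, a medium and a dim pixel at 5 ms and 500 ms
short = [[200, 20, 0]]
long_ = [[255, 255, 40]]
hdr = merge_exposures([short, long_], [5, 500])
```

In this toy case the bright pixels are taken from the 5 ms frame and the dim pixel from the 500 ms frame, which is the basic reason the merged map can span more range than any single exposure.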
i wonder how 'HDR' these images really are. is there a quantitative way to check that we are really getting more bits? also, how many bits does the PNG format allow for monochrome images? i worry that these elog images are already lossy.
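One rough way to put a number on this is to compare the dynamic range, in bits, of a single exposure with that of the merged map, taken here as log2 of the ratio of the largest to the smallest nonzero pixel value. This is only a crude proxy that ignores noise, and the pixel values below are made up. On the file-format side, PNG is lossless and supports grayscale at up to 16 bits per sample, so a merged map would need to be scaled to 16-bit integers before saving in order to keep more than 8 bits.

```python
import math


def dynamic_range_bits(pixels):
    """log2(max / smallest nonzero value) over a flat list of pixel values."""
    nonzero = [p for p in pixels if p > 0]
    if not nonzero:
        return 0.0
    return math.log2(max(nonzero) / min(nonzero))


single_8bit = [1, 17, 255]          # hypothetical single-exposure counts
merged_rates = [0.08, 4.0, 40.0]    # hypothetical merged counts per ms

print(dynamic_range_bits(single_8bit))
print(dynamic_range_bits(merged_rates))
```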
About 2 weeks ago, I noticed some odd behaviour of the LSC TRY data stream. Its DC value seems to be drifting ~10x more than TRX. Both signals come from the transmission QPDs. At the time, we were dealing with various CDS FE issues but things have been stable on that end for the last two weeks, so I looked into this a bit more today. It seems like one particular channel is bad - Quadrant 4 of the ETMY TRANS QPD. Furthermore, there is a bump around 150Hz, and some features above 2kHz, that are only present for the ETMY channels and not the ETMX ones.
Since these spectra were taken with the PSL shutter closed and all the lab room lights off, it would suggest something is wrong in the electronics - to be investigated.
The drift in TRY can be as large as 0.3 (with 1.0 being the transmitted power in the single arm lock). This seems unusually large; indeed, we trigger the arm LSC loops when TRY > 0.3. Attachment #2 shows the second trend of the TRX and TRY 16Hz EPICS channels for 1 day. In the last 12 hours or so, I had left the LSC master switch OFF, but the large drift of the DC value of TRY is clearly visible.
In the short term, we can use the high-gain THORLABS PD for TRY monitoring.
Indeed, the whole point of the high/low gain setup is to never use the QPDs for the single arm work. Only use the high gain Thorlabs PD and then the switchover code uses the QPD once the arm powers are >5.
I don't know how the operation procedure went so higgledy piggledy.
Attachment #1: State of the CDS overview screen as of 9:30 AM this morning, when I came in.
Looks like there may have been a power glitch, although judging by the wall StripTool traces, if there was one, it happened more than 8 hours ago. FB is down atm so can't trend to find out when this happened.
All FEs and FB are unreachable from the control room workstations, but Megatron, Optimus and Chiara are all ssh-able. The latter reports an uptime of 704 days, so all seems okay with its UPS. Slow machines are all responding to ping as well as telnet.
Recovery process to begin now. Hopefully it isn't as complicated as the most recent effort [FAMOUS LAST WORDS]
I am unable to get FB to reboot to a working state. A hard reboot throws it into a loop of "Media Test Failure. Check Cable".
Jetstor RAID array is complaining about some power issues, the LCD display on the front reads "H/W Monitor", with the lower line cycling through "Power#1 Failed", "Power#2 Failed", and "UPS error". Going to 192.168.113.119 on a martian machine browser and looking at the "Hardware information" confirms that System Power #1 and #2 are "Failed", and that the UPS status is "AC power loss". So far I've been unable to find anything on the elog about how to handle this problem, I'll keep looking.
In fact, looks like this sort of problem has happened in the past. It seems one power supply failed back then, but now somehow two are down (but there is a third which is why the unit functions at all). The linked elog thread strongly advises against any sort of power cycling.
Over the day, I have been working on a C++ program to interface with Pylon to capture images and reduce dependence on the Pylon GUI. The program uses the Pylon header files along with opencv headers. While ultimately a wrapper in python may be developed for the program, the current C++ program at,
/users/jigyasa/GigEcode/Grab/Grab.cpp when compiled as
g++ -Wl,--enable-new-dtags -Wl,-rpath,/opt/pylon5/lib64 -o Grab Grab.o -L/opt/pylon5/lib64 -Wl,-E -lpylonbase -lpylonutility -lGenApi_gcc_v3_0_Basler_pylon_v5_0 -lGCBase_gcc_v3_0_Basler_pylon_v5_0 `pkg-config opencv --cflags --libs`
returns an executable file named Grab which can be executed as ./Grab
This captures one image from the camera and displays it, along with the gray value of the first pixel.
I am working on adding more utility to the program such as manually adjusting exposure, gain and also on the python wrapper (Cython has been installed locally on Ottavia for the purpose)!
A bit more digging on the diagnostics page of the RAID array reveals that the two power supplies actually failed on Jun 2 2017 at 10:21:00. Not surprisingly, this was the date and approximate time of the last major power glitch we experienced. Apart from this, the only other error listed on the diagnostics page is "Reading Error" on "IDE CHANNEL 2", but these errors precede the power supply failure.
Perhaps the power supplies are not really damaged, and are just in some funky state since the power glitch. After discussing with Jamie, I think it should be safe to power cycle the Jetstor RAID array once the FB machine has been powered down. Perhaps this will bring back one/both of the faulty power supplies. If not, we may have to get new ones.
The problem with FB may or may not be related to the state of the Jetstor RAID array. It is unclear to me at what point during the boot process we are getting stuck. It may be that because the RAID disk is in some funky state, the boot process is getting disrupted.
After a couple of minutes, the front LCD display seemed to indicate that it had finished running some internal checks. The messages indicating failure of the power units, which were previously displayed constantly on the front LCD panel, were no longer seen. Going back to the control room and checking the web diagnostics page, everything seemed back to normal.
It's possible the fb BIOS got into a weird state. fb definitely has its own local boot disk (*not* diskless boot). Try to get to the BIOS during boot and make sure it's pointing to its local disk to boot from.
If that's not the problem, then it's also possible that fb's boot disk got fried in the power glitch. That would suck, since we'd have to rebuild the disk. If it does seem to be a problem with the boot disk then we can do some invasive poking to see if we can figure out what's up with the disk before rebuilding.
I think this is a boot disk failure. I put the spare 2.5 inch disk into slot #1. The "OK" indicator of the disk became solid green almost immediately, and it was recognized by the BIOS in the boot section as "Hard Disk". In contrast, the "OK" indicator of the original disk in slot #0 keeps flashing, and the BIOS can't find the hard disk.
Jamie suggested verifying that the problem is indeed with the disk and not with the controller, so I tried switching the original boot disk to Slot #1 (from Slot #0 where it normally resides), but the same problem persists - the green "OK" indicator light keeps flashing even in Slot #1, which was verified to be a working slot using the spare 2.5 inch disk. So I think it is reasonable to conclude that the problem is with the boot disk itself.
The disk is a Seagate Savvio 10K.2 146GB disk. The datasheet doesn't explicitly suggest any recovery options. But Table 24 on page 54 suggests that a blinking LED means that the disk is "spinning up or spinning down". Is this indicative of any particular failure mode? Any ideas on how to go about recovery? Is it even possible to access the data on the disk if it doesn't spin up to the nominal operating speed?
If we have a SATA/USB adapter, we can test if the disk is still responding or not. If it is still responding, can we perhaps salvage the files?
Chiara used to have a 2.5" disk that is connected via USB3. As far as I know, we have remote and local backup scripts running (TBC), we can borrow the USB/SATA interface from Chiara.
If the disk is completely gone, we need to rebuild the disk, according to Jamie, and I don't know how to do it. (Don't we have a spare copy?)
Seems like the connector on this particular disk is of the SAS variety (and not SATA). I'll ask Steve to order a SAS to USB cable. In the meantime I'm going to see if the people at Downs have something we can borrow.
I couldn't find an external docking setup for this SAS disk, seems like we need an actual controller in order to interface with it. Mike Pedraza in Downs had such a unit, so I took the disk over to him, but he wasn't able to interface with it in any way that allows us to get the data out. He wants to try switching out the logic board, for which we need an identical disk. We have only one such spare at the 40m that I could locate, but it is not clear to me whether this has any important data on it or not. It has "hda RTLinux" written on its front panel with a sharpie. Mike thinks we can back this up to another disk before trying anything, but he is going to try locating a spare in Downs first. If he is unsuccessful, I will take the spare from the 40m to him tomorrow, first to be backed up, and then for swapping out the logic board.
Chatting with Jamie and Koji, it looks like the options we have are:
I just want to mention that the situation is actually much more dire than we originally thought. The diskless NFS root filesystem for all the front-ends was on that fb disk. If we can't recover it we'll have to rebuild the front end OS as well.
As of right now none of the front ends are accessible, since obviously their root filesystem has disappeared.
Keith Thorne sent us two disks: one has the daqd code and the second is the boot disk for the FE machines. Since Jamie managed to successfully compile the daqd code on FB1 yesterday, we decided to try the following: mount the boot disk KT sent us (using a SATA/USB adapter) on /mnt on FB1, get the FEs booted up, and restart the RT models.
While on FB1, Jamie realized he actually had a copy of the /diskless/root directory, which is the NFS filesystem for the FEs, on FB1. So we decided to try and boot some of the FEs with this (instead of starting from scratch with the disks KT sent us). The way things were set up, the FEs were querying the FB machine as the DHCP server. But today, we followed the instructions here to get the FEs to get their IP address from chiara instead. We also added the line
to /etc/exports followed by exportfs -ra on FB1. At which point the FE machine we were testing (c1lsc) was able to boot up.
However, it looks like the NFS filesystem isn't being mounted correctly, for reasons unknown. We commented out some of the rtcds related lines in /etc/rc.local because they were causing a whole bunch of errors at boot (the lines that were touched have been tagged with today's date).
So in summary, the status as of now is:
We will resume recovery efforts on Monday.
This evening, Gautam helped me with setting up the apparatus for calibrating the GigE for BRDF measurements.
The SP table was chosen to set up the experiment and for this reason a few things including a laser and power meter (presumably set up by Steve) had to be moved around.
We initially started by setting up the Crysta laser with its power source (Crysta #2, 150-190 mW 1064 laser) on the SP table. The Ophir power meter was used to measure the laser power. We discovered that the laser was highly unstable as its output on the power meter fluctuated (kind of periodically) between 40 and 150 mW. The beam spot on the beam card also appeared to validate this change in intensity. So we decided to use another 1064 nm laser instead.
Gautam got the LightWave NPro laser from the PSL table and set it up on the SP table and with this laser the output as measured by the same power meter was quite stable.
We manually adjusted the power to around 150 mW. This was followed by setting up the half wave plate (HWP) with the polarizing beam splitter (PBS), which Gautam did very gently and precisely, while explaining to me how to handle the optics.
On first installing the PBS, we found that the beam was already quite strongly polarized as there seemed to be zero transmission but a strong reflection.
With the HWP in place, we get a control over the transmitted intensity. The reflected beam is directed to a beam dump.
I have taken down the GigE(+mount) at ETMX and wired a spare PoE injector.
We tried to interface with the camera wirelessly through the wireless network extenders but that seems to render an unstable connection to the GigE so while a single shot works okay, a continuous shot on the GigE didn’t succeed.
The GigE was connected to the Martian via Ethernet cable and images were observed using a continuous shot on the Pylon Viewer App on Paola.
We deliberated over the need of a beam expander, but it has been omitted presently. White printer paper is currently being used to model the Lambertian scatterer. So light scattered off the paper was observed at a distance of about 40 cm from the sample.
While proceeding with the calibrations further tonight, we realized a few challenges.
While the CCD is able to observe the beam spot perfectly well, measuring the actual power with the power meter seems to be tricky. As the scattered power is quite low, we can’t actually see any spot using a beam card and hence can’t really ensure if we are capturing the entire beam spot on the active region of the power meter (placed at a distance of ~40cm from the paper) or if we are losing out on some light, all the while ensuring that the power meter and the CCD are in the same plane.
We tried to think of some ways around that, the description of which will follow. Any ideas would be greatly appreciated.
Thanks a ton for all your patience and help Gautam! :)
More to follow..
Power meter only needed to measure power going into the paper not out. We use the BRDF of paper to estimate the power going out given the power going in.
Some days ago, I stumbled upon this github page, by a grad student at KIT who developed this code as he was working with Basler GigE cameras. Since we are having trouble installing SnapPy, I figured I'd give this package a try. Installation was very easy, took me ~10mins, and while there isn't great documentation, basic use is very easy - for instance, I was able to adjust the exposure time, and capture an image, all from Pianosa. The attached is some kind of in-built function rendering of the captured image - it is a piece of paper with some scribbles on it near Jigyasa's BRDF measurement setup on the SP table, but it should be straightforward to export the images in any format we like. I believe the axes are pixel indices.
Of course this is only a temporary solution as I don't know if this package will be amenable to interfacing with EPICS servers etc, but seems like a useful tool to have while we figure out how to get SnapPy working. For instance, the HDR image capture routine can now be written entirely as a Python script, and executed via an MEDM button or something.
A rudimentary example file can be found at /opt/rtcds/caltech/c1/scripts/GigE/PyPylon/examples - some of the dictionary keywords to access various properties of the camera (e.g. Exposure time) are different, but these are easy enough to figure out.
From what I understood from my reading [Large-angle scattered light measurements for quantum-noise filter cavity design studies (refer to https://arxiv.org/abs/1204.2528)], we do the white paper test in order to calibrate the radiometric response, i.e. the response of the CCD sensor to radiance. ‘We convert the image counts measured by the CCD camera into a calibrated measure of scatter. To do this we measure the scattered light from a diffusing sample twice, once with the CCD camera and once with a calibrated power meter. We then compare their readings.’
But thinking about this further, if we assume that the BRDF remains unscaled and estimate the scattered power from the images, we get a calibration factor for the scattered power and the angle dependence of the scattered power!
With this idea in mind, we can now take images of the illuminated paper at different scattering angles and assume the BRDF is the constant value 1/π per steradian.
The scattered power is then Ps = BRDF * Pi * cosθ * Ω, where Pi is the incident power, Ω is the solid angle subtended by the camera and θ is the scattering angle at which the measurement is taken. This must also equal the sum of pixel counts divided by the exposure time, multiplied by some calibration factor.
From these two equations we can obtain the calibration factor of the CCD. And for further BRDF measurements, scale the pixel count/ exposure by this calibration factor.
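As a worked sketch of the two equations above, the calibration factor follows from equating the Lambertian estimate of Ps with CF * (pixel sum / exposure). All numerical values below (incident power, angle, aperture radius, pixel sum, exposure) are hypothetical, and approximating Ω as the aperture area over distance squared is an assumption that only holds when the aperture is small compared to the distance.

```python
import math

BRDF_LAMBERTIAN = 1.0 / math.pi  # per steradian, ideal Lambertian scatterer


def scattered_power(p_incident_w, theta_rad, solid_angle_sr):
    """Ps = BRDF * Pi * cos(theta) * Omega for an ideal Lambertian sample."""
    return BRDF_LAMBERTIAN * p_incident_w * math.cos(theta_rad) * solid_angle_sr


def calibration_factor(p_scattered_w, pixel_sum, exposure_s):
    """Watts per (count / second), from Ps = CF * pixel_sum / exposure."""
    return p_scattered_w * exposure_s / pixel_sum


# Hypothetical numbers, for illustration only
p_in = 1e-3                      # incident power on the paper [W]
theta = math.radians(30)         # scattering angle
aperture_radius = 0.01           # camera lens aperture radius [m]
distance = 0.40                  # sample-to-camera distance [m]
omega = math.pi * aperture_radius ** 2 / distance ** 2  # small-angle solid angle

p_s = scattered_power(p_in, theta, omega)
cf = calibration_factor(p_s, pixel_sum=2.5e6, exposure_s=1e-3)
```

For a later BRDF measurement of an unknown sample, the same `cf` would multiply (pixel sum / exposure) to give the scattered power, which is then divided by Pi * cosθ * Ω.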
Bluebean Optical Tech Limited of Shanghai delivered 50 red ruby prisms with radius. The first prism pictures were taken on June 5,
and the same prism was photographed again later as BB#1.
More samples were selected randomly, one from each bag of 5, and labeled BB#2 through BB#6.
The R10 mm radius can be seen against the ruler edge. The V-groove edge was labeled with blue marker and pictures were taken
from both sides of this ridge. The top view is shown with the wire lying across it.
SOS suspension wire of 43 micron OD was used for calibration; it was placed close to the side that was in focus.
The V-groove ridge surface quality was evaluated on a scale of 1 to 10, with 10 being the best.
Remaining thing to examine: take a picture, from the side, of the ridge that contacts the SOS.
I've been working on improving the 40m FINESSE model I set up sometime last year (where the goal was to model various RC folding mirror scenarios). Specifically, I wanted to get the locking feature of FINESSE working, and also simulate the DRMI (no arms) configuration, which is what I have been working on locking the real IFO to. This elog is a summary of what I have from the last few days of working on this.
GV Edit: EQ pointed out that my method of taking the slope of the error signal to compute the sensing element isn't the most robust - it relies on choosing points to compute the slope that are close enough to the zero crossing and also well within the linear region of the error signal. Instead, FINESSE allows this computation to be done as we do in the real IFO - apply an excitation at a given frequency to an optic and look at the twice-demodulated output of the relevant RFPD (e.g. for the PRCL sensing element in the 1f DRMI configuration, drive PRM and demodulate REFL11 at 11MHz and the drive frequency). Attachment #4 is the sensing matrix recomputed in this way - in this case, it produces almost identical results to the slope method, but I think the double-demod technique is better in that you don't have to worry about selecting points for computing the slope etc.
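The difference between the two approaches can be illustrated with a toy model in plain Python. This is not FINESSE code: the error signal is a made-up function, the RF demodulation stage is skipped entirely, and only the demodulation at the drive frequency is shown. The function names, gain, and drive parameters are all assumptions for the example.

```python
import math


def error_signal(x, gain=3.0):
    """Toy error signal: linear near x = 0, nonlinear further out."""
    return gain * math.sin(x)


def slope_method(dx=1e-3):
    """Sensing element from a finite-difference slope around the zero crossing."""
    return (error_signal(dx) - error_signal(-dx)) / (2 * dx)


def demod_method(amp=1e-3, f_drive=10.0, f_sample=1000.0, n_cycles=20):
    """Sensing element from a sinusoidal drive and lock-in style demodulation."""
    n = int(n_cycles * f_sample / f_drive)
    in_phase = 0.0
    for k in range(n):
        t = k / f_sample
        ref = math.sin(2 * math.pi * f_drive * t)
        in_phase += error_signal(amp * ref) * ref
    # Factor of 2 recovers the amplitude at f_drive; divide by drive amplitude
    return 2 * in_phase / n / amp


print(slope_method(), demod_method())
```

With a small drive amplitude both estimates converge to the same slope, consistent with the near-identical sensing matrices reported here; the slope method additionally depends on picking `dx` inside the linear region.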
After the catastrophic fb disk failure last week we lost essentially the entire front end system (not any of the userapp code, but the front end boot server, operating system, and DAQ). The fb disk was entirely unrecoverable, so we've been trying to rebuild everything from the bits and pieces lying around, and some disks that Keith Thorne sent from LLO. We're trying to get the front ends working first, and will work on recovering daqd after.
Luckily, fb1, which was being configured as an fb replacement, is mostly configured, including having a copy of the front end diskless root image. We set up fb1 as the new boot server, and were able to get the front ends booting again. Unfortunately, we've been having trouble running and building models, so something is still amiss. We've been taking a three-pronged approach to getting the front ends running:
It seems that in all cases we need to rebuild the dolphin drivers from source.
To clarify, we're able to boot the x1boot image with the existing 2.6.25 kernel that we have from fb. The issue with the root.x1boot image is not the kernel version but some of the other support libraries, such as dolphin.
I'll try to get the first two of those done tomorrow, although it's unclear what model updates we'll have to do to get things working with the newer RCG.
All suspensions are damped:
It should be possible at this point to do more recovery, like locking the MC.
Some details on the restore process:
The daqd is not yet running. This is the next task.
I have been taking copious notes and will fully document the restore process once complete.
c1ioo has been giving us a little bit of trouble. The c1ioo model kept crashing and taking down the whole c1ioo host. We found a red light on one of the ADCs (ADC1). We pulled the card and replaced it with a spare from the CDS cabinet. That seemed to fix the problem and c1ioo became more stable.
We've still been seeing a lot of glitching in c1ioo, though, with CPU cycle times frequently (every couple of seconds) running above threshold for all models, up to 200 us. I tried unloading every kernel module I could and shutting down every non-critical process, but nothing seemed to help.
We eventually tried stopping the c1ioo model altogether and that seemed to help quite a bit, dropping the long cycle rate down to something like one every 30 seconds or so. Not sure what that means. We should look into the BIOS again, to see if there could be something interacting with the newer kernel.
So currently the c1ioo model is not running (which is why it's all white in the CDS overview snapshot above). The fact that c1ioo is not running and the remaining models are still occasionally glitching is also causing various IPC errors on auxiliary models (see c1mcs, c1rfm, c1ass, c1asx).
The new RCG tries to do more checks on custom C code, but it seems to be having trouble finding our custom "ccodeio.h" files that live with the C definitions in USERAPPS/*/common/src/. It is unclear why yet. This is causing the RCG to spit out warnings like the following:
Cannot verify the number of ins/outs for C function BLRMS.
File is /opt/rtcds/userapps/release/cds/c1/src/BLRMSFILTER.c
Please add file and function to CDS_SRC or CDS_IFO_SRC ccodeio.h file.
These are just warnings and will not prevent the model from compiling or running. We'll figure out what the problem is to make these go away, but they can be ignored for the time being.
Probably the worst problem we're facing right now is an instability that will occasionally, but not always, cause the entire front end host to freeze up upon unloading an RTS kernel module. This is a known issue with the newer linux kernels (we're using kernel version 3.2.35), and is being looked into.
This is particularly annoying with the machines on the dolphin network, since if one of the dolphin hosts goes down it manages to crash all the models reading from the dolphin network. Since half the time they can't be cleanly restarted, this tends to cause a boot fest with c1sus, c1lsc, and c1ioo. If this happens, just restart those machines, wait till they've all fully booted, then restart all the models on all hosts with "rtcds start all".
This morning, all the c1iscex models were dead. Attachment #1 shows the state of the cds overview screen when I came in. The machine itself was ssh-able, so I just restarted all the models and they came back online without fuss.
This was me. I had rebooted that machine and hadn't restarted the models. Sorry for the confusion.
At around 10:30 AM this morning, the PSL mysteriously shut off. Steve and I confirmed that the NPRO controller had the RED "OFF" LED lit up. It is unknown why this happened. We manually turned the NPRO back on and the PMC has been stably locked for the last hour or so.
There are so many changes to lab hardware/software that have been happening recently, it's not entirely clear to me what exactly was the problem here. But here are the observations:
Steve says that this kind of behaviour is characteristic of a power glitch/surge, but nothing else seems to have been affected (I confirmed that the X and Y end lasers are ON).
Today I got the mx/open-mx networking working for the front ends. This required some tweaking to the network interface configuration for the diskless front ends, and recompiling mx and open-mx for the newer kernel. Again, this will all be documented.
controls@fb1:~ 0$ /opt/mx/bin/mx_info
MX Version: 1.2.16
MX Build: root@fb1:/opt/src/mx-1.2.16 Mon Jul 24 11:33:57 PDT 2017
1 Myrinet board installed.
The MX driver is configured to support a maximum of:
8 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
Instance #0: 364.4 MHz LANai, PCI-E x8, 2 MB SRAM, on NUMA node 0
Status: Running, P0: Link Up
Network: Ethernet 10G
MAC Address: 00:60:dd:43:74:62
Product code: 10G-PCIE-8B-S
Part number: 09-04228
Serial number: 485052
Mapper: 00:60:dd:43:74:62, version = 0x00000000, configured
Mapped hosts: 6
INDEX MAC ADDRESS HOST NAME P0
----- ----------- --------- ---
0) 00:60:dd:43:74:62 fb1:0 1,0
1) 00:30:48:be:11:5d c1iscex:0 1,0
2) 00:30:48:bf:69:4f c1lsc:0 1,0
3) 00:25:90:0d:75:bb c1sus:0 1,0
4) 00:30:48:d6:11:17 c1iscey:0 1,0
5) 00:14:4f:40:64:25 c1ioo:0 1,0
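A check like the one done by eye on this table could be scripted roughly as follows. This is only a sketch: the parsing assumes the table layout shown above, the `EXPECTED` set is simply the five front ends listed there, and in practice the text would come from running /opt/mx/bin/mx_info rather than from a pasted string.

```python
import re

# Mapped-hosts table copied from the mx_info output above
MX_INFO = """\
INDEX MAC ADDRESS HOST NAME P0
----- ----------- --------- ---
0) 00:60:dd:43:74:62 fb1:0 1,0
1) 00:30:48:be:11:5d c1iscex:0 1,0
2) 00:30:48:bf:69:4f c1lsc:0 1,0
3) 00:25:90:0d:75:bb c1sus:0 1,0
4) 00:30:48:d6:11:17 c1iscey:0 1,0
5) 00:14:4f:40:64:25 c1ioo:0 1,0
"""

ROW = re.compile(
    r"^\s*\d+\)\s+((?:[0-9a-f]{2}:){5}[0-9a-f]{2})\s+(\S+):\d+\s",
    re.MULTILINE | re.IGNORECASE,
)


def mapped_hosts(text):
    """Return {hostname: MAC address} for each row of the mapped-hosts table."""
    return {host: mac for mac, host in ROW.findall(text)}


# Front ends we expect to see on the mx network (taken from the table above)
EXPECTED = {"c1iscex", "c1lsc", "c1sus", "c1iscey", "c1ioo"}

hosts = mapped_hosts(MX_INFO)
missing = EXPECTED - set(hosts)
print("missing front ends:", sorted(missing))
```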
I also checked the BIOS on c1ioo and found that the serial port was enabled, which is known to cause timing glitches. I turned off the serial port (and some power management stuff), and rebooted, and all the c1ioo timing glitches seem to have gone away.
It's unclear why this is a problem that's just showing up now. Serial ports have always been a problem, so it seems unlikely this is just a problem with the newer kernel. Could the BIOS have somehow been reset during the power glitch?
In any event, all the front ends are now booting cleanly, with all dolphin and mx networking coming up automatically, and all models running stably:
Now for daqd...
Now that all the front end models are running, I re-aligned the IMC, locked it manually, and then tweaked the alignment some more. The IMC transmission now is hovering around 15300 counts. I re-enabled the Autolocker and FSS Slow loops on Megatron as well.
Currently, I am unable to engage the coil-dewhitening filters without destroying cavity locks. One reason is that the present Oplev servos have a roll-off at high frequencies that is not steep enough - engaging the digital whitening + analog de-whitening just causes the DAC output to saturate. Today, Rana and I discussed some ideas about how to approach this problem. This elog collects these thoughts. As I flesh out these ideas, I will update them in a more complete writeup in T1700363 (placeholder for now). Past relevant elogs: 5376, 9680.
Before the CDS went down, I had taken error signal spectra for the ITMs. I will update this elog tomorrow with these measurements, as well as some noise estimates, to get started.
The RGA did not shut down when the turbo pump controller failed.
Ifo pressure was 5.5 mTorr this morning. The PSL shutter was still open. TP2 controller failed. Interlock closed V1, V4 and VM1
Turbo pump 2 is the fore pump of the Maglev. The pressure here was 3.9 Torr, so the Maglev got warm (~38C), but it was still rotating at the normal 560 Hz with V1 closed.
What I did:
Looked at pressures of Hornet and Super Bee Instru Tech. Inc
Closed all annuloses and VA6, disconnected V4 and VA6 and turned on external fan to cool Maglev
Opened V7 to pump the Maglev fore line with TP3
V1 opened manually when foreline pressure dropped to <2mTorr at P2 and the body temp of the Maglev cooled down to 25-27 C
VM1 opened at 1e-5 Torr
Valve configuration: vacuum normal with annuloses not pumped
Ifo pressure 8.5e-6 Torr -IT at 10am, P2 foreline pressure 64 mTorr, TP3 controller 0.17A 22C 50Krpm
note: all valves open manually, interlock can only close them
While walking down to the X end to reset c1iscex I heard what I would call a "rhythmic squnching" sound coming from under the turbo pump. I would have said the sound was coming from a roughing pump, but none of them are on (as far as I can tell).
Steve maybe look into this??
PS: please call me next time you see the vacuum is not Vacuum Normal
Gautam and Steve,
Spare Varian turbo-V 70 controller, Model 969-9505, sn 21612 was swapped in. It is running the turbo fine @ 50Krpm but it does not allow its V4 valve to be opened.
It turns out that TP2 @ 75Krpm will allow V4 to open and close. This must be a software issue.
So Vacuum Normal is operational if TP2 is running 75,000 rpm
We want to run at 50,000 rpm on the long term.
Note: the RS232 Dsub connector on the back of this controller is mounted rotated by 180 degrees compared to the TP3 controller and the old, failed TP2 controller
PS: controller is shipping out for repair 7-28-2017
Kira Dubrovina and Naomi Wharton received 40m specific basic safety training.
I recompiled daqd on the updated fb1, similar to how I had before, and we're seeing the same instability: the process crashes when it tries to write out the second trend (technically it looks like it crashes while it's trying to write out the full frame while the second trend is also being written out). Jonathan Hanks and I are actively looking into it and I'll provide a further report soon.
I had done some modeling and measurement of some of these noises while I was putting together the initial DRMI noise budget, but I had never put things together in one plot. In Attachment #1, I've plotted the following:
Attachment #2 has an iPython notebook used to generate this plot along with all the data.
Edit 28 Jul 2.30pm: I've added Attachment #3 with traces for different assumed values of the series resistance on the coil driver board - although I have not re-computed the Johnson noise contribution for the various resistances. If we can afford to reduce the actuation range by a factor of 25, then it looks like we get to within a factor of ~5 of the seismic noise at ~150Hz.
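For the Johnson noise contribution that has not yet been recomputed, the scaling can be sketched with the standard expression for the current noise of a resistor, sqrt(4 k_B T / R). The resistance values, temperature, and driver output swing below are placeholders rather than the actual coil driver values, and the coil's own resistance is neglected.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant [J/K]


def johnson_current_asd(r_ohm, temp_k=300.0):
    """Johnson current noise of a resistor, sqrt(4 k_B T / R), in A/rtHz."""
    return math.sqrt(4 * K_B * temp_k / r_ohm)


def max_current(v_max, r_ohm):
    """Peak coil current for a given driver output swing (coil resistance neglected)."""
    return v_max / r_ohm


# Hypothetical values: a nominal series resistance and one 25x larger
r_nominal = 400.0
r_high = 25 * r_nominal
v_max = 10.0  # assumed driver output swing [V]

for r in (r_nominal, r_high):
    print(r, johnson_current_asd(r), max_current(v_max, r))
```

Under these assumptions, raising the series resistance by a factor of 25 reduces the actuation range by 25 but the Johnson current noise only by a factor of 5, so the two do not trade off one for one.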
Attachment #1 - Measured error signal spectrum with the Oplev loop disabled, measured at the IN1 input for ITMY. The y-axis calibration into urad/rtHz may not be exact (I don't know when this was last calibrated).
From this measurement, I've attempted to disentangle what is the seismic noise contribution to the measured plant output.
It remains to characterize various other noise sources.
I have also confirmed that the "QPD" Simulink block, which is what is used for Oplevs, does indeed have the PIT and YAW outputs normalized by the SUM (see Attachment #2). This was not clear to me from the MEDM screen.
GV 30 Jul 5pm: I've included in Attachment #3 the block diagram of the general linear feedback topology, along with the specific "disturbances" and "noises" w.r.t. the Oplev loop. The measured (open loop) error signal spectrum of Attachment #1 (call it y) is given by:
If it turns out that one (or more) term(s) in each of the summations above dominates in all frequency bands of interest, then I guess we can drop the others. An elog with a first pass at a mathematical formulation of the cost-function for controller optimization to follow shortly.
About 3.5 hours ago, all the PSL wall StripTool traces "flatlined", as happens when we had the EPICS freezes in the past - except that all these traces were flat for more than 3 hours. I checked that the c1psl slow machine responded to ping, and I could also telnet into it. I tried opening the StripTool on pianosa and all the traces were responsive. So I simply re-started the PSL StripTool on zita. All traces look responsive now.
This week Jonathan Hanks and I have been trying to diagnose why the daqd has been unstable in the configuration used by the 40m, with data concentrator (dc) and frame writer (fw) in the same process (referred to generically as 'fb'). Jonathan has been digging into the core dumps and source to try to figure out what's going on, but he hasn't come up with anything concrete yet.
As an alternative, we've started experimenting with a daqd configuration with the dc and fw components running in separate processes, with communication over the local loopback interface. The separate dc/fw process model more closely matches the configuration at the sites, although the sites put the dc and fw processes on different physical machines. Our experimentation thus far seems to indicate that this configuration is stable, although we haven't yet tested it with the full configuration, which is what I'm attempting to do now.
Unfortunately I'm having trouble with the mx_stream communication between the front ends and the dc process. The dc does not appear to be receiving the streams from the front ends and is producing a '0xbad' status message for each. I'm investigating.
The PMC was unlocked when I came in ~10mins ago. The wall StripTool traces suggest it has been this way for > 8hours. I was unable to get the PMC to re-lock by using the PMC MEDM screen. The c1psl slow machine responded to ping, and I could also telnet into it. But despite burt-restoring c1psl, I could not get the PMC to lock. So I re-started c1psl by keying the crate, and then burt-restored the EPICS values again. This seems to have done the trick. Both the PMC and IMC are now locked.
Unrelated to this work: It looks like some/all of the FE models were re-started. The x3 gains on the coil outputs of the 2 ITMs and BS, which I had manually engaged when I re-aligned the IFO on Monday, were off, and in general, the IMC and IFO alignment seem much worse now than they were yesterday. I will do the re-alignment later as I'm not planning to use the IFO today.
This was me. I restarted the front ends when I was getting the MX streams working yesterday. I'll try to be more conscientious about logging front end restarts.
In order to test the new daqd config that Jamie has been working on, we felt it would be most convenient for the host name "fb" (martian network IP 192.168.113.202) to point to the physical machine "fb1" (martian network IP 192.168.113.201).
Now, when starting up DTT or dataviewer, the NDS server is automatically found.
More details to follow.
The CDS system is mostly fully recovered at this point. The mx_streams are all flowing from all front ends, and from all models, and the daqd processes are receiving them and writing the data to frames:
Remaining unresolved issues:
Hiro Yamamoto has updated SIS (Static Interferometer Simulation) to allow us to do the MCMC based inference of the 40m arm cavity mirror maps.
The latest version is in git.ligo.org: IFOsim/SIS/
In the examples directory I have put 3 files:
Attached are the plots and the data. The first attached plot is a low resolution one: 200 scans of 100 frequency points each. The second plot is 200 scans of 300 points each.
The run was done assuming perfect LIGO arm params with a random set of Zernike perturbations for each run. The amplitude of each Zernike was chosen from a Normal distribution with a standard deviation of 10 nm.
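The random draw described above can be sketched as follows. This is only an illustration of the sampling step, not the SIS interface: the function name and the number of Zernike terms per run are assumptions, while the 200 runs and the 10 nm standard deviation are taken from the text.

```python
import numpy as np


def sample_zernike_amplitudes(n_runs, n_terms, sigma_m=10e-9, seed=None):
    """Draw one set of Zernike amplitudes [m] per run from a zero-mean Normal distribution."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=0.0, scale=sigma_m, size=(n_runs, n_terms))


# n_terms=10 is a placeholder; the number of Zernike terms actually used is not stated here.
amps = sample_zernike_amplitudes(n_runs=200, n_terms=10, seed=0)
print(amps.shape)
print(f"sample standard deviation: {amps.std() * 1e9:.2f} nm")
```

Each row of `amps` would then be handed to the simulation as the perturbation for one scan.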
We need to come up with a better guess for the initial distribution from which to sample, and also to use the smarter sampling that one does using the MCMC Hammer.
I've been trying to put together the cost-function that will be used to optimize the Oplev loop shape. Here is what I have so far.
All of the terms that we want to include in the cost function can be derived from:
From these, we can derive, for a given controller, C(s):
We can add more terms to the cost function if necessary, but I want to get some minimal set working first. All the "requirements" I've quoted above are just numbers out of my head at the moment, I will refine them once I get some feeling for how feasible a solution is for these requirements.
For a start, I attempted to model the current Oplev loop. The modeling of the plant and open-loop error signal spectrum have been described in the previous elogs in this thread.
I am, however, confused by the controller - the MEDM screen (see Attachment #2) would have me believe that the digital transfer function is FM2*FM5*FM7*FM8*gain(10). However, I get much better agreement between the measured and modelled in-loop error signal if I exclude the overall gain of 10 (see Attachment #1 for the models and Attachment #3 for measurements).
What am I missing? Getting this right will be important in specifying Term #4 in the cost function...
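To see how much an overall factor of 10 changes the modelled in-loop spectrum: assuming the usual negative-feedback relation, the in-loop error spectrum is the open-loop spectrum divided by |1 + L(f)|, where L is the open-loop transfer function. The sketch below is a toy comparison only; the 1/f loop shape, the 5 Hz unity gain frequency, and the flat open-loop spectrum are placeholders, not the measured Oplev plant or filter modules.

```python
import numpy as np


def in_loop_asd(open_loop_asd, olg):
    """In-loop ASD for a negative-feedback loop with (complex) open-loop gain olg."""
    return np.asarray(open_loop_asd, dtype=float) / np.abs(1.0 + np.asarray(olg))


f = np.logspace(-1, 2, 200)   # frequency vector [Hz]
free_asd = np.ones_like(f)    # placeholder open-loop spectrum (flat)
olg = 5.0 / (1j * f)          # toy 1/f loop with a 5 Hz unity gain frequency

without_gain = in_loop_asd(free_asd, olg)
with_gain = in_loop_asd(free_asd, 10.0 * olg)

i = np.argmin(np.abs(f - 1.0))
print(f"suppression ratio near 1 Hz: {without_gain[i] / with_gain[i]:.1f}")
```

Well below the unity gain frequency the two models differ by close to the full factor of 10, so a comparison against the measured in-loop spectrum at low frequencies should discriminate between them.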
GV Edit 2 Aug 0030: As another sanity check, I computed the whitened Oplev control signal given the current loop shape (with sub-optimal high-frequency roll-off). In Attachment #4, I converted the y-axis from urad/rtHz to cts/rtHz using the approximate calibration of 240urad/ct (and the fact that the Oplev error signal is normalized by the QPD sum of ~13000 cts), and divided by 4 to account for the fact that the control signal is sent to 4 coils. It is clear that attempting to whiten the coil driver signals with the present Oplev loop shapes causes DAC saturation. I'm going to use this formulation for Term #4 in the cost function, and to solve a simpler optimization problem first - given the existing loop shape, what is the optimal elliptic low-pass filter to implement such that the cost function is minimized?
There is also the question of how to go about doing the optimization, given that our cost function is a vector rather than a scalar. In the coating optimization code, we converted the vector cost function to a scalar one by taking a weighted sum of the individual components. This worked adequately well.
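The weighted-sum scalarization mentioned above can be written in a few lines. This is a generic sketch, not the coating optimization code itself; the function name, the number of terms, and the weights are placeholders.

```python
import numpy as np


def scalar_cost(terms, weights):
    """Collapse a vector cost function into a scalar via a weighted sum."""
    terms = np.asarray(terms, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if terms.shape != weights.shape:
        raise ValueError("terms and weights must have the same shape")
    return float(np.dot(weights, terms))


# Placeholder values: three cost terms and the relative importance given to each.
print(scalar_cost([1.0, 2.0, 3.0], [1.0, 0.5, 2.0]))
```

The choice of weights fixes the trade-off between terms in advance, which is the main limitation compared to the vector approach.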
But there are techniques for vector cost-function optimization as well, which may work better. Specifically, the question is if we can find the (infinite) solution set for which no one term in the error function can be made better without making another worse (the so-called Pareto front). Then we still have to make a choice as to which point along this curve we want to operate at.
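For a finite set of candidate controllers, the Pareto front can be picked out by discarding every candidate that is dominated by another one. A minimal brute-force sketch, assuming all cost terms are to be minimized; the candidate cost values are made up for illustration.

```python
import numpy as np


def pareto_mask(costs):
    """Boolean mask of the non-dominated rows of costs (all terms minimized)."""
    costs = np.asarray(costs, dtype=float)
    mask = np.ones(costs.shape[0], dtype=bool)
    for i in range(costs.shape[0]):
        # A row dominates row i if it is no worse in every term
        # and strictly better in at least one.
        no_worse = np.all(costs <= costs[i], axis=1)
        strictly_better = np.any(costs < costs[i], axis=1)
        mask[i] = not np.any(no_worse & strictly_better)
    return mask


# Made-up two-term costs for four candidate controllers.
candidates = np.array([[1.0, 4.0], [2.0, 2.0], [3.0, 3.0], [4.0, 1.0]])
print(candidates[pareto_mask(candidates)])
```

This scales as the square of the number of candidates, which is fine for a sanity check but not for a large search.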