ID | Date | Author | Type | Category | Subject
12158 | Wed Jun 8 13:50:39 2016 | jamie | Configuration | CDS | Spectracom IRIG-B card installed on fb1 | [EDIT: corrected name of installed card]
We just installed a Spectracom TSync-PCIe timing card on fb1. The hope is that this will help with the GPS timing synchronization issues we've been seeing in the new daqd on fb1, hopefully eliminating some of the potential failure modes.
The driver, called "symmetricom" in the advLigoRTS source (named after a competing vendor's product), was built and installed (from DCC T1500227):
controls@fb1:~/rtscore/tests/advLigoRTS-40m 0$ cd src/drv/symmetricom/
controls@fb1:~/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom 0$ ls
Makefile stest.c symmetricom.c symmetricom.h
controls@fb1:~/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom 0$ make
make -C /lib/modules/3.2.0-4-amd64/build SUBDIRS=/home/controls/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom modules
make[1]: Entering directory `/usr/src/linux-headers-3.2.0-4-amd64'
CC [M] /home/controls/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom/symmetricom.o
/home/controls/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom/symmetricom.c:59:9: warning: initialization from incompatible pointer type [enabled by default]
/home/controls/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom/symmetricom.c:59:9: warning: (near initialization for ‘symmetricom_fops.unlocked_ioctl’) [enabled by default]
/home/controls/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom/symmetricom.c: In function ‘get_cur_time’:
/home/controls/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom/symmetricom.c:89:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
/home/controls/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom/symmetricom.c: In function ‘symmetricom_init’:
/home/controls/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom/symmetricom.c:188:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
/home/controls/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom/symmetricom.c:222:3: warning: label ‘out_remove_proc_entry’ defined but not used [-Wunused-label]
/home/controls/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom/symmetricom.c:158:22: warning: unused variable ‘pci_io_addr’ [-Wunused-variable]
/home/controls/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom/symmetricom.c:156:6: warning: unused variable ‘i’ [-Wunused-variable]
Building modules, stage 2.
MODPOST 1 modules
CC /home/controls/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom/symmetricom.mod.o
LD [M] /home/controls/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom/symmetricom.ko
make[1]: Leaving directory `/usr/src/linux-headers-3.2.0-4-amd64'
controls@fb1:~/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom 0$ sudo make install
#remove all old versions of the driver
find /lib/modules/3.2.0-4-amd64 -name symmetricom.ko -exec rm -f {} \; || true
find /lib/modules/3.2.0-4-amd64 -name symmetricom.ko.gz -exec rm -f {} \; || true
# Install new driver
install -D -m 644 symmetricom.ko /lib/modules/3.2.0-4-amd64/extra/symmetricom.ko
/sbin/depmod -a || true
/sbin/modprobe symmetricom
if [ -e /dev/symmetricom ] ; then \
rm -f /dev/symmetricom ; \
fi
mknod /dev/symmetricom c `grep symmetricom /proc/devices|awk '{print $1}'` 0
chown controls /dev/symmetricom
controls@fb1:~/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom 0$ ls /dev/symmetricom
/dev/symmetricom
controls@fb1:~/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom 0$ ls -al /dev/symmetricom
crw-r--r-- 1 controls root 250, 0 Jun 8 13:42 /dev/symmetricom
controls@fb1:~/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom 0$
12161 | Thu Jun 9 13:28:07 2016 | jamie | Configuration | CDS | Spectracom IRIG-B card installed on fb1 | Something is wrong with the timing we're getting out of the symmetricom driver, associated with the new Spectracom card.
controls@fb1:~/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom 127$ lalapps_tconvert
1149538884
controls@fb1:~/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom 0$ cat /proc/gps
704637380.00
controls@fb1:~/rtscore/tests/advLigoRTS-40m/src/drv/symmetricom 0$
The GPS time is way off, and it's counting up at something like 900 seconds/second. Something is misconfigured, but I haven't figured out what yet.
The timing distribution module we're using is spitting out what appears to be an IRIG B122 signal (amplitude-modulated 1 kHz carrier), which I think is what we expect. This is being fed into the "AM IRIG input" connector on the card.
Not sure why the driver is spinning so fast, though, with the wrong baseline time. Reboot of the machine didn't help. |
12162 | Thu Jun 9 15:14:46 2016 | jamie | Update | CDS | old fb restarted, test of new daqd on fb1 aborted for time being | I've restarted the old daqd on fb until I can figure out what's going on with the symmetricom driver on fb1.
Steve: Jamie with hair.... long time ago
12166 | Fri Jun 10 12:09:01 2016 | jamie | Configuration | CDS | IRIG-B debugging | Looks like we might have a problem with the IRIG-B output of the GPS receiver.
Rolf came over this morning to help debug the strange symmetricom driver behavior on fb1 with the new Spectracom card. We restarted the machine again, and this time when we loaded the driver it was clocking at a normal rate (one second per second). However, the overall GPS time was still wrong, showing a time in October of this year.
The IRIG-B122 output is supposed to encode the time of year via amplitude modulation of a 1 kHz carrier. The current time of year is:
controls@fb1:~ 0$ TZ=utc date +'%j day, %T'
162 day, 18:57:35
controls@fb1:~ 0$
The absolute year is not encoded, though, so the symmetricom driver has the year offset hard-coded into the driver (yuck), to which it adds the time of year from the IRIG-B signal to get the correct GPS time.
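As a rough sanity check (an illustrative sketch only, not the actual driver code; the constant below should be the GPS time of 2016-01-01 00:00:00 UTC), the expected GPS time can be reconstructed from the day-of-year and time-of-day shown above:
YEAR_START_GPS=1135641617          # GPS time of 2016-01-01 00:00:00 UTC (assumed year offset)
DOY=162                            # day of year decoded from IRIG-B
SOD=$(( 18*3600 + 57*60 + 35 ))    # seconds into the day
echo $(( YEAR_START_GPS + (DOY - 1)*86400 + SOD ))   # ~1149620272, i.e. mid-June 2016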
However, loading the symmetricom module shows the following:
...
[ 1601.607403] Spectracom GPS card on bus 1; device 0
[ 1601.607408] TSYNC PIC BASE 0 address = fb500000
[ 1601.607429] Remapped 0xffffc90017012000
[ 1606.606164] TSYNC NOT receiving YEAR info, defaulting to by year patch
[ 1606.606168] date = 299 days 18:28:1161455320
[ 1606.606169] bcd time = 1161455320 sec 959 milliseconds 398 microseconds 959398630 nanosec
[ 1606.606171] Board sync = 1
[ 1606.616076] TSYNC NOT receiving YEAR info, defaulting to by year patch
[ 1606.616079] date = 299 days 18:28:1161455320
[ 1606.616080] bcd time = 1161455320 sec 969 milliseconds 331 microseconds 969331350 nanosec
[ 1606.616081] Board sync = 1
controls@fb1:~ 0$
Apparently the symmetricom driver thinks it's the 299th day of the year, which of course corresponds to some time in October, which jibes with the GPS time the driver is spitting out.
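For reference, day 299 of 2016 is October 25, which can be double-checked with the same date utility used above:
controls@fb1:~ 0$ TZ=utc date -d 2016-10-25 +%j
299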
Rolf then noticed that the timing module in the VME crate in the adjacent rack, which also receives an IRIG-B signal from the distribution box, was also showing day 299 on its front panel display. We checked and confirmed that the symmetricom card and the VME timing module both agree on the wrong time of year, strongly suggesting that the GPS receiver is outputting bogus data on its IRIG-B output, even though it's showing the correct time on its front panel. We played around with settings in the GPS receiver to no avail. Finally we rebooted the GPS receiver, but it seemed to come up with the same bogus IRIG-B output (again both the symmetricom driver and the VME timing module agree on the wrong day).
So maybe our GPS receiver is busted? Not sure what to try now.
12167 | Fri Jun 10 12:21:54 2016 | jamie | Configuration | CDS | GPS receiver not resetting properly | The GPS receiver (EndRun Technologies box in 1Y5? (rack closest to door)) seems to not be coming back up properly after the reboot. The front panel says that it's "LKD", but the "sync" LED is flashing instead of solid, and the time of year displayed on the front panel is showing day 6. The fb1 symmetricom driver and VME timing module are still both seeing day 299, though. So something definitely seems screwy with the GPS receiver. |
12179 | Tue Jun 14 19:37:40 2016 | jamie | Update | CDS | Overnight daqd test underway | I'm running another overnight test with new daqd software on fb1. The normal daqd process on fb has been shut down, and the front ends are sending their signals to fb1.
fb1 is running separate data concentrator (dc) and frame writer (fw) processes, to see if this is a more stable configuration than the all-in-one framebuilder (fb) that we have been trying to run with. I'll report on the test tomorrow. |
12181 | Wed Jun 15 09:52:02 2016 | jamie | Update | CDS | Very encouraging results from overnight split daqd test | Very encouraging results from the test last night. The new configuration did not crash once overnight, and seemed to write out full, second trend, and minute trend frames without issue. However, full validity of all the written-out frames has not been confirmed.
overview
The configuration under test involves two separate daqd binaries instead of one. We usually run with what is referred to as a "framebuilder" (fb) configuration:
- fb: a single daqd binary that:
- collects the data from the front ends
- collates full data into frame file format
- calculates trend data
- writes frame files to disk.
The current configuration separates the tasks into multiple separate binaries: a "data concentrator" (dc) and a "frame writer" (fw):
- dc:
- collects data from front ends
- collates full data into frame file format
- broadcasts frame files over local network
- fw:
- receives frame files from broadcast
- calculates trend data
- writes frame files to disk
This configuration is more like what is run at the sites, where all the various components are separate and run on separate hardware. In our case, I tried just running the two binaries on the same machine, with the broadcast going over the loopback interface. None of the systems that use separated daqd tasks see the failures that we've been seeing with the all-in-one fb configuration (which other sites like AEI have also seen).
My guess is that there's some busted semaphore somewhere in daqd that's being shared between the concentrator and writer components. The writer component probably acquires the lock while it's writing out the frame, which prevents the concentrator from doing what it needs to be doing while the frame is being written out. That causes the concentrator to lock up and die if the frame writing takes too long (which it seems to almost necessarily do, especially when trend frames are also being written out).
results
The current configuration hasn't been tweaked or optimized at all. There is of course basically no documentation on the meaning of the various daqdrc directives. Hopefully I can get Keith Thorne to help me figure out a well optimized configuration.
There is at least one problem whereby the fw component is issuing an excessively large number of re-transmission requests:
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 6 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 8 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 3 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 5 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 5 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 5 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 5 packets; port 7097
2016-06-15_09:46:22 [Wed Jun 15 09:46:22 2016] Ask for retransmission of 6 packets; port 7097
2016-06-15_09:46:23 [Wed Jun 15 09:46:23 2016] Ask for retransmission of 1 packets; port 7097
It's unclear why. Presumably the retransmission requests are being honored and the fw eventually gets the data it needs; otherwise I would expect to see the appropriate errors.
The data is being written out as expected:
full/11500: total 182G
drwxr-xr-x 2 controls controls 132K Jun 15 09:37 .
-rw-r--r-- 1 controls controls 69M Jun 15 09:37 C-R-1150043856-16.gwf
-rw-r--r-- 1 controls controls 68M Jun 15 09:37 C-R-1150043840-16.gwf
-rw-r--r-- 1 controls controls 68M Jun 15 09:37 C-R-1150043824-16.gwf
-rw-r--r-- 1 controls controls 69M Jun 15 09:36 C-R-1150043808-16.gwf
-rw-r--r-- 1 controls controls 69M Jun 15 09:36 C-R-1150043792-16.gwf
-rw-r--r-- 1 controls controls 68M Jun 15 09:36 C-R-1150043776-16.gwf
-rw-r--r-- 1 controls controls 68M Jun 15 09:36 C-R-1150043760-16.gwf
-rw-r--r-- 1 controls controls 69M Jun 15 09:35 C-R-1150043744-16.gwf
trend/second/11500: total 11G
drwxr-xr-x 2 controls controls 4.0K Jun 15 09:29 .
-rw-r--r-- 1 controls controls 148M Jun 15 09:29 C-T-1150042800-600.gwf
-rw-r--r-- 1 controls controls 148M Jun 15 09:19 C-T-1150042200-600.gwf
-rw-r--r-- 1 controls controls 148M Jun 15 09:09 C-T-1150041600-600.gwf
-rw-r--r-- 1 controls controls 148M Jun 15 08:59 C-T-1150041000-600.gwf
-rw-r--r-- 1 controls controls 148M Jun 15 08:49 C-T-1150040400-600.gwf
-rw-r--r-- 1 controls controls 148M Jun 15 08:39 C-T-1150039800-600.gwf
-rw-r--r-- 1 controls controls 148M Jun 15 08:29 C-T-1150039200-600.gwf
-rw-r--r-- 1 controls controls 148M Jun 15 08:19 C-T-1150038600-600.gwf
trend/minute/11500: total 152M
drwxr-xr-x 2 controls controls 4.0K Jun 15 07:27 .
-rw-r--r-- 1 controls controls 51M Jun 15 07:27 C-M-1150023600-7200.gwf
-rw-r--r-- 1 controls controls 51M Jun 15 04:31 C-M-1150012800-7200.gwf
-rw-r--r-- 1 controls controls 51M Jun 15 01:27 C-M-1150002000-7200.gwf
The frame sizes look more or less as expected, and they seem to be valid as determined with some quick checks with the framecpp command line utilities. |
12183 | Wed Jun 15 11:21:51 2016 | jamie | Update | CDS | still work to do to transition to new configuration/code | Just to be clear, there's still quite a bit of work to fully transition the 40m to this new system/configuration. Once we determine a good configuration we need to complete the install, and modify the setup to run the two binaries instead of just the one. The data is also being written to a RAID on the new fb1, and we need to decide if we should use this new RAID, or try to figure out how to move the old JetStor RAID to the new fb1 machine. |
12191 | Thu Jun 16 16:11:11 2016 | jamie | Update | CDS | upgrade aborted for now | After poking at the new configuration more, it also started to show instability. I couldn't figure out how to make test points or excitations available in this configuration, and after adding in the full set of test point channels and trying to do simple things like plotting channels with dtt, the frame writer (fw) would fall over, apparently unable to keep up with the broadcast from the dc.
I've reverted everything back to the old semi-working fb configuration, and will be kicking this to the CDS group to deal with. |
12201 | Mon Jun 20 11:19:41 2016 | jamie | Configuration | CDS | GPS receiver not resetting properly |
I got the email from them. There was apparently a bug that manifested on February 14, 2016. I'll try the software update today.
http://endruntechnologies.com/pdf/FSB160218.pdf
http://endruntechnologies.com/upgradetemplx.htm |
12202 | Mon Jun 20 14:03:04 2016 | jamie | Configuration | CDS | EndRun GPS receiver upgraded, fixed | I just upgraded the EndRun Technologies Tempus LX GPS receiver timing unit, and it seems to have fixed all the problems.
Thanks to Steve for getting the info from EndRun. There was indeed a bug in the firmware that was fixed with a firmware upgrade.
I upgraded both the system firmware and the firmware of the GPS subsystem:
Tempus LX GPS(root@Tempus:~)-> gntpversion
Tempus LX GPS 6010-0044-000 v 5.70 - Wed Oct 1 04:28:34 UTC 2014
Tempus LX GPS(root@Tempus:~)-> gpsversion
F/W 5.10 FPGA 0416
Tempus LX GPS(root@Tempus:~)->
After reboot the system is fully functional, displaying the correct time, and outputting the correct IRIG-B data, as confirmed by the VME timing unit.
I added a wiki page for the unit: https://wiki-40m.ligo.caltech.edu/NTP
Steve added this picture |
12724 | Mon Jan 16 22:03:30 2017 | jamie | Configuration | Computers | Megatron update |
Quote: |
We should consider upgrading a few of our workstations to Ubuntu 14 LTS to see how painful it is to run our scripts and DTT and DV. Better to upgrade a bit before we are forced to by circumstance.
|
I would recommend upgrading the workstations to one of the reference operating systems, either SL7 or Debian squeeze, since that's what the sites are moving towards. If you do that you can just install all the control room software from the supported repos, and not worry about having to compile things from source anymore. |
12763 | Fri Jan 27 17:49:41 2017 | jamie | Update | CDS | test of new daqd code on fb1 | Just FYI I'm running a test of updated daqd code on fb1.
fb1 has its own fiber to the daq network switch, so nothing had to be modified to do this test. This *should* not affect anything in the rest of the system, but as we all know these are famous last words.... If something is going haywire, and you can't get in touch with me and can't figure out what else to do, you can just log on to fb1 and shut it down. It's not writing any data to any of the network filesystems.
The daqd code under test is from the latest advLigoRTS 3.2.1 tag, which has daqd stability fixes that will hopefully address the problems we were seeing last time I tried this upgrade. We'll see...
I'm going to let it run over the weekend, and will check in periodically. |
12769 | Sat Jan 28 12:05:57 2017 | jamie | Update | CDS | test of new daqd code on fb1 |
Quote: |
I'm not sure if this is related, but since this morning, I've noticed that the data concentrator errors have returned. Looking at daqd.log, there is a 1 second timing mismatch error that is being generated. Usually, manually running ntpdate on the front ends fixes this problem, but it did not work today.
|
If this problem started before ~4pm on Friday then it's probably unrelated, since I didn't start any of these tests until after that. If unexplained problems persist then we can try shutting off the fb1 daqd and see if that helps. |
12770 | Mon Jan 30 18:41:41 2017 | jamie | Update | CDS | TEST ABORTED of new daqd code on fb1 | I just aborted the fb1 test and reverted everything to the nominal configuration. Everything looks to be operating nominally. Front ends are mostly green except for c1rfm and c1asx which are currently not being acquired by the DAQ, and an unknown IPC error with c1daf. Please let me know if any unusual problems are encountered.
The behavior of daqd on fb1 with the latest release (3.2.1) was not improved. After turning on the full pipe it was back to crashing every 10 minutes or so when the full and second trend frames were being written out. Lame. Back to the drawing board... |
12794 | Fri Feb 3 11:03:06 2017 | jamie | Update | CDS | more testing fb1; DAQ DOWN DURING TEST | More testing of fb1 today. DAQ DOWN UNTIL FURTHER NOTICE.
Testing Wednesday did not resolve anything, but Jonathan Hanks is helping. |
12798 | Sat Feb 4 12:20:39 2017 | jamie | Summary | CDS | /cvs/cds/caltech/chans back on svn1.6 |
Quote: |
True - it's an issue. Koji and I are updating zita to Ubuntu 16 LTS. If it looks like it's OK with various tools we'll swap over the others to it. Until then I figure we're best off turning allegra back into Ubuntu 12 to avoid a repeat of this kind of conflict. Once the workstations in the LLO control room are running smoothly on a new OS for a year, we can transfer into that. I don't think any of us wants to be the CDS beta tester for DV or DTT.
|
Just to be clear, since there seems to be some confusion: the SVN issue has nothing to do with Debian vs. Ubuntu. SVN made non-backwards-compatible changes to its working copy data format that break newer checkouts with older clients. You will run into the exact same problem with newer Ubuntu versions.
I recommend the 40m start moving towards the reference operating systems (Debian 8 or SL7) as that's where CDS is moving. By moving to newer Ubuntu versions you're moving away from CDS support, not towards it. |
12799 | Sat Feb 4 12:29:20 2017 | jamie | Summary | CDS | /cvs/cds/caltech/chans back on svn1.6 | No, not confused on that point. We just will not be testing OS versions at the 40m or running multiple OS's on our workstations. As I've said before, we will only move to so-called 'reference' systems once they've been in use for a long time.
Quote: |
Quote: |
True - it's an issue. Koji and I are updating zita to Ubuntu 16 LTS. If it looks like it's OK with various tools we'll swap over the others to it. Until then I figure we're best off turning allegra back into Ubuntu 12 to avoid a repeat of this kind of conflict. Once the workstations in the LLO control room are running smoothly on a new OS for a year, we can transfer into that. I don't think any of us wants to be the CDS beta tester for DV or DTT.
|
Just to be clear, since there seems to be some confusion: the SVN issue has nothing to do with Debian vs. Ubuntu. SVN made non-backwards-compatible changes to its working copy data format that break newer checkouts with older clients. You will run into the exact same problem with newer Ubuntu versions.
I recommend the 40m start moving towards the reference operating systems (Debian 8 or SL7) as that's where CDS is moving. By moving to newer Ubuntu versions you're moving away from CDS support, not towards it.
|
12800 | Sat Feb 4 12:50:01 2017 | jamie | Summary | CDS | /cvs/cds/caltech/chans back on svn1.6 |
Quote: |
No, not confused on that point. We just will not be testing OS versions at the 40m or running multiple OS's on our workstations. As I've said before, we will only move to so-called 'reference' systems once they've been in use for a long time.
|
Ubuntu 16 is not to my knowledge used for any CDS system anywhere. I'm not sure how you expect to have better support for that. There are no pre-compiled packages of any kind available for Ubuntu 16. Good luck. |
13108 | Mon Jul 10 21:03:48 2017 | jamie | Update | General | All FEs down |
Quote: |
However, FB still will not boot up. The error is identical to that discussed in this thread by Intel. It seems FB is having trouble finding its boot disk. I was under the impression that only the FE machines were diskless, and that FB had its own local boot disk - in which case I don't know why this error is showing up. According to the linked thread, it could also be a problem with the network card/cable, but I saw both lights on the network switch port FB is connected to turn green when I powered the machine on, so this seems unlikely. I tried following the steps listed in the linked thread but got nowhere, and I don't know enough about how FB is supposed to boot up, so I am leaving things in this state now.
|
It's possible the fb BIOS got into a weird state. fb definitely has its own local boot disk (*not* diskless boot). Try to get into the BIOS during boot and make sure it's pointing at its local disk to boot from.
If that's not the problem, then it's also possible that fb's boot disk got fried in the power glitch. That would suck, since we'd have to rebuild the disk. If it does seem to be a problem with the boot disk then we can do some invasive poking to see if we can figure out what's up with the disk before rebuilding. |
13115 | Wed Jul 12 14:52:32 2017 | jamie | Update | General | All FEs down | I just want to mention that the situation is actually much more dire than we originally thought. The diskless NFS root filesystem for all the front ends was on that fb disk. If we can't recover it we'll have to rebuild the front end OS as well.
As of right now none of the front ends are accessible, since obviously their root filesystem has disappeared. |
13151 | Sat Jul 29 16:24:55 2017 | jamie | Update | General | PSL StripTool flatlined |
Quote: |
Unrelated to this work: It looks like some/all of the FE models were re-started. The x3 gain on the coil outputs of the 2 ITMs and BS, which I had manually engaged when I re-aligned the IFO on Monday, were off, and in general, the IMC and IFO alignment seem much worse now than it was yesterday. I will do the re-alignment later as I'm not planning to use the IFO today. |
This was me. I restarted the front ends when I was getting the MX streams working yesterday. I'll try to be more conscientious about logging front end restarts. |
13340 | Thu Sep 28 11:13:32 2017 | jamie | Update | CDS | 40m files backup situation |
Quote: |
After consulting with Jamie, we reached the conclusion that the reason why the root of FB1 is so huge is because of the way the RAID for /frames is set up. Based on my googling, I couldn't find a way to exclude the nfs stuff while doing a backup using dd, which isn't all that surprising because dd is supposed to make an exact replica of the disk being cloned, including any empty space. So we don't have that flexibility with dd. The advantage of using dd is that if it works, we have a plug-and-play clone of the boot disk and root filesystem which we can use in the event of a hard-disk failure.
- One option would be to stop all the daqd processes, unmount /frames, and then do a dd backup of the true boot disk and root filesystem.
- Another option would be to use rsync to do the backup - this way we can selectively copy the files we want and ignore the nfs stuff. I suspect this is what we will have to do for the second layer of backup we have planned, which will be run as a daily cron job. But I don't think this approach will give us a plug-and-play replacement disk in the event of a disk failure.
- Third option is to use one of the 2TB HGST drives, and just do a dd backup - some of this will be /frames, but that's okay I guess.
|
This is not quite right. First of all, /frames is not NFS. It's a mount of a local filesystem that happens to be on a RAID. Second, the frames RAID is mounted at /frames. If you do a dd of the underlying block device (in this case /dev/sda*), you're not going to copy anything that's mounted on top of it.
What I was saying about /frames is that I believe there is data in the underlying directory /frames that the frames RAID is mounted on top of. In order to not get that in the copy of /dev/sda4 you would need to unmount the frames RAID from /frames, and delete everything from the /frames directory. This would not harm the frames RAID at all.
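In shell terms, what the previous paragraph describes (plus the simple whole-disk copy mentioned below) would look roughly like the following sketch; /dev/sdX is a stand-in for whatever the backup disk actually is, so double-check device names before running anything like this:
sudo umount /frames                      # get the frames RAID out of the way
sudo rm -rf /frames/*                    # clear stale data in the underlying mountpoint directory
sudo dd if=/dev/sda of=/dev/sdX bs=64M   # clone the whole boot disk to the backup disk
sudo mount /frames                       # remount the frames RAID (assuming an fstab entry)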
But it doesn't really matter because the backup disk has space to cover the whole thing so just don't worry about it. Just dd /dev/sda to the backup disk and you'll just be copying the root filesystem, which is what we want. |
13344 | Fri Sep 29 09:43:52 2017 | jamie | HowTo | CDS | pyawg |
Quote: |
I've modified the __init__.py file located at /ligo/apps/linux-x86_64/cdsutils-480/lib/python2.7/site-packages/cdsutils/__init__.py so that you can now simply import pyawg from cdsutils. On the control room workstations, iPython is set up such that cdsutils is automatically imported as "cds". Now this import also includes the pyawg stuff. So to use some pyawg function, you would just do (for example):
exc=cds.awg.ArbitraryLoop(excChan,excit,rate=fs)
One could also explicitly do the import if cdsutils isn't automatically imported:
from cdsutils import awg
pyawg- away!
Linking this useful instructional elog from Chris here: https://nodus.ligo.caltech.edu:8081/Cryo_Lab/1748
|
? Why aren't you able to just import 'awg' directly? You shouldn't have to import it through cdsutils. Something must be funny with the config. |
13350 | Mon Oct 2 18:50:55 2017 | jamie | Update | CDS | c1ioo DC errors |
Quote: |
- This time the model came back up but I saw a "0x2000" error in the GDS overview MEDM screen.
- Since there are no DACs installed in the c1ioo expansion chassis, I thought perhaps the problem had to do with the fact that there was a "DAC_0" block in the c1x03 simulink diagram - so I deleted this block, recompiled c1x03, and for good measure, restarted all (three) models on c1ioo.
- Now, however, I get the same 0x2000 error on both the c1x03 and c1als GDS overview MEDM screens (see Attachment #1).
|
From page 21 of T1100625, DAQ status "0x2000" means that the channel list is out of sync between the front end and the daqd. This usually happens when you add channels to the model and don't restart the daqd processes, which sounds like it might be applicable here.
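If that's what happened here, the usual fix is just to restart the daqd processes on fb1 after the model change, i.e. the same systemd recipe used elsewhere in this log:
controls@fb1:~ 0$ sudo systemctl restart daqd_*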
It looks like open-mx is loaded fine (via "rtcds lsmod"), even though the systemd unit is complaining. I think this is because the open-mx service is old style and is not intended for module loading/unloading with the new style systemd stuff. |
13383 | Tue Oct 17 17:53:25 2017 | jamie | Summary | LSC | prep for tests of Gabriele's neural network cavity length reconstruction | I've been preparing for testing Gabriele's deep neural network MICH/PRCL reconstruction. No changes to the front end have been made yet, this is all just prep/testing work.
Background:
We have been unable to get Gabriele's nn.c code running in kernel space for reasons unknown (see tests described in previous post). However, Rolf recently added functionality to the RCG that allows front end models to be run in user space, without needing to be loaded into the kernel. Surprisingly, this seems to work very well, and is much more stable for the overall system (starting/stopping the user space models will not ever crash the front end machine). The nn.c code has been running fine on a test machine in this configuration. The RCG version that supports user space models is not that much newer than what the 40m is running now, so we should be able to run user space models on the existing system without upgrading anything at the 40m. Again, I've tested this on a test machine and it seems to work fine.
The new RCG with user space support compiles and installs both kernel and user-space versions of the model.
Work done:
- Create 'c1dnn' model for the nn.c code. This will run on the c1lsc front end machine (on core 6 which is currently empty), and will communicate with the c1lsc model via SHMEM IPC. It lives at:
- /opt/rtcds/userapps/release/isc/c1/models/c1dnn.mdl
- Got latest copy of nn.c code from Gabriele's git, and put it at:
- /opt/rtcds/userapps/release/isc/c1/src/nn/
- Checked out the latest version of the RCG (currently SVN trunk r4532):
- /opt/rtcds/rtscore/test/nn-test
- Set up the appropriate build area:
- /opt/rtcds/caltech/c1/rtbuild/test/nn-test
- Built the model in the new nn-test build directory ("make c1dnn")
- Installed the model from the nn-test build dir ("make install-c1dnn")
Test:
I tried a manual test of the new user space model. Since this is a user space process, running it should have no effect on the rest of the front end system (which it didn't):
- Manually started the c1dnn EPICS IOC:
- Tried running the model user-space process directly:
Unfortunately, the process died with an "ADC TIMEOUT" error. I'm investigating why.
Once we confirm the model runs, we'll add the appropriate SHMEM IPC connections to connect it to the c1lsc model. |
13386 | Wed Oct 18 01:41:32 2017 | jamie | Update | CDS | FEs unresponsive |
Quote: |
While working on the IFO tonight, I noticed that the blinky status lights on c1iscex and c1iscey were frozen (but those on the other 3 FEs seemed fine). But all other lights on the CDS overview screen were green. I couldn't access testpoints from these machines, and the EPICS readbacks for models on these FEs (e.g. Oplev servo inputs, outputs etc.) were frozen at some fixed value. This lasted for a good 5 minutes at least. But the blinky lights started blinking again without me doing anything. Not sure what to make of this. I am also not sure how to diagnose this problem, as trending the slow EPICS records of the CPU execution cycle time (for example) doesn't show any irregularity.
|
So this wasn't just an EPICS freeze? I don't see how this had anything to do with any of the work I did earlier today. I didn't modify any of the running front ends, didn't touch either of the end station machines or the DAQ, and didn't modify the network in any way. I didn't leave anything running.
If you couldn't access test points then it sounds like it was more than just EPICS. It sounds like maybe the end machines somehow fell off the network momentarily. Was there anything else going on at the time? |
13388 | Wed Oct 18 09:21:22 2017 | jamie | Update | CDS | FEs unresponsive |
Quote: |
I was looking at the ASDC channel on dataviewer, and toggling various settings like whitening gain. At some point, the signal just froze. So I quit dataviewer and tried restarting it, at which point it complained about not being able to connect to FB. This is when I brought up the CDS_OVERVIEW medm screen, and noticed the frozen 1pps indicator lights. There was certainly something going on with the end FEs, because I was able to ping the machine, but not ssh into it. Once the 1pps lights came back, I was able to ssh into c1iscex and c1iscey, no problems.
Could it be that some of the mx processes stalled, but the systemctl routine automatically restarted them after some time?
|
An mx_stream glitch would have interrupted data flowing from the front end to the DAQ, but it wouldn't have affected the heartbeat. The heartbeat stop could mean either that the front end process froze, or the EPICS communication stopped. The fact that everything came back fine after a couple of minutes indicates to me that the front end processes all kept running fine. If they hadn't I'm sure the machines would have locked up. The fact that you couldn't connect to the FE machine is also suspicious.
My best guess is that there was a network glitch on the martian network. I don't know how to account for the fact that pings still worked, though. |
13390 | Wed Oct 18 12:14:08 2017 | jamie | Summary | LSC | prep for tests of Gabriele's neural network cavity length reconstruction |
Quote: |
I tried a manual test of the new user space model. Since this is a user space process, running it should have no effect on the rest of the front end system (which it didn't):
- Manually started the c1dnn EPICS IOC:
- Tried running the model user-space process directly:
Unfortunately, the process died with an "ADC TIMEOUT" error. I'm investigating why.
Once we confirm the model runs, we'll add the appropriate SHMEM IPC connections to connect it to the c1lsc model.
|
I tried moving the model to c1ioo, where there are plenty of free cores sitting idle, and the model seems to run fine there. I think the problem was just CPU contention on the c1lsc machine, where there were only two free cores and the kernel was using both for all the rest of the normal user space processes.
So there are two options:
- Use cpuset on c1lsc to tell the kernel to remove all other processes from CPU6 and save it just for the c1dnn model. This should not have any impact on the running of c1lsc, since that's exactly what would be happening if we were running the model in kernel space (i.e. isolating the core for the front end model). The auxiliary support user space processes (epics seq/ioc, awgtpman) should all run fine on CPU0, since that's what usually happens. Linux is only using the additional core since it's there. We don't have much experience with cpuset yet, though, so more offline testing will be required first.
- Run the model on c1ioo and ship the needed signals to/from c1lsc via PCIe dolphin. This is potentially slightly more invasive of a change, and would put more work on the dolphin network, but it should be able to handle it.
I'm going to start testing cpuset offline to figure out exactly what would need to be done. |
13395 | Thu Oct 19 15:42:03 2017 | jamie | Summary | LSC | MICH/PRCL reconstruction neural network running on c1lsc | Gabriele's PRCL/MICH reconstruction neural network is now running on c1lsc. Summary:
- front-end model is called c1dnn, and is running as an experimental user-space process
- c1dnn is getting most of its needed inputs from existing SHMEM IPC outputs from c1lsc
- none of the output signals from the network are being sent anywhere yet (grounded)
- c1dnn has not been integrated in any way into the DAQ etc.; it is being run manually by hand, and will be completely shut down after this test
Simple MEDM screen I made to monitor the input/output signals:

The RTS process seems to run fine, but there is quite a bit of jitter in the CPU_METER, at the 50% level:


It's not running over the limit, but it is jumping around more than I think it should be. Will look into that...
cpuset for cpu isolation for user-space model
The c1dnn model is running on CPU6 on c1lsc. CPU6 was isolated from the rest of the system using cpuset. The "cset" utility was used to create a "system" CPU set that was assigned to CPU0, and the kernel was instructed to move all running processes to that set:
controls@c1lsc:~ 2$ sudo cset set
cset:
Name CPUs-X MEMs-X Tasks Subs Path
------------ ---------- - ------- - ----- ---- ----------
root 0,6 y 0 y 343 0 /
controls@c1lsc:~ 0$ sudo cset set -c 0 -s system --cpu_exclusive
cset: --> created cpuset "system"
controls@c1lsc:~ 0$ sudo cset set
cset:
Name CPUs-X MEMs-X Tasks Subs Path
------------ ---------- - ------- - ----- ---- ----------
root 0,6 y 0 y 342 1 /
system 0 y 0 n 0 0 /system
controls@c1lsc:~ 0$ sudo cset proc --move -f root -t system -k
cset: moving all tasks from root to /system
cset: moving 292 userspace tasks to /system
cset: moving 0 kernel threads to: /system
cset: --> not moving 50 threads (not unbound, use --force)
[==================================================]%
cset: done
controls@c1lsc:~ 0$ sudo cset set
cset:
Name CPUs-X MEMs-X Tasks Subs Path
------------ ---------- - ------- - ----- ---- ----------
root 0,6 y 0 y 50 1 /
system 0 y 0 n 292 0 /system
controls@c1lsc:~ 0$ sudo cset proc --move -f root -t system -k --force
cset: moving all tasks from root to /system
cset: moving 50 kernel threads to: /system
[==================================================]%
cset: **> 29 tasks are not movable, impossible to move
cset: done
controls@c1lsc:~ 0$ sudo cset set
cset:
Name CPUs-X MEMs-X Tasks Subs Path
------------ ---------- - ------- - ----- ---- ----------
root 0,6 y 0 y 29 1 /
system 0 y 0 n 313 0 /system
controls@c1lsc:~ 0$
I then created a set for the RTS process ("rts-c1dnn") on CPU6, and executed the c1dnn model in that set:
controls@c1lsc:~ 0$ sudo cset set -c 6 -s rts-c1dnn --cpu_exclusive
cset: --> created cpuset "rts-c1dnn"
controls@c1lsc:~ 0$ sudo cset set
cset:
Name CPUs-X MEMs-X Tasks Subs Path
------------ ---------- - ------- - ----- ---- ----------
root 0,6 y 0 y 24 2 /
rts-c1dnn 6 y 0 n 0 0 /rts-c1dnn
system 0 y 0 n 340 0 /system
controls@c1lsc:~ 0$ sudo cset proc -s rts-c1dnn --exec /opt/rtcds/caltech/c1/target/c1dnn/bin/c1dnn -- -m c1dnn
cset: --> last message, executed args into cpuset "/rts-c1dnn", new pid is: 27572
sysname = c1dnn
....
When done I just hit Ctrl-C.
I left the cpusets as they are, with all system processes in the "system" set. This should not pose any problems, since it's the identical configuration to what we'd have if a normal kernel-level model were running on CPU6.
The c1dnn process and its EPICS sequencer were shut down after this test. |
13400 | Tue Oct 24 20:14:21 2017 | jamie | Summary | LSC | further testing of c1dnn integration; plugged in to DAQ | In order to try to isolate CPU6 for the c1dnn neural network reconstruction model, I set CPUAffinity in /etc/systemd/system.conf to "0" for the front end machines. This sets the CPU affinity for the init process, so that init and all child processes are run on CPU0. Unfortunately, this does not affect the kernel threads. So after reboot all user space processes were on CPU0, but the kernel threads were still spread around. Will continue trying to isolate the kernel as well...
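For reference, the setting mentioned above is a one-line change in /etc/systemd/system.conf (this only pins init and its children; kernel threads would need something else, e.g. the isolcpus= kernel command-line option, which is the part still to be worked out):
# /etc/systemd/system.conf
[Manager]
CPUAffinity=0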
In any event, this amount of isolation was still good enough to get the c1dnn user space model running fairly stably. It's been running for the last hour without issue.
I added the c1dnn channel and testpoint files to the daqd master file, and restarted daqd_dc on fb1, so now the c1dnn channels and test points are available through dataviewer etc. We were then able to observe the reconstructed signals:


We'll need to set the phase rotation of the demodulated RF PD signals (REFL11, REFL55, AS55, POP22) to match them with what the NN expects... |
13411 | Mon Nov 6 18:22:48 2017 | jamie | Summary | LSC | current procedure for running c1dnn code | This is the current procedure to start the c1dnn model:
$ ssh c1lsc
$ sudo systemctl start rts-epics@c1dnn
$ sudo systemctl start rts-awgtpman@c1dnn
$ sudo /usr/bin/cset proc -s rts-c1dnn --exec /opt/rtcds/caltech/c1/target/c1dnn/bin/c1dnn -- -m c1dnn
...
Then to shutdown:
...
Ctrl-C
$ sudo systemctl stop rts-awgtpman@c1dnn
$ sudo systemctl stop rts-epics@c1dnn
The daqd already knows about this model, so nothing should need to be done to the daqd to make the dnn channels available. |
13480 | Fri Dec 15 01:53:37 2017 | jamie | Update | CDS | CDS recovery, NFS woes |
Quote: |
I would make a detailed post with how the problems were fixed, but unfortunately, most of what we did was not scientific/systematic/repeatable. Instead, I note here some general points (Jamie/Koji can add to / correct me):
- There is a "known" problem with unloading models on c1lsc. Sometimes, running rtcds stop <model> will kill the c1lsc frontend.
- Sometimes, when one machine on the dolphin network goes down, all 3 go down.
- The new FB/RCG means that some of the old commands no longer work. Specifically, telnet fb 8087 followed by shutdown (to fix DC errors) no longer works. Instead, ssh into fb1, and run sudo systemctl restart daqd_*.
|
This should still work, but the address has changed. The daqd was split up into three separate binaries to get around the issue with the monolithic build that we could never figure out. The address of the data concentrator (DC) (which is the thing that needs to be restarted) is now 8083.
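So the old recipe, updated for the new split daqd, would presumably be something along these lines:
telnet fb1 8083
shutdown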
Quote: |
UPDATE 8:20pm:
Koji suggested trying to simply restart the ASS model to see if that fixes the weird errors shown in Attachment #2. This did the trick. But we are now faced with more confusion - during the restart process, the various indicators on the CDS overview MEDM screen froze up, which is usually symptomatic of the machines being unresponsive and requiring a hard reboot. But we waited for a few minutes, and everything mysteriously came back. Over repeated observations and looking at the dmesg of the frontend, the problem seems to be connected with an unresponsive NFS connection. Jamie had noted some time ago that the NFS seems unusually slow. How can we fix this problem? Is it feasible to have a dedicated machine that is not FB1 do the NFS serving for the FEs?
|
I don't think the problem is fb1. The fb1 NFS is mostly only used during front end boot. It's the rtcds mount that's the one that sees all the action, which is being served from chiara. |
16325 | Tue Sep 14 15:57:05 2021 | jamie | Frogs | CDS | fb1 /var full after reboot, caused all sorts of problems | /var on fb1 filled up today, which caused all sorts of CDS issues. I found out about the problem by reading the logs of the services that were having trouble running, in which they complained about not being able to write to disk. I looked at the filesystem status with 'df' and noticed that /var was full, which is where applications write temporary data, and will always cause problems if it's full.
I tracked the issue down to multiple multi-gigabyte log files: /var/log/messages and /var/log/messages.1. They were full of lines like this one:
Aug 29 06:25:21 fb1 kernel: l called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime [... same message repeated thousands of times per line ...]
Seems like something related to the gpstime kernel module?
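The triage itself was just the standard checks, something along the lines of:
controls@fb1:~ 0$ df -h /var                        # shows the /var partition at 100%
controls@fb1:~ 0$ sudo du -sh /var/log/* | sort -h  # points at the multi-gigabyte messages files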
Anyway, I deleted the log files for now, which cleared up the space on /var. Things should be back to normal now, until the logs fill up again... |
16327 | Tue Sep 14 16:44:54 2021 | jamie | Frogs | CDS | fb1 /var full after reboot, caused all sorts of problems | Jonathan Hanks pointed me to this fix to the gpstime kernel module that was unfortunately put in after the 3.4 release that we're currently using:
https://git.ligo.org/cds/advligorts/-/commit/6f6d6e2eb1d3355d0cbfe9fe31ea3b59af1e7348
I hacked the source in place (/usr/src/gpstime-3.4/drv/gpstime/gpstime.c) to get the fix, and then rebuilt the kernel module with dkms:
sudo dkms uninstall gpstime/3.4
sudo dkms install gpstime/3.4
I then stopped daqd_dc, unloaded gpstime, reloaded it, restarted daqd_dc. The messages are no longer showing up in /var/log/messages, so I think we're ok for the moment.
NOTE: the fix will be undone if we for some reason reinstall the advligorts-gpstime-dkms package. There shouldn't be a need to do that, but we should be aware. I'm discussing with Jonathan if we want to try to push out a new debian package to fix this issue... |
6412 | Wed Mar 14 05:26:39 2012 | interferometer task force | Update | General | daytime tasks | The following tasks need to be done in the daytime tomorrow.
- Hook up the DC output of the Y green BBPD on the PSL table to an ADC channel (Jamie / Steve)
- Install fancy suspension matrices on PRM and ITMX [#6365] (Jenne)
- Check if the REFL165 RFPD is healthy or not (Suresh / Koji)
- According to a simulation, the REFL165 demod signal should show a similar amount of signal to that of REFL33.
- But right now it is showing super tiny signals [#6403]
6416 | Wed Mar 14 14:09:01 2012 | interferometer task force | Update | General | daytime tasks |
Quote: |
The following tasks need to be done in the daytime tomorrow.
- Hook up the DC output of the Y green BBPD on the PSL table to an ADC channel (Jamie / Steve)
- Install fancy suspension matrices on PRM and ITMX [#6365] (Jenne)
- Check if the REFL165 RFPD is healthy or not (Suresh / Koji)
- According to a simulation, the REFL165 demod signal should show a similar amount of signal to that of REFL33.
- But right now it is showing super tiny signals [#6403]
|
For ITMX, I used the values from the conlog:
2011/08/12,20:10:12 utc 'C1:SUS[-_]ITMX[-_]INMATRIX'
These are the latest values in the conlog that aren't the basic matrices. Even though we did a round of diagonalization in Sept, and the matrices are saved in a .mat file, it doesn't look like we used the ITMX matrix from that time.
For PRM, I used the matrices that were saved in InputMatricies_16Sept2011.mat, in the peakFit folder, since I couldn't find anything in the conlog other than the basic matrices.
UPDATE: I didn't actually count the number of oscillations until the optics were damped, so I don't have an actual number for the Q, but I feel good about the damping, after having kicked POS of both ITMX and PRM and watching the sensors. |
2140 | Sun Oct 25 14:29:45 2009 | haixing, kiwamu | Configuration | General | SR785 spectrum analyzer | This morning we disconnected the SR785 that was in front of the 1X2 rack, to measure Hall sensor noise.
After a while, we put the SR785 back and re-connected it as it had been.
But the display setup might have been changed a little bit.
2246 | Thu Nov 12 01:18:34 2009 | haixing | Update | SUS | open-loop transfer function of mag levi system (comparison between simulink and measurement) | I built a Simulink model of the magnetic levitation system and tried to explain the dip in the open-loop transfer function that was observed.
One can download the model from the SVN. The corresponding block diagram is shown by the figure below.

Here "Magnet" is equal to inverse of the magnet mass. Integrator "1/s" gives the velocity of the magnet. A further integrator gives the displacement of the magnet.
Different from the free-mass response, the response of the magnet is modified due to the existence of the Eddy-current damping and negative spring in the vertical
direction, as indicated by the feedback loops after two integrals respectively. The motion of the magnet will change the magnetic field strength which in turn will pick
up by the Hall-effect sensor. Unlike the usual case, here the Hall sensor also picks up the magnetic field created by the coil as indicated by the loop below the mechanical
part. This is actually the origin of the dip in the open-loop transfer function. In the figure below, we show the open-loop transfer function and its phase contributed by both
the mechanical motion of the magnet and the Hall sensor with the black curve "Total". The contribution from the mechanical motion alone is shown by the magenta curve
"Mech" which is obtained by disconnecting the Hall sensor loop (I rescale the total gain to fit the measurement data due to uncertainties in those gains indicated in the figure).
The contribution from the Hall sensor alone is indicated by the blue curve "Hall" which is obtained by disconnecting the mechanical motion loop. Those two contributions
have the different sign as shown by the phase plot, and they destructively interfere with each other and create the dip in the open-loop transfer function.

In the following figure, we show the closed-loop response function of the mechanical motion of the magnet.

As we can see, even though the entire closed loop of the circuit is stable, the mechanical motion is unstable around 10 Hz. This simply comes from the fact that
around this frequency the Hall sensor has almost no response to the mechanical motion, due to the destructive interference mentioned above.
In the future, we will replace the Hall sensor with an optical one to get rid of this undesired destructive interference.
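For illustration, here is a minimal numerical sketch of the two-path picture described above (coil force acting through the magnet mechanics versus direct coil-field pickup at the Hall sensor). All numbers are placeholders, not the gains of the actual setup; they are only chosen so that the cancellation dip falls near 10 Hz:

import numpy as np

# Placeholder parameters, not the measured gains of the setup.
m      = 0.05    # magnet mass [kg]
c_eddy = 0.5     # eddy-current damping [N/(m/s)]
k_neg  = -20.0   # negative spring constant [N/m]
g_coil = 5.0     # coil drive [N/V]
g_hall = 40.0    # Hall-sensor gain for magnet motion [V/m]
g_pick = 1.0     # direct coil-field pickup at the Hall sensor [V/V], opposite sign to "Mech"

f = np.logspace(-1, 3, 2000)
s = 2j * np.pi * f

mech  = g_coil * g_hall / (m * s**2 + c_eddy * s + k_neg)  # coil -> motion -> sensor ("Mech")
hall  = g_pick * np.ones_like(s)                           # coil field -> sensor directly ("Hall")
total = mech + hall                                        # the two paths interfere ("Total")

# The dip appears where |mech| ~ g_pick and the two phases differ by ~180 deg.
f_dip = f[np.argmin(np.abs(total))]
print("dip near %.1f Hz" % f_dip)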
|
2274
|
Mon Nov 16 15:18:10 2009 |
haixing | Update | SUS | Stable magnetic levitation without eddy-current damping |
By including a differentiator from 10 Hz to 50 Hz, we increase the phase margin and the resulting
magnetic levitation system is stable even without the help of eddy-current damping.
The new block diagram for the system is the following:

Here the eddy-current damping component is removed, and we add an additional differentiator
circuit built around an OP27G operational amplifier.
In addition, we place the Hall sensor below the magnet to minimize the coupling between
the coil and the Hall sensor.
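One common reading of "a differentiator from 10 Hz to 50 Hz" is a lead stage with a zero at 10 Hz and a pole at 50 Hz. The component values of the OP27G stage are not given in this entry, so the following only shows the idealized shape and the phase lead such a stage would provide:

import numpy as np

# Idealized lead filter: zero at 10 Hz, pole at 50 Hz (assumed reading of the entry).
f_zero, f_pole = 10.0, 50.0
f = np.logspace(0, 3, 7)
s = 2j * np.pi * f
lead = (1 + s / (2 * np.pi * f_zero)) / (1 + s / (2 * np.pi * f_pole))

for fi, hi in zip(f, lead):
    print("%7.1f Hz  |H| = %5.2f  phase = %+6.1f deg"
          % (fi, abs(hi), np.degrees(np.angle(hi))))
# Maximum phase lead of about 42 deg occurs near sqrt(10*50) ~ 22 Hz.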
The resulting levitation system is shown by the figure below:

|
4086
|
Wed Dec 22 11:24:23 2010 |
haixing | Update | SUS | measurement of imbalance in quadrant maglev prototype | Yesterday, a sequence of force and gain measurements was made to determine the imbalance in the
quadrant magnetic-levitation prototype. This imbalance was the reason why it failed to achieve stable levitation.
The configuration is shown schematically by the figure below:

Specifically, the following measurements have been made:
(1) DC force measurement among four pairs of magnets at fixed distance with current of the coils on and off
From this measurement, the DC force between each pair of magnets was determined to be around 1.6 N at a
separation of 1 cm. This measurement also lets us know the gain from voltage to force near the working point.
The force between pair "2" is about 13% stronger than that of the other pairs, which are nearly identical. The force from the
coil is around 0.017 N per volt (levitation of 5 g per 3 V); therefore, we need around 12 V of DC compensation
on pair "2" in order to counterbalance this imbalance. Given the coil resistance of 26 Ohm, this
requires almost 500 mA of DC current (a quick arithmetic check is given after this list). Koji suggested that we need a high-current buffer, instead of what
is being used now.
(2) DC force measurement among four pairs of magnets (with current of the coils off) as a function of distance
From this measurement, we can determine the stiffness of the system. In this case, the stiffness, or the
effective spring constant, is negative, and we need to compensate for it using feedback control. This is
one of the most important parameters for designing the feedback control. The data is still being processed.
(3) Gain measurement of the OSEM, from displacement to voltage.
This measurement is a little bit tricky due to the difficulty of determining the displacement of the flag.
After several measurements, it gave approximately 2 V/cm.
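Referring back to item (1), here is a quick arithmetic check of the numbers quoted there; the inputs are just the figures from the entry (13% imbalance on a 1.6 N pair force, 0.017 N/V coil actuation, 26 Ohm coil):

# Quick arithmetic check of the DC compensation numbers in item (1).
F_pair    = 1.6        # N, DC force of one magnet pair at 1 cm separation
imbalance = 0.13       # pair "2" is ~13% stronger
gain_coil = 0.017      # N/V (about 5 g of levitation per 3 V)
R_coil    = 26.0       # Ohm

dF     = imbalance * F_pair     # extra force to cancel, ~0.21 N
V_comp = dF / gain_coil         # ~12 V of DC compensation
I_comp = V_comp / R_coil        # ~0.47 A, hence the need for a high-current buffer
print("dF = %.2f N, V = %.1f V, I = %.0f mA" % (dF, V_comp, 1e3 * I_comp))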
Plan for the next few days:
From those measurements, all the parameters of the plant and sensor needed to design the
feedback control are known. They should be plugged into the Simulink model to see whether the
old design is appropriate or not. Concerning the experimental part, as a first step we will try to levitate the configuration
with 2 pairs of magnets instead of 4, which is easier to control but still interesting.
|
4906
|
Wed Jun 29 01:23:21 2011 |
haixing | Update | SUS | issues in the current quad maglev system | Here I show several issues that we have encountered in the quad magnetic levitation system. It would be great if you could give
some suggestions and comments (poor haixing is crying for help).
The current setup is shown by the figure below (I took the photo this morning):

Basically, we have one heavy load which is rigidly connected to a plane that we try to levitate. On the corners of the
plane, there are four push-fit permanent magnets. Those magnets are attracted by four other magnets which are
mounted on the four control coils (the DC force is there to counteract gravity). By sensing the position of the plane
with four OSEMs (there are four flags attached to the plane), we try to apply feedback control and levitate the plane.
We have made an analog circuit to realize the feedback, but it has not been successful. The following main issues
need to be solved:
(1) The DC magnetic force is imbalanced; we found that one pair has a stronger DC force than the others. This should
be solvable simply by replacing those magnets with ones of comparable strength to the others.
(2) The OSEM senses not only the vertical motion but also the translational motion. One possible quick solution is to
cover the photodiode and only leave a very thin vertical slit, so that a small translational motion is not sensed.
Maybe this is too crappy. If you have better ideas, please let me know. Koji suggested using reflective sensing
instead of the OSEM, which would also solve the issue that the flags sometimes touch the edge of the OSEM hole and
mess up the sensing.
(3) Cross coupling among different degrees of freedom. Basically, even if the OSEMs only sense the vertical motion,
the motions of the four flags, which are rigidly connected to the plane, are not independent. In the ideal case, we only
need to control pitch, yaw and vertical motion, i.e. three degrees of freedom, while we have four sensing outputs
from the four OSEMs. This means that we need to work out the right control matrix (a toy sketch of this bookkeeping is given below). Right now, we are in some kind of dilemma.
In order to obtain the control matrix, we first have to get the sensing matrix, i.e. calibrate the cross coupling; however, this is
impossible if the system is unstable. This is very different from the case of the quad suspension control used in LIGO,
in which the test mass is stably suspended and it is relatively easy to measure the cross coupling by driving the test mass
with coils. Rana suggested including a mechanical spring between the fixed plane and the levitated plane, so that
we have a stable system to start with. I tried this method today, but I did not figure out a nice way to place the spring,
as there is a hole right in the middle of the fixed plane to let the coil connectors go through. As a first trial, I plan to
replace the stop rubber band (which prevents the plane from getting stuck onto the magnets), shown in the figure, with mechanical
springs. In this case, the levitated plane is held by four springs instead of one. This is not as good as one spring, because
of imbalance among the four, but we can use this setup, at least, to calibrate the cross coupling. Let me know if you come
up with a better solution.
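As a toy illustration of the sensing/control matrix bookkeeping in item (3): four OSEMs at the corners of a square plane, each sensing vertical displacement, and three controlled degrees of freedom (vertical plus the two tilts, which the entry calls pitch and yaw). The geometry and the ideal unit sensor gains are assumptions; the control matrix is simply taken as the pseudo-inverse of the sensing matrix:

import numpy as np

# Idealized sensing matrix for 4 corner OSEMs and 3 DOFs (z, tilt about x, tilt about y).
# Real gains and geometry would have to come from calibration.
a = 0.1  # m, assumed half-width of the plane

corners = np.array([[+a, +a], [-a, +a], [-a, -a], [+a, -a]])  # OSEM (x, y) positions

# Each OSEM reads z + y*tilt_x - x*tilt_y (small angles, unit sensor gain).
S = np.column_stack([np.ones(4), corners[:, 1], -corners[:, 0]])  # 4x3 sensing matrix

# Control (input) matrix: pseudo-inverse maps the 4 OSEM signals onto the 3 DOFs.
C = np.linalg.pinv(S)   # 3x4

dof  = np.array([1e-4, 2e-3, -1e-3])   # test: z [m], tilt_x [rad], tilt_y [rad]
osem = S @ dof                         # what the 4 sensors would read
print(np.allclose(C @ osem, dof))      # True: DOFs recovered from the 4 readouts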
After those issues are solved, we can then implement Jamie's Cymac digital control, which is now under construction,
to achieve levitation. |
4992
|
Tue Jul 19 21:05:55 2011 |
haixing | Update | DAQ | choose the right relay | Rana and I are working on the AA/AI circuit for Cymac. We need relays to bypass certain paths in the circuit, and we just found a nice website
explaining how to choose the right relay:
http://zone.ni.com/devzone/cda/tut/p/id/2774
This piece of information could be useful for others. |
5019
|
Fri Jul 22 15:39:55 2011 |
haixing | Update | SUS | matching the magnets | Yi Xie and Haixing,
We used the Gauss meter to measure the strength distribution of the purchased magnets, which follows a nice Gaussian distribution.
We picked out four pairs (four fixed magnets and four for the levitated plate) that are matched in strength. The force difference is
anticipated to be within 0.2%, and we are going to measure the force as a function of distance to further confirm this.
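A toy version of the pairing step (sort the measured strengths, pair nearest neighbours, keep the best-matched pairs). The strength values below are invented; the actual selection was presumably done directly from the Gauss-meter data:

import numpy as np

# Toy matching step with invented Gauss-meter readings.
rng = np.random.default_rng(0)
strengths = rng.normal(loc=3000.0, scale=30.0, size=20)   # e.g. field readings [Gauss]

order = np.argsort(strengths)
pairs = [(order[i], order[i + 1]) for i in range(0, len(order) - 1, 2)]
pairs.sort(key=lambda p: abs(strengths[p[0]] - strengths[p[1]]))

for i, j in pairs[:4]:                                     # four best-matched pairs
    mism = abs(strengths[i] - strengths[j]) / strengths[i]
    print("magnets %2d & %2d: mismatch %.3f%%" % (i, j, 100 * mism))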
In the coming week, we will measure various transfer functions in the path from the sensors to the coils (the actuators). The obtained
parameters will be put into our model to determine the control scheme. The model is currently written in Mathematica, which can
analyze the stability from the open-loop transfer function.
5022
|
Sun Jul 24 20:36:03 2011 |
haixing | Summary | Electronics | AA filter tolerance analysis | Koji and Haixing,
We did a tolerance analysis to specify the corner frequency for the passive low-pass filtering in the AA filter of the Cymac. The
link to the wiki page for the AA filter goes as follows (one can have a look at the simple schematics):
http://blue.ligo-wa.caltech.edu:8000/40m/Electronics/BNC_Whitening_AA
Basically, we want to add the following passive low-pass filter (boxed) before connecting to the instrumentation amplifier:

Suppose (i) we have a 10% error in the capacitor value and (ii) we want the common-mode rejection
error to be smaller than 0.1% at low frequencies (up to the sampling frequency of 64 kHz); what should the
corner frequency be, or equivalently, what values should the capacitor and resistor of the low-pass filter have?
Given the transfer function of this low-pass filter,
|H(f)| = 1 / sqrt(1 + (f/fc)^2),   with fc = 1/(2 pi R C),
and the error propagation equation for its magnitude,
d|H|/|H| = [(f/fc)^2 / (1 + (f/fc)^2)] * d(RC)/(RC),
we found that the corner frequency needs to be around 640 kHz in order to keep d|H|/|H| below 0.1% up to 64 kHz
with dC/C = 10%.
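A quick numeric check of the 640 kHz figure using the single-pole error model written above; the only inputs are the 10% capacitor tolerance and the 0.1% error budget at 64 kHz from the entry:

import numpy as np

# Fractional magnitude error of a single-pole low-pass:
# d|H|/|H| ~ (f/fc)^2 / (1 + (f/fc)^2) * dC/C
f_max = 64e3     # Hz, highest frequency of interest (sampling frequency)
dC    = 0.10     # 10% capacitor tolerance
err   = 1e-3     # allowed common-mode rejection error (0.1%)

x  = (err / dC) / (1 - err / dC)      # solve for (f_max/fc)^2
fc = f_max / np.sqrt(x)
print("required corner frequency ~ %.0f kHz" % (fc / 1e3))   # ~640 kHz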
|
5024
|
Sun Jul 24 22:19:19 2011 |
haixing | Summary | Electronics | AA filter tolerance analysis |
>> This sort of OK, except the capacitor connects across the (+) terminals of the two input opamps, and does not connect to ground:

>> Also, we don't care about the CMRR at 64 kHz. We care about it at up to 10 kHz, but not above.
In this case, the corner frequency for the low-pass filter would be around 100 kHz in order to satisfy the requirement.
>>And doesn't the value depend on the resistors?
Yes, it does. The error in the resistor (typically 0.1%) is much smaller than that of the capacitor (10%). Since the resistor error propagates in the same way as the capacitor error,
we can ignore it.
Note that the tolerance analysis only specifies the corner frequency (=1/RC), not R and C individually; we still need to choose appropriate
values for R and C with the corner frequency fixed at around 100 kHz, for which we need to consider the output impedances of port 1 and port 2.
|
5038
|
Tue Jul 26 21:11:40 2011 |
haixing | Summary | Electronics | AA filter tolerance analysis | 
Given this new setup, we realized that the previous tolerance analysis was incorrect, because the uncertainty in the capacitance value
does not affect the common-mode rejection: the two paths share the same capacitor. Now only the imbalance of the two resistors is relevant.
The error propagation formula goes as follows:

We require that the common-mode rejection error stays below the specification at low frequencies up to 8 kHz; given the resistor imbalance,
one can easily find that the corner frequency needs to be around 24 kHz.
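The same kind of check can be repeated for this revised analysis, with the resistor imbalance as the only error term. The resistor mismatch assumed in the original entry is not recoverable here, so the ratio of allowed error to resistor mismatch below is an assumption, chosen only because it reproduces the quoted 24 kHz:

import numpy as np

# Same single-pole error model as before, with dR/R as the only error term.
# ratio = (allowed CMRR error) / (resistor mismatch) = 0.1 is an assumption.
f_max = 8e3      # Hz, highest frequency of interest
ratio = 0.1

x  = ratio / (1 - ratio)             # solve for (f_max/fc)^2
fc = f_max / np.sqrt(x)
print("required corner frequency ~ %.0f kHz" % (fc / 1e3))   # ~24 kHz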
|
11579
|
Fri Sep 4 20:42:14 2015 |
gautam, rana | Update | CDS | Checkout of the Wenzel dividers | Some years ago I bought some dividers from Wenzel. For each arm, we have a x256 and a x64 divider. Wired in series, that means we can divide each IR beat by 2^14.
The highest frequency we can read in our digital system is ~8100 Hz. This corresponds to an RF frequency of ~132 MHz, which is about as high as the BBPD goes, but less than the fiber PDs.
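For reference, the division arithmetic behind these numbers (nothing here beyond the figures quoted above):

# Division-ratio arithmetic for the cascaded Wenzel dividers.
div_total = 256 * 64                 # = 16384 = 2**14
f_daq_max = 8100.0                   # Hz, highest frequency readable digitally
f_rf_max  = f_daq_max * div_total    # RF frequency at the divider input
print("total division: %d (2**14), max RF ~ %.1f MHz" % (div_total, f_rf_max / 1e6))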
Today we checked them out:
- They run on +15V power.
- For low RF frequencies (< 40 MHz) the signal level can be as low as -25 dBm.
- For frequencies up to 130 MHz, the signal should be > 0 dBm.
- In all cases, we get a square wave going from 0 ~ 2.5 V, so the limiter inside keeps the output amplitude roughly fixed at a high level.
- When the RF amplitude goes below the minimum, the output gets shaky and eventually drops to 0 V.
Since this seems promising, we're going to make a box on Monday to package both of these. There will be one SMA input and output per channel.
Each channel will have an amplifier, since this need not be a low-noise channel. The ZKL-1R5 seems like a good choice to me: G = 40 dB and +15 dBm output.
Then Gautam will make a frequency counter module in the RCG which can do counting with square waves and not care about the wiggles in the waveform.
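A rough offline sketch of the counting idea, i.e. counting rising edges of the divided square wave over a fixed gate time. The sample rate, gate time and test frequency below are arbitrary, and the real RCG module is of course a separate real-time implementation:

import numpy as np

# Offline illustration: count rising edges of the limiter-squared signal in a gate.
fs   = 65536.0                    # Hz, assumed sample rate of the counter input
gate = 1.0                        # s, gate time (frequency resolution = 1/gate)
f_sq = 5000.0                     # Hz, divided beat frequency to recover

t  = np.arange(int(fs * gate)) / fs
sq = 2.5 * (np.sin(2 * np.pi * f_sq * t) > 0)          # 0 / 2.5 V square wave

edges  = np.sum((sq[1:] > 1.25) & (sq[:-1] <= 1.25))   # rising-edge count in the gate
f_meas = edges / gate
print("measured %.0f Hz (true %.0f Hz)" % (f_meas, f_sq))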
I think this ought to do the trick for our Coarse frequency discriminator. Then our Delay Box ought to be able to have a few MHz range and do all of the Fast ALS Carm that we need. |
11940
|
Wed Jan 20 23:26:10 2016 |
gautam, rana | Update | LSC | PSL and AUX-X temperatures changed | Earlier today, we did a bunch of stuff to see if we could improve the situation with the excess ALS-X noise. Long story short, here are the parameters that were changed, and their initial and final values:
X-end laser diode temperature: 28.5 degrees ---> 31.3 degrees
X-end laser diode current: 1.900 A ---> 1.942 A
X-end laser crystal temperature: 47.43 degrees ---> 42.6 degrees
PSL crystal temperature: 33.43 degrees ---> 29.41 degrees
PSL Diode A temperature: 21.52 degrees ---> 20.75 degrees
PSL Diode B temperature: 22.04 degrees ---> 21.3 degrees
The Y-end laser temperature has not yet been adjusted - this will have to be done to find the Y-beatnote.
Unfortunately, this does not seem to have fixed the problem - I was able to find the beatnote, with amplitude on the network analyzer in the control room consistent with what we've been seeing over the last few days, but as is clear from Attachment 1, the problem persists...
Details:
- PSL shutter was closed and FSS servo input was turned off.
- As I had mentioned in this elog, the beat can now only be found at 47.41 degrees +/- 1 deg, which is a shift of almost 5 degrees from the value set sometime last year, ~42.6 degrees. Rana thought it's not a good idea to keep operating the laser at such a high crystal temperature, so we decided to lower the X-end laser temperature back to 42.6 degrees, and then adjust the PSL temperature appropriately such that we found a beat. The diode temperature was also tweaked (this requires using a small screwdriver to twist the little knob inset into the front panel of the laser controller) - for the end laser, we did not have a dedicated power monitor to optimize the diode temperature by maximizing the current, so we were just doing this by looking at the beat note amplitude on the network analyzer (which wasn't changing by much). So after playing around a little, Rana decided to leave it at 31.3 degrees.
- We then went to the PSL table and swept through the temperature till a beat was found. The PMC wouldn't stay locked throughout the sweep, so we first did a coarse scan, and saw weak (due to the PMC being locked to some weird mode) beatnotes at some temperatures. We then went back to 29.41 degrees, and ran the PMC autolocker script to lock the PMC - a nice large beatnote was found.
- Finally, Rana tweaked the temperatures of the two diodes on the PSL laser controller - here, the optimization was done more systematically, by looking at the PMC transmitted power on the oscilloscope (and also the MEDM screen) as a function of the diode temperature.
- I took a quick look at the ALS out-of-loop noise - and unfortunately, our efforts today do not seem to have noticeably improved anything (although the bump at ~1 kHz is no longer there).
Some details not directly related to this work:
- There are long cables (routed via cable tray) suitable for RF signals that are running from the vertex to either end-table - these are labelled. We slightly re-routed the one running to the X-end, sending it to the IOO rack via the overhead cable tray so that we could send the beat signal from the frequency counter module to the X-end, where we could look at it using an analyzer while also twiddling laser parameters.
- A webcam (that also claims to have two-way audio!) has been (re?)installed on the PSL table. The ethernet connection to the webcam currently goes to the network switch on the IOO rack (though it is unlabelled at the moment)
- The X-end area is due for a clean-up, I will try and do some of this tomorrow.
|
11599
|
Tue Sep 15 15:10:48 2015 |
gautam, ericq, rana | Summary | LSC | PRFPMI lock & various to-do's | I was observing Eric while he was attempting to lock the PRFPMI last night. The handoff from ALS to LSC was not very smooth, and Rana suggested looking at some control signals while parked close to the PRFPMI resonance to get an idea of what frequency bands the noise dominated in. The attached power spectrum was taken while CARM and DARM were under ALS control, and the PRMI was locked using REFL_165. The arm power was fluctuating between 15 and 50. Most of the power seems to be in the 1-5Hz band and the 10-30Hz band.
Rana made a number of suggestions, which I'm listing here. Some of these may directly help the above situation, while the others are with regards to the general state of affairs.
- Reroute both (MC and arm) FF signals to the SUS model
- For MC, bypass LSC
- Rethink the MC FF -
- Leave the arm FF on all the time?
- The positioning of the accelerometer used for MC FF has to be improved - it should be directly below the tank
- The IOO model is over-clocking - needs to be re-examined
- Fix up the DC F2P - Rana mentioned an old (~10 yr) script called F2P ratio; we should look to integrate the Python scripts used for lock-in/demod at the sites with this
- Look to calibrate MC_F
- Implement a high BW CARM servo using ALS
- Gray code implementation for EPICS gain-stepping
|
|