ID | Date | Author | Type | Category | Subject |
9513 | Thu Jan 2 10:15:20 2014 | Jamie | Summary | General | linux1 RAID crash & recovery | Well done Koji! I'm very impressed with the sysadmin skillz. |
9536 | Tue Jan 7 23:53:35 2014 | Jamie | Update | CDS | daqd can't connect to c1vac1, c1vac2 | daqd is logging the following error messages to its log, related to the fact that it can't connect to c1vac1 and c1vac2:
CAC: Unable to connect because "Connection timed out"
CA.Client.Exception...............................................
Warning: "Virtual circuit disconnect"
Context: "c1vac2.martian:5064"
Source File: ../cac.cpp line 1127
Current Time: Tue Jan 07 2014 23:50:53.355609430
..................................................................
CAC: Unable to connect because "Connection timed out"
CA.Client.Exception...............................................
Warning: "Virtual circuit disconnect"
Context: "c1vac1.martian:5064"
Source File: ../cac.cpp line 1127
Current Time: Tue Jan 07 2014 23:50:53.356568469
..................................................................
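A quick sanity check would be to confirm that the slow machines are reachable at all and that their CA servers are listening (hostnames and port taken from the errors above; just a sketch, not something that was run):
ping -c 3 c1vac1.martian
ping -c 3 c1vac2.martian
nc -zv c1vac1.martian 5064    # EPICS Channel Access server port
nc -zv c1vac2.martian 5064
If the pings fail it's the machines or the network; if only the port checks fail, the IOCs on them have probably died.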
Not sure if this is related to the full /frames issue that we've been seeing. |
9574 | Fri Jan 24 13:10:12 2014 | Jamie | HowTo | LSC | Procedure to measure PRC length |
Quote: |
I wrote a MATLAB script that takes as input the measured distances and produces the optical path lengths. The script also produces a drawing of the setup as reconstructed, showing the measurement points, the mirrors, the reference base plates, and the beam path. Here is an example output that can be used to understand which are the five distances to be measured. I used dummy measured distances to produce it.

|
This path does not look correct to me. Maybe it's because this is supposed to represent "optical path lengths" as opposed to actual physical location of optics, but I think locations should be checked. For instance, PRM looks like it's floating in mid-air between the BS and ITMX chambers, and PR2 is not located behind ITMX. Actually, come to think of it, it might just be that ITMX (or the ITMs in general) is in the wrong place?
Here is a similar diagram I produced when building a Finesse model of the 40m, based on the CAD drawing that Manasa is maintaining:

|
9966 | Fri May 16 20:55:18 2014 | Jamie | Frogs | lore | un-full-screening Ubuntu windows with F11 | Last week Rana and I struggled to figure out how to un-full-screen windows on the Ubuntu workstations that appeared to be stuck in some sort of full screen mode such that the "Titlebar" was not on the screen. Nothing seemed to work. We were in despair.
Well, there is now hope: it appears that this really is a "fullscreen" mode that can be activated by hitting F11. It can therefore easily be undone by hitting F11 again. |
10018 | Tue Jun 10 09:25:29 2014 | Jamie | Update | CDS | Computer status: should not be changing names | I really think it's a bad idea to be making all these name changes. You're making things much much harder for yourselves.
Instead of repointing everything to a new host, you should have just changed the DNS to point the name "linux1" to the IP address of the new server. That way you wouldn't need to reconfigure all of the clients. That's the whole point of name service: use a name so that you don't need to point to a number.
Also, pointing to an IP address for this stuff is not a good idea. If the IP address of the server changes, everything will break again.
Just point everything to linux1, and make the DNS entries for linux1 point to the IP address of chiara. You're doing all this work for nothing!
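To illustrate, the whole fix would be a single record in the martian DNS zone, something like this (the address shown is a placeholder, not chiara's actual IP):
linux1      IN      A       192.168.113.xxx     ; chiara's address
Every client keeps pointing at the name linux1; only the name-to-address mapping changes.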
RXA: Of course, I understand what DNS means. I wanted to make the changes to the startup to remove any misconfigurations or spaghetti mount situations (of which we found many). The way the VME162s are designed, changing the name doesn't make the fix - it uses the number instead. And, of course, the main issue was not the DNS, but just that we had to set up RSH on the new machine. This is all detailed in the ELOG entries we've made, but it might be difficult to understand remotely if you are not familiar with the 40m CDS system. |
10033 | Thu Jun 12 15:31:47 2014 | Jamie | Update | CDS | Note on cables for talking to slow computers |
Quote: |
We have (now) in the lab 2 cables that are RJ45-DB9. The gray one is LIGO-made, while the blue one is store-bought.
The gray LIGO-made one works, but the blue store-bought one does not. I checked their pinouts, and they are completely different. In the sketch below, the pictures of the connectors are drawn as if looking at them face-on, with the cables going out the back of the page. The DB9 is female.
|
There are RJ45-DB9 adapters in the big spinny rack next to the linux1 rack that are for this exact purpose. Just use a standard ethernet cable with them. |
10040 | Sun Jun 15 14:26:30 2014 | Jamie | Omnistructure | CDS | cdsutils re-installed |
Quote: |
CDSUTILS is also gone from the path on all the workstations, so we need Jamie to tell us by ELOG how to set it up, or else we have to use ezcaread / ezcawrite forever.
|
It's in the elog already: http://nodus.ligo.caltech.edu:8080/40m/9922
But it seems like things still haven't fully recovered, or have recovered to an old state? Why is the cdsutils install I previously did in /ligo/apps now missing? It seems like other directories are missing as well.
There's also a user:group issue with the /home/cds mounts. Everything in those mount points is owned nobody:nogroup.
I also can't log into pianosa and rosalba. |
10041 | Sun Jun 15 14:41:08 2014 | Jamie | Omnistructure | CDS | cdsutils re-installed |
Quote: |
Quote: |
CDSUTILS is also gone from the path on all the workstations, so we need Jamie to tell us by ELOG how to set it up, or else we have to use ezcaread / ezcawrite forever.
|
It's in the elog already: http://nodus.ligo.caltech.edu:8080/40m/9922
But it seems like things still haven't fully recovered, or have recovered to an old state? Why is the cdsutils install I previously did in /ligo/apps now missing? It seems like other directories are missing as well.
There's also a user:group issue with the /home/cds mounts. Everything in those mount points is owned nobody:nogroup.
I also can't log into pianosa and rosalba.
|
I also still think it's a bad idea for everything to be mounting /home/cds from an IP address. Just make a new DNS entry for linux1 and leave everything as it was. |
10190 | Sun Jul 13 11:37:36 2014 | Jamie | Update | Electronics | New Prologix GPIB-Ethernet controller |
Quote: |
I have configured a NEW Prologix GPIB-Ethernet controller to use with HP8591E Spectrum analyzer that sits right next to the control room computers.
Static IP: 192.168.113.109
Mask: 255.255.255.0
Gateway: 192.168.113.2
I have no clue how to give it a name like "something.martian" and to update the martian host table (Somebody please help!!)
|
The instructions for adding a name to the martian DNS table are in the wiki page that I pointed you to:
https://wiki-40m.ligo.caltech.edu/Martian_Host_Table |
10276 | Sat Jul 26 13:38:34 2014 | Jamie | Update | General | Data Acquisition from FC into EPICS Channels |
Quote: |
I succeeded in creating a new channel access server hosted on domenica ( R Pi) for continuous data acquisition from the FC into accessible channels. For this I have written a ctypes interface between EPICS and the C interface code to write data into the channels. The channels which I created are:
C1:ALS-X-BEAT-NOTE-FREQ
C1:ALS-Y-BEAT-NOTE-FREQ
The scripts I have written for this can be found in:
db script in: /users/akhil/fcreadoutIoc/fcreadoutApp/Db/fcreadout.db
Python code: /users/akhil/fcreadoutIoc/pycall
C code: /users/akhil/fcreadoutIoc/FCinterfaceCcode.c
I will give the standard channel names (similar to the names on the channel root) once the testing is completed and I confirm that data from the FC is consistent with the C code readout. Once ready I will run the code forever so that both the server and data acquisition are always in process.
Yesterday, when I set out to test the channel, I faced a few serious issues in booting the raspberry pi. However, I have backed up the files on the Pi and will try to debug the issue very soon (I will test with Eric Q's R Pi).
To run these codes one must be root (sudo python pycall, sudo ./FCinterfaceCcode) because the HID devices can be written to only by root (should look into solving this issue).
Instructions for Installation of EPICS, and how to create channel server on Pi will be described in detail in 40m Wiki ( FOLL page).
|
controls@rossa|~ 2> ls /users/akhil/fcreadoutIoc
ls: cannot access /users/akhil/fcreadoutIoc: No such file or directory
controls@rossa|~ 2>
This code should be in the 40m SVN somewhere, not just stored on the RPi.
I'm still confused why python is in the mix here at all. It doesn't make any sense at all that a C program (EPICS IOC) would be calling out to a python program (pycall) that then calls out to a C program (FCinterfaceCcode). That's bad programming. Streamline the program and get rid of python.
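For what it's worth, the EPICS side of this can be very small: one record per channel in the .db, with the C readout code posting values to it directly. A hypothetical record (the record type and fields here are guesses, not copied from fcreadout.db):
record(ai, "C1:ALS-X-BEAT-NOTE-FREQ")
{
    field(DESC, "X arm beat note frequency")
    field(PREC, "3")
}
No python layer is needed for that.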
You also definitely need to fix whatever the issue is that requires running the program as root. We can't have programs like this run as root. |
11384 | Tue Jun 30 11:33:00 2015 | Jamie | Summary | CDS | prepping for CDS upgrade | This is going to be a big one. We're at version 2.5 and we're going to go to 2.9.3.
RCG components that need to be updated:
- mbuf kernel module
- mx_stream driver
- iniChk.pl script
- daqd
- nds
Supporting software:
- EPICS 3.14.12.2_long
- ldas-tools (framecpp) 1.19.32-p1
- libframe 8.17.2
- gds 2.16.3.2
- fftw 3.3.2
Things to watch out for:
- RTS 2.6:
- raw minute trend frame location has changed (CRC-based subdirectory)
- new kernel patch
- RTS 2.7:
- supports "commissioning frames", which we will probably not utilize. need to make sure that we're not writing extra frames somewhere
- RTS 2.8:
- "slow" (EPICS) data from the front-end processes is acquired via DAQ network, and not through EPICS. This will increase traffic on the DAQ lan. Hopefully this will not be an issue, and the existing network infrastructure can handle it, but it should be monitored.
|
11390 | Wed Jul 1 19:16:21 2015 | Jamie | Summary | CDS | CDS upgrade in progress | The CDS upgrade is now underway.
Here's what's happened so far:
/opt/rtcds/rtscore/tags/advLigoRTS-2.9.4
That's it for today. Will pick up again first thing tomorrow |
11393 | Tue Jul 7 18:27:54 2015 | Jamie | Summary | CDS | CDS upgrade: progress! | After a couple of days of struggle, I made some progress on the CDS upgrade today:

Front end status:
- RTS upgraded to 2.9.4, and linked in as "release":
/opt/rtcds/rtscore/release -> tags/advLigoRTS-2.9.4
- mbuf kernel module built and installed
- All front ends have been rebooted with the latest patched kernel (from 2.6 upgrade)
- All models have been rebuilt, installed, restarted. Only minor model issues had to be corrected (unterminated unused inputs mostly).
- awgtpman rebuilt, and installed/running on all front-ends
- open-mx upgraded to 1.5.2:
/opt/open-mx -> open-mx-1.5.2
- All front ends running latest version of mx_stream, built against 2.9.4 and open-mx-1.5.2.
We have new GDS overview screens for the front end models:

It's possible that our current lack of IRIG-B GPS distribution means that the 'TIM' status bit will always be red on the IOP models. Will consult with Rolf.
There are other new features in the front ends that I can get into later.
DAQ (fb) status:
- daqd and nds rebuilt against 2.9.4, both now running on fb
40m daqd compile flags:
cd src/daqd
./configure --enable-debug --disable-broadcast --without-myrinet --with-mx --enable-local-timing --with-epics=/opt/rtapps/epics/base --with-framecpp=/opt/rtapps/framecpp
make
make clean
install daqd /opt/rtcds/caltech/c1/target/fb/
However, daqd has unfortunately been very unstable, and I've been trying to figure out why. I originally thought it was some sort of timing issue, but now I'm not so sure.
I had to make the following changes to the daqdrc:
set gps_leaps = 820108813 914803214 1119744016;
That enumerates some list of leap seconds since some time. Not sure if that actually does anything, but I added the latest leap seconds anyway:
set symm_gps_offset=315964803;
This updates the silly, arbitrary GPS offset, which is required to be correct when not using an external GPS reference.
Finally, the last thing I did that finally got it running stably was to turn off all trend frame writing:
# start trender;
# start trend-frame-saver;
# sync trend-frame-saver;
# start minute-trend-frame-saver;
# sync minute-trend-frame-saver;
# start raw_minute_trend_saver;
For whatever reason, it's the trend frame writing that was causing daqd to fall over after a short amount of time. I'll continue investigating tomorrow.
We still have a lot of cleanup, burt restores, testing, etc. to do, but we're getting there. |
11396 | Wed Jul 8 20:37:02 2015 | Jamie | Summary | CDS | CDS upgrade: one step forward, two steps back | After determining yesterday that all the daqd issues were coming from the frame writing, I started to dig into it more today. I also spoke to Keith Thorne, and got some good suggestions from Gerrit Kuhn at GEO.
I realized that it probably wasn't the trend writing per se, but that turning on more writing to disk was causing increased load on daqd, and consequently on the system itself. With more frame writing turned on, the memory consumption increased to the point of maxing out the physical RAM. The system then probably started swapping, which certainly would have choked daqd.
I noticed that fb only had 4G of RAM, which Keith suggested was just not enough. Even if the memory consumption of daqd has increased significantly, it still seems like 4G would not be enough. I opened up fb only to find that fb actually had 8G of RAM installed! Not sure what happened to the other 4G, but somehow they were not visible to the system. Koji and I eventually determined, via some frankenstein operations with megatron, that the RAM was just dead. We then pulled 4G of RAM from megatron and replaced the bad RAM in fb, so that fb now has a full 8G of RAM.
Unfortunately, when we got fb fully back up and running we found that fb is not able to see any of the other hosts on the data concentrator network. mx_info, which displays the card and network status for the Myricom Myrinet fiber card, shows:
MX Version: 1.2.16
MX Build: controls@fb:/opt/src/mx-1.2.16 Tue May 21 10:58:40 PDT 2013
1 Myrinet board installed.
The MX driver is configured to support a maximum of:
8 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
===================================================================
Instance #0: 299.8 MHz LANai, PCI-E x8, 2 MB SRAM, on NUMA node 0
Status: Running, P0: Wrong Network
Network: Myrinet 10G
MAC Address: 00:60:dd:46:ea:ec
Product code: 10G-PCIE-8AL-S
Part number: 09-03916
Serial number: 352143
Mapper: 00:60:dd:46:ea:ec, version = 0x63e745ee, configured
Mapped hosts: 1
ROUTE COUNT
INDEX MAC ADDRESS HOST NAME P0
----- ----------- --------- ---
0) 00:60:dd:46:ea:ec fb:0 D 0,0
Note that all front end machines should be listed in the table at the bottom, and they're not. Also note the "Wrong Network" note in the Status line above. It appears that the card has maybe been initialized in a bad state? Or Koji and I somehow disturbed the network when we were cleaning up things in the rack. "sudo /etc/init.d/mx restart" on fb doesn't solve the problem. We even rebooted fb and it didn't seem to help.
In any event, we're back to no data flow. I'll pick up again tomorrow. |
11397 | Wed Jul 8 21:02:02 2015 | Jamie | Summary | CDS | CDS upgrade: another step forward, so we're back to where we started (plus a bit?) | Koji did a bit of googling to determine that the 'Wrong Network' status message could be explained by the fb myrinet operating in the wrong mode:
(This was the useful link to track down the issue (KA))
Network: Myrinet 10G
I didn't notice it before, but we should in fact be operating in "Ethernet" mode, since that's the fabric we're using for the DC network. Digging a bit deeper we found that the new version of mx (1.2.16) had indeed been configured with a different compile option than the 1.2.15 version had:
controls@fb ~ 0$ grep '$ ./configure' /opt/src/mx-1.2.15/config.log
$ ./configure --enable-ether-mode --prefix=/opt/mx
controls@fb ~ 0$ grep '$ ./configure' /opt/src/mx-1.2.16/config.log
$ ./configure --enable-mx-wire --prefix=/opt/mx-1.2.16
controls@fb ~ 0$
So that would entirely explain the problem. I re-linked mx to the older version (1.2.15), reloaded the mx drivers, and everything showed up correctly:
controls@fb ~ 0$ /opt/mx/bin/mx_info
MX Version: 1.2.12
MX Build: root@fb:/root/mx-1.2.12 Mon Nov 1 13:34:38 PDT 2010
1 Myrinet board installed.
The MX driver is configured to support a maximum of:
8 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
===================================================================
Instance #0: 299.8 MHz LANai, PCI-E x8, 2 MB SRAM, on NUMA node 0
Status: Running, P0: Link Up
Network: Ethernet 10G
MAC Address: 00:60:dd:46:ea:ec
Product code: 10G-PCIE-8AL-S
Part number: 09-03916
Serial number: 352143
Mapper: 00:60:dd:46:ea:ec, version = 0x00000000, configured
Mapped hosts: 6
ROUTE COUNT
INDEX MAC ADDRESS HOST NAME P0
----- ----------- --------- ---
0) 00:60:dd:46:ea:ec fb:0 1,0
1) 00:25:90:0d:75:bb c1sus:0 1,0
2) 00:30:48:be:11:5d c1iscex:0 1,0
3) 00:30:48:d6:11:17 c1iscey:0 1,0
4) 00:30:48:bf:69:4f c1lsc:0 1,0
5) 00:14:4f:40:64:25 c1ioo:0 1,0
controls@fb ~ 0$
The front end hosts are also showing good omx info (as they had been previously):
controls@c1lsc ~ 0$ /opt/open-mx/bin/omx_info
Open-MX version 1.5.2
build: controls@fb:/opt/src/open-mx-1.5.2 Tue May 21 11:03:54 PDT 2013
Found 1 boards (32 max) supporting 32 endpoints each:
c1lsc:0 (board #0 name eth1 addr 00:30:48:bf:69:4f)
managed by driver 'igb'
Peer table is ready, mapper is 00:30:48:d6:11:17
================================================
0) 00:30:48:bf:69:4f c1lsc:0
1) 00:60:dd:46:ea:ec fb:0
2) 00:25:90:0d:75:bb c1sus:0
3) 00:30:48:be:11:5d c1iscex:0
4) 00:30:48:d6:11:17 c1iscey:0
5) 00:14:4f:40:64:25 c1ioo:0
controls@c1lsc ~ 0$
This got all the mx_stream connections back up and running.
Unfortunately, daqd is back to being a bit flaky. With all frame writing enabled we saw daqd crash again. I then shut off all trend frame writing and we're back to a marginally stable state: we have data flowing from all front ends, and full frames are being written, but not trends.
I'll pick up on this again tomorrow, and maybe try to rebuild the new version of mx with the proper flags. |
11398 | Thu Jul 9 13:26:47 2015 | Jamie | Summary | CDS | CDS upgrade: new mx 1.2.16 installed | I rebuilt/installed mx 1.2.16 to use "ether-mode", instead of the default MX-10G:
controls@fb /opt/src/mx-1.2.16 0$ ./configure --enable-ether-mode --prefix=/opt/mx-1.2.16
...
controls@fb /opt/src/mx-1.2.16 0$ make
..
controls@fb /opt/src/mx-1.2.16 0$ make install
...
I then rebuilt/installed daqd so that it properly linked against the updated mx install:
controls@fb /opt/rtcds/rtscore/release/src/daqd 0$ ./configure --enable-debug --disable-broadcast --without-myrinet --with-mx --with-epics=/opt/rtapps/epics/base --with-framecpp=/opt/rtapps/framecpp --enable-local-timing
...
controls@fb /opt/rtcds/rtscore/release/src/daqd 0$ make
...
controls@fb /opt/rtcds/rtscore/release/src/daqd 0$ install daqd /opt/rtcds/caltech/c1/target/fb/
It's now back to running and receiving data from the front ends (still not stable yet, though). |
11400 | Thu Jul 9 16:50:13 2015 | Jamie | Summary | CDS | CDS upgrade: if all else fails try throwing metal at the problem | I roped Rolf into coming over and adding his eyes to the problem. After much discussion we couldn't come up with any reasonable explanation for the problems we've been seeing other than daqd just needing a lot more resources than it did before. He said he had some old Sun SunFire X4600s from which we could pilfer memory. I went over to Downs and ripped all the CPU/memory cards out of one of his machines and stuffed them into fb:
fb now has 8 CPU and 16G of RAM
Unfortunately, this is still not enough. Or at least it didn't solve the problem; daqd is showing the same instabilities, falling over a couple of minutes after I turn on trend frame writing. As always, before daqd fails it starts spitting out the following to the logs:
[Thu Jul 9 16:37:09 2015] main profiler warning: 0 empty blocks in the buffer
followed by lines like:
[Thu Jul 9 16:37:27 2015] GPS MISS dcu 44 (ASX); dcu_gps=1120520264 gps=1120519812
right before it dies.
I'm no longer convinced that this is a resource issue, though, judging by the resource usage right before the crash:
top - 16:47:32 up 48 min, 5 users, load average: 0.91, 0.62, 0.61
Tasks: 2 total, 0 running, 2 sleeping, 0 stopped, 0 zombie
Cpu(s): 8.9%us, 0.9%sy, 0.0%ni, 89.1%id, 0.9%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 15952104k total, 13063468k used, 2888636k free, 138648k buffers
Swap: 1023996k total, 0k used, 1023996k free, 7672292k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12016 controls 20 0 8098m 4.4g 104m S 106 29.1 6:45.79 daqd
4953 controls 20 0 53580 6092 5096 S 0 0.0 0:00.04 nds
Load average less than 1 per CPU, plenty of free memory (~3G free, 0 swap), no waiting for IO (0.9%wa), etc. daqd is utilizing lots of threads, which should be spread across many cpus, so even the >100%CPU should be ok. I'm at a loss... |
11402 | Mon Jul 13 01:11:14 2015 | Jamie | Summary | CDS | CDS upgrade: current assessment | daqd is still behaving unstably. It's still unclear what the issue is.
The current failures look like disk IO contention. However, it's hard to see any evidence that daqd is suffering from large IO wait while it's failing.
The frame size itself is currently smaller than it was before the upgrade:
controls@fb /frames/full 0$ ls -alth 11190 | head
total 369G
drwxr-xr-x 321 controls controls 36K Jul 12 22:20 ..
drwxr-xr-x 2 controls controls 268K Jun 23 06:06 .
-rw-r--r-- 1 controls controls 67M Jun 23 06:06 C-R-1119099984-16.gwf
-rw-r--r-- 1 controls controls 68M Jun 23 06:06 C-R-1119099968-16.gwf
-rw-r--r-- 1 controls controls 69M Jun 23 06:05 C-R-1119099952-16.gwf
-rw-r--r-- 1 controls controls 69M Jun 23 06:05 C-R-1119099936-16.gwf
-rw-r--r-- 1 controls controls 67M Jun 23 06:05 C-R-1119099920-16.gwf
-rw-r--r-- 1 controls controls 68M Jun 23 06:05 C-R-1119099904-16.gwf
-rw-r--r-- 1 controls controls 68M Jun 23 06:04 C-R-1119099888-16.gwf
controls@fb /frames/full 0$ ls -alth 11208 | head
total 17G
drwxr-xr-x 2 controls controls 20K Jul 13 01:00 .
-rw-r--r-- 1 controls controls 45M Jul 13 01:00 C-R-1120809632-16.gwf
-rw-r--r-- 1 controls controls 50M Jul 13 01:00 C-R-1120809408-16.gwf
-rw-r--r-- 1 controls controls 50M Jul 13 00:56 C-R-1120809392-16.gwf
-rw-r--r-- 1 controls controls 50M Jul 13 00:56 C-R-1120809376-16.gwf
-rw-r--r-- 1 controls controls 50M Jul 13 00:56 C-R-1120809360-16.gwf
-rw-r--r-- 1 controls controls 50M Jul 13 00:55 C-R-1120809344-16.gwf
-rw-r--r-- 1 controls controls 50M Jul 13 00:55 C-R-1120809328-16.gwf
controls@fb /frames/full 0$
This would seem to indicate that it's not an increase in frame size that's to blame.
Because slow data is now transported to daqd over the MX data concentrator network rather than via EPICS (RTS 2.8), there is more traffic on the MX network. I note also that the channel lists have increased in size:
controls@fb /opt/rtcds/caltech/c1/chans/daq 0$ ls -alt archive/C1LSC* | head -20
-rw-r--r-- 1 4294967294 4294967294 262554 Jul 6 18:21 archive/C1LSC_150706_182146.ini
-rw-r--r-- 1 4294967294 4294967294 262554 Jul 6 18:16 archive/C1LSC_150706_181603.ini
-rw-r--r-- 1 4294967294 4294967294 262554 Jul 6 16:09 archive/C1LSC_150706_160946.ini
-rw-r--r-- 1 4294967294 4294967294 43366 Jul 1 16:05 archive/C1LSC_150701_160519.ini
-rw-r--r-- 1 4294967294 4294967294 43366 Jun 25 15:47 archive/C1LSC_150625_154739.ini
...
I would have thought, though, that data transmission errors would show up in the daqd status bits. |
11404 | Mon Jul 13 18:12:50 2015 | Jamie | Summary | CDS | CDS upgrade: left running in semi-stable configuration | I have been watching daqd all day and I don't feel particularly closer to understanding what the issues are. However, things are at least semi-stable for now.
Interestingly, though, the stability appears highly variable at the moment. This morning, daqd was very unstable and was crashing within a couple of minutes of starting. However this afternoon, things seemed much more stable. As of this moment, daqd has been running for 25 minutes now, writing full frames as well as minute and second trends (no minute_raw), without any issues. What has changed?
To reiterate, I have been closely watching disk IO to /frames. I see no indication that there is any disk contention while daqd is failing. It's still possible, though, that there are disk IO issues affecting daqd at a level that is not readily visible. From dstat, the frame writes are visible, but nothing else.
I have made one change that could be positively affecting things right now: I un-exported /frames from NFS. This eliminates anything external from reading /frames over the network. In particular, it also shuts off the transfer of frames to LDAS. Since I've done this, daqd has appeared to be more stable. It's NOT totally stable, though, as the instance that I described above did eventually just die after 43 minutes, as I was writing this.
In any event, as things are currently as stable as I've seen them, I'm leaving it running in this configuration for the moment, with the following relevant daqdrc parameters:
start main 16;
start frame-saver;
sync frame-saver;
start trender 60 60;
start trend-frame-saver;
sync trend-frame-saver;
start minute-trend-frame-saver;
sync minute-trend-frame-saver;
start profiler;
start trend profiler; |
11406 | Tue Jul 14 09:08:37 2015 | Jamie | Summary | CDS | CDS upgrade: left running in semi-stable configuration | Overnight daqd restarted itself only about twice an hour, which is an improvement:
controls@fb /opt/rtcds/caltech/c1/target/fb 0$ tail logs/restart.log
daqd: Tue Jul 14 03:13:50 PDT 2015
daqd: Tue Jul 14 04:01:39 PDT 2015
daqd: Tue Jul 14 04:09:57 PDT 2015
daqd: Tue Jul 14 05:02:46 PDT 2015
daqd: Tue Jul 14 06:01:57 PDT 2015
daqd: Tue Jul 14 06:43:18 PDT 2015
daqd: Tue Jul 14 07:02:19 PDT 2015
daqd: Tue Jul 14 07:58:16 PDT 2015
daqd: Tue Jul 14 08:02:44 PDT 2015
daqd: Tue Jul 14 09:02:24 PDT 2015
Un-exporting /frames might have helped a bit. However, the problem is obviously still not fixed. |
11412 | Tue Jul 14 16:51:01 2015 | Jamie | Summary | CDS | CDS upgrade: problem is not disk access | I think I have now determined once and for all that the daqd problems are NOT due to disk IO contention.
I have mounted a tmpfs at /frames/tmp and have told daqd to write frames there. The tmpfs exists entirely in RAM. There is essentially zero IO wait for such a filesystem, so daqd should never have trouble writing out the frames.
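For reference, the mount was something along these lines (the size here is a placeholder, not the value actually used):
sudo mount -t tmpfs -o size=8g tmpfs /frames/tmp
Since the filesystem is RAM-backed, frame writes to it never touch the disks at all.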
And yet daqd continues to fail with the "0 empty blocks in the buffer" warnings. I've been down a rabbit hole. |
11415 | Wed Jul 15 13:19:14 2015 | Jamie | Summary | CDS | CDS upgrade: reducing mx end-points as last ditch effort | I tried one last thing, suggested by Keith and Gerrit. I tried reducing the number of mx end-points on fb to one, which should reduce the total number of fb threads, in the hope that the extra threads were causing the chokes.
On Tue, Jul 14 2015, Keith Thorne <kthorne@ligo-la.caltech.edu> wrote:
> Assumptions
> 1) Before the upgrade (from RCG 2.6?), the DAQ had been working, reading out front-ends, writing frames trends
> 2) In upgrading to RCG 2.9, the mx start-up on the frame builder was modified to use multiple end-points
> (i.e. /etc/init.d/mx has a line like
> # 1 10G card - X2
> MX_MODULE_PARAMS="mx_max_instance=1 mx_max_endpoints=16 $MX_MODULE_PARAMS"
> (This can be confirmed by the daqd log file with lines at the top like
> 263596
> MX has 16 maximum end-points configured
> 2 MX NICs available
> [Fri Jul 10 16:12:50 2015] ->4: set thread_stack_size=10240
> [Fri Jul 10 16:12:50 2015] new threads will be created with the stack of size 10
> 240K
>
> If this is the case, the problem may be that the additional thread on the frame-builder (one per end-point) take up so many slots on the 8-core
> frame-builder that they interrupt the frame-writing thread, thus preventing the main buffer from being emptied.
>
> One could go back to a single end-point. This only helps keep restart of front-end A from hiccuping DAQ for front-end B.
>
> You would have to remove code on front-ends (/etc/init.d/mx_stream) that chooses endpoints. i.e.
> # find line number in rtsystab. Use that to mx_stream slot on card (0-15)
> line_num=`grep -v ^# /etc/rtsystab | grep --perl-regexp -n "^${hostname}\s" | sed 's/^\([0-9]*\):.*/\1/g'`
> line_off=$(expr $line_num - 1)
> epnum=$(expr $line_off % 2)
> cnum=$(expr $line_off / 2)
>
> start-stop-daemon --start --quiet -b -m --pidfile /var/log/mx_stream0.pid --exec /opt/rtcds/tst/x2/target/x2daqdc0/mx_stream -- -e 0 -r "$epnum" -W 0 -w 0 -s "$sys" -d x2daqdc0:$cnum -l /opt/rtcds/tst/x2/target/x2daqdc0/mx_stream_logs/$hostname.log
As per Keith's suggestion, I modified the mx startup script to only initialize a single endpoint, and I modified the mx_stream startup to point them all to endpoint 0. I verified that daqd was indeed using a single MX end-point:
MX has 1 maximum end-points configured
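(For reference, the relevant line in /etc/init.d/mx would look roughly like the one Keith quoted above, with the endpoint count dropped to 1:
MX_MODULE_PARAMS="mx_max_instance=1 mx_max_endpoints=1 $MX_MODULE_PARAMS"
and the front-end mx_stream invocations all pointed at endpoint 0, i.e. -e 0 -r 0.)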
It didn't help. After 5-10 minutes daqd crashes with the same "0 empty blocks" messages.
I should also mention that I'm pretty sure the start of these messages does not seem coincident with any frame writing to disk; further evidence that it's not a disk IO issue.
Keith is looking at the system now, to see if he can spot anything obvious. If not, I will start reverting to 2.5. |
11417 | Wed Jul 15 18:19:12 2015 | Jamie | Summary | CDS | CDS upgrade: tentative stability? | Keith Thorne provided his eyes on the situation today and had some suggestions that might have helped things.
Reorder ini file list in master file. Apparently the EDCU.ini file (C0EDCU.ini in our case), which describes EPICS subscriptions to be recorded by the daq, now has to be specified *after* all other front end ini files. It's unclear why, but it has something to do with RTS 2.8 which changed all slow channels to be transported over the mx network. This alone did not fix the problem, though.
Increase second trend frame size. Interestingly, this might have been the key. The second trend frame size was increased to 600 seconds:
start trender 600 60;
The two numbers are the lengths in seconds for the second and minute trends respectively. They had been set to "60 60", but Keith suggested that longer second trend frames are better, for whatever reason. It seems he may be right, given that daqd has been running and writing full and trend frames for 1.5 hours now without issue.
As I'm writing this, though, daqd just crashed again. I note, though, that it's right after the hour, and immediately following writing out a one-hour minute trend file. We've been seeing these on-the-hour crashes of daqd for quite a while now, so maybe this is nothing new. I've actually been wondering if the hourly daqd crashes were associated with writing out the minute trend frames, and I think we might have more evidence to point to that.
If increasing the size of the second trend frames from 60 seconds (35M) to 600 seconds (70M) made a difference in stability, could there be an issue with writing out files that are smaller than some size? The full frames are 60M, and the minute trends are 35M. |
11427 | Sat Jul 18 15:37:19 2015 | Jamie | Summary | CDS | CDS upgrade: current status | So it appears we have found a semi-stable configuration for the DAQ system post upgrade:

Here are the issues:
daqd
daqd is running mostly stably for the moment, although it still crashes at the top of every hour (see below). Here are some relevant points about the current configuration:
- recording data from only a subset of front-ends, to reduce the overall load:
- c1x01
- c1scx
- c1x02
- c1sus
- c1mcs
- c1pem
- c1x04
- c1lsc
- c1ass
- c1x05
- c1scy
- 16 second main buffer:
start main 16;
- trend lengths: second: 600, minute: 60
start trender 600 60;
- writing to frames:
- full
- second
- minute
- (NOT raw minute trends)
- frame compression ON
This eliminates most of the random daqd crashing. However, daqd still crashes at the top of every hour after writing out the minute trend frame. Still unclear what the issue is, but Keith is investigating. In some sense this is no worse than where we were before the upgrade, since daqd was also crashing hourly then. It's still crappy, though, so hopefully we'll figure something out.
The inittab on fb automatically restarts daqd after it crashes, and monit on all of the front ends automatically restarts the mx_stream processes.
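For reference, the respawn mechanism is just an /etc/inittab line on fb roughly like this (the id, runlevel, and exact invocation here are guesses, not copied from fb):
daqd:345:respawn:/opt/rtcds/caltech/c1/target/fb/daqd -c /opt/rtcds/caltech/c1/target/fb/daqdrc
so init brings daqd back whenever the process exits.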
front ends
The front end modules are mostly running fine.
One issue is that the execution times seem to have increased a bit, which is problematic for models that were already on the hairy edge. For instance, the rough average for c1sus has gone from ~48us to ~50us. This is most problematic for c1cal, which is now running at ~66us out of its 60us budget, which is obviously untenable. We'll need to reduce the load in c1cal somehow.
All other front end models seem to be working fine, but a full test is still needed.
There was an issue with the DACs on c1sus, but I rebooted and everything came up fine, optics are now damped:

|
11455 | Tue Jul 28 17:07:45 2015 | Jamie | Update | General | Data missing |
Quote: |
For the past couple of days, the summary pages have shown minute trend data disappear at 12:00 UTC (05:00 AM local time). This seems to be the case for all channels that we plot, see e.g. https://nodus.ligo.caltech.edu:30889/detcharsummary/day/20150724/ioo/. Using Dataviewer, Koji has checked that indeed the frames seem to have disappeared from disk. The data come back at 24 UTC (5pm local). Any ideas why this might be?

|
Possible explanations:
- The data transfers to LDAS had been shut off while we were doing the DAQ debugging. I don't know if they have been turned back on. Unlikely this is the problem since you would probably see no data at all if this were the case.
- wiper script parameters might have been changed to store less of the trend data for some reason.
- Frame size is different and therefore wiper script parameters need to be adjusted.
- Steve deleted it all.
- ...
|
13125 | Wed Jul 19 08:37:21 2017 | Jamie | Update | CDS | Update on front-end/DAQ rebuild | After the catastrophic fb disk failure last week we lost essentially the entire front end system (not any of the userapp code, but the front end boot server, operating system, and DAQ). The fb disk was entirely unrecoverable, so we've been trying to rebuild everything from the bits and pieces lying around, and some disks that Keith Thorne sent from LLO. We're trying to get the front ends working first, and will work on recovering daqd after.
Luckily, fb1, which was being configured as an fb replacement, is mostly fully configured, including having a copy of the front end diskless root image. We set up fb1 as the new boot server, and were able to get front ends booting again. Unfortunately, we've been having trouble running and building models, so something is still amiss. We've been taking a three-pronged approach to getting the front ends running:
- /diskless/root.fb: This involves booting the front ends from the backup of the diskless root from fb. Runs gentoo kernel 2.6.34.1. This should correspond to the environment that all models were built and running against. But something is missing in the configuration. The front ends were also mounting /opt from fb, which included the dolphin drivers, and we don't have a copy of that, so models aren't loading or recompiling.
- /diskless/root.x1boot: Keith sent a disk image of the entire x1boot server from LLO. It uses gentoo kernel 3.0.8. This ostensibly includes everything we should need to run the front ends, but it's unfortunately configured with newer versions of some of the software and also isn't loading our existing models or building new ones. This also seems to be having issues with the dolphin drivers.
- /diskless/root.jessie: This is an entirely new boot image built from scratch with Debian jessie, using an RTS-patched 3.2 kernel. This would use the latest versions of everything. It's mostly working; we just need to rebuild the dolphin driver from source.
It seems that in all cases we need to rebuild the dolphin drivers from source. |
13127 | Wed Jul 19 14:26:50 2017 | Jamie | Update | CDS | Update on front-end/DAQ rebuild |
Quote: |
After the catastrophic fb disk failure last week we lost essentially the entire front end system (not any of the userapp code, but the front end boot server, operating system, and DAQ). The fb disk was entirely unrecoverable, so we've been trying to rebuild everything from the bits and pieces lying around, and some disks that Keith Thorne sent from LLO. We're trying to get the front ends working first, and will work on recovering daqd after.
Luckily, fb1, which was being configured as an fb replacement, is mostly fully configured, including having a copy of the front end diskless root image. We setup fb1 as the new boot server, and were able to get front ends booting again. Unfortunately, we've been having trouble running and building models, so something is still amis. We've been taking a three-pronged approach to getting the front ends running:
- /diskless/root.fb: This involves booting the front ends from the backup of the diskless root from fb. Runs gentoo kernel 2.6.34.1. This should correspond to the environment that all models were built and running against. But something is missing in the configuration. The front ends were also mounting /opt from fb, which included the dolphin drivers, and we don't have a copy of that, so models aren't loading or recompiling.
- /diskless/root.x1boot: Keith sent a disk image of the entire x1boot server from LLO. It uses gentoo kernel 3.0.8. This ostensibly includes everything we should need to run the front ends, but it's unfortunately configured with newer versions of some of the software and also isn't loading our existing models or building new ones. This also seems to be having issues with the dolphin drivers.
- /diskless/root.jessie: This is an entirely new boot image build from scratch with Debian jessie, using an RTS-patched 3.2 kernel. This would use the latest versions of everything. It's mostly working, we just need to rebuild the dolphin driver and source.
It seems that in all cases we need to rebuild the dolphin drivers from source.
|
To clarify, we're able to boot the x1boot image with the existing 2.6.25 kernel that we have from fb. The issue with the root.x1boot image is not the kernel version but some of the other support libraries, such as dolphin. |
13130 | Fri Jul 21 18:03:17 2017 | Jamie | Update | CDS | Update on front-end/DAQ rebuild | Update:
- front ends booting with the new Debian jessie diskless root image and a linux 3.2 version of the RTS-patched kernel
- dolphin is configured correctly and running on c1lsc and c1sus
- models building and running with RCG 3.0.3
Up next:
- add c1ioo to the dolphin network
- recompile/restart all front end models
- daqd
I'll try to get the first two of those done tomorrow, although it's unclear what model updates we'll have to do to get things working with the newer RCG.
|
13132 | Sun Jul 23 15:00:28 2017 | Jamie | Omnistructure | VAC | strange sound around X arm vacuum pumps | While walking down to the X end to reset c1iscex I heard what I would call a "rhythmic squnching" sound coming from under the turbo pump. I would have said the sound was coming from a roughing pump, but none of them are on (as far as I can tell).
Steve maybe look into this?? |
13136 | Mon Jul 24 10:59:08 2017 | Jamie | Update | CDS | c1iscex models died |
Quote: |
This morning, all the c1iscex models were dead. Attachment #1 shows the state of the cds overview screen when I came in. The machine itself was ssh-able, so I just restarted all the models and they came back online without fuss.
|
This was me. I had rebooted that machine and hadn't restarted the models. Sorry for the confusion. |
13138 | Mon Jul 24 19:28:55 2017 | Jamie | Update | CDS | front end MX stream network working, glitches in c1ioo fixed | MX/OpenMX network running
Today I got the mx/open-mx networking working for the front ends. This required some tweaking to the network interface configuration for the diskless front ends, and recompiling mx and open-mx for the newer kernel. Again, this will all be documented.
controls@fb1:~ 0$ /opt/mx/bin/mx_info
MX Version: 1.2.16
MX Build: root@fb1:/opt/src/mx-1.2.16 Mon Jul 24 11:33:57 PDT 2017
1 Myrinet board installed.
The MX driver is configured to support a maximum of:
8 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
===================================================================
Instance #0: 364.4 MHz LANai, PCI-E x8, 2 MB SRAM, on NUMA node 0
Status: Running, P0: Link Up
Network: Ethernet 10G
MAC Address: 00:60:dd:43:74:62
Product code: 10G-PCIE-8B-S
Part number: 09-04228
Serial number: 485052
Mapper: 00:60:dd:43:74:62, version = 0x00000000, configured
Mapped hosts: 6
ROUTE COUNT
INDEX MAC ADDRESS HOST NAME P0
----- ----------- --------- ---
0) 00:60:dd:43:74:62 fb1:0 1,0
1) 00:30:48:be:11:5d c1iscex:0 1,0
2) 00:30:48:bf:69:4f c1lsc:0 1,0
3) 00:25:90:0d:75:bb c1sus:0 1,0
4) 00:30:48:d6:11:17 c1iscey:0 1,0
5) 00:14:4f:40:64:25 c1ioo:0 1,0
controls@fb1:~ 0$
c1ioo timing glitches fixed
I also checked the BIOS on c1ioo and found that the serial port was enabled, which is known to cause timing glitches. I turned off the serial port (and some power management stuff), and rebooted, and all the c1ioo timing glitches seem to have gone away.
It's unclear why this is a problem that's just showing up now. Serial ports have always been a problem, so it seems unlikely this is just a problem with the newer kernel. Could the BIOS have somehow been reset during the power glitch?
In any event, all the front ends are now booting cleanly, with all dolphin and mx networking coming up automatically, and all models running stably:

Now for daqd... |
13145 | Wed Jul 26 19:13:07 2017 | Jamie | Update | CDS | daqd showing same instability as before | I recompiled daqd on the updated fb1, similar to how I had before, and we're seeing the same instability: the process crashes when it tries to write out the second trend (technically it looks like it crashes while trying to write out the full frame while the second trend is also being written out). Jonathan Hanks and I are actively looking into it and I'll provide a further report soon. |
13149 | Fri Jul 28 20:22:41 2017 | Jamie | Update | CDS | possible stable daqd configuration with separate DC and FW | This week Jonathan Hanks and I have been trying to diagnose why daqd has been unstable in the configuration used by the 40m, with the data concentrator (dc) and frame writer (fw) in the same process (referred to generically as 'fb'). Jonathan has been digging into the core dumps and source to try to figure out what's going on, but he hasn't come up with anything concrete yet.
As an alternative, we've started experimenting with a daqd configuration with the dc and fw components running in separate processes, with communication over the local loopback interface. The separate dc/fw process model more closely matches the configuration at the sites, although the sites put the dc and fw processes on different physical machines. Our experimentation thus far seems to indicate that this configuration is stable, although we haven't yet tested it with the full configuration, which is what I'm attempting to do now.
Unfortunately I'm having trouble with the mx_stream communication between the front ends and the dc process. The dc does not appear to be receiving the streams from the front ends and is producing a '0xbad' status message for each. I'm investigating. |
13153 | Mon Jul 31 18:44:40 2017 | Jamie | Update | CDS | CDS system essentially fully recovered | The CDS system is mostly fully recovered at this point. The mx_streams are all flowing from all front ends, and from all models, and the daqd processes are receiving them and writing the data to frames:

Remaining unresolved issues:
- IFO needs to be fully locked to make sure ALL components of all models are working.
- The remaining red status lights are from the "FB NET" diagnostics, which are reflecting a missing status bit from the front end processes due to the fact that they were compiled with an earlier RCG version (3.0.3) than the mx_streams were (3.3+/trunk). There will be a new release of the RTS soon, at which point we'll compile everything from the same version, which should get us all green again.
- The entire system has been fully modernized, to the target CDS reference OS (Debian jessie) and more recent RCG versions. The management of the various RTS components, both on the front ends and on fb, have as much as possible been updated to use the modern management tools (e.g. systemd, udev, etc.). These changes need to be documented. In particular...
- The fb daqd process has been split into three separate components, a configuration that mirrors what is done at the sites and appears to be more stable:
- daqd_dc: data concentrator (receives data from front ends)
- daqd_fw: receives frames from dc and writes out full frames and second/minute trends
- daqd_rcv: NDS1 server (raises test points and receives archive data from frames from 'nds' process)
The "target" directory for all of these new components is:
- /opt/rtcds/caltech/c1/target/daqd
All of these processes are now managed under systemd supervision on fb, meaning the daqd restart procedure has changed. This needs to be simplified and clarified.
- Second trend frames are being written, but for some reason they're not accessible over NDS.
- Have not had a chance to verify minute trend and raw minute trend writing yet. Needs to be confirmed.
- Get wiper script working on new fb.
- Front end RTS kernel will occasionally crash when the RTS modules are unloaded. Keith Thorne apparently has a kernel version with a different set of patches from Gerrit Kuhn that does not have this problem. Keith's kernel needs to be packaged and installed in the front end diskless root.
- The models accessing the dolphin shared memory will ALL crash when one of the front end hosts on the dolphin network goes away. This results in a boot fest of all the dolphin-enabled hosts. Need to figure out what's going on there.
- The RCG settings snapshotting has changed significantly in later RCG versions. We need to make sure that all burt backup type stuff is still working correctly.
- Restoration of /frames from old fb SCSI RAID?
- Backup of entirety of fb1, including fb1 root (/) and front end diskless root (/diskless)
- Full documentation of rebuild procedure from Jamie's notes.
|
13164 | Thu Aug 3 19:46:27 2017 | Jamie | Update | CDS | new daqd restart procedure | This is the daqd restart procedure:
$ ssh fb1 sudo systemctl restart daqd_*
That will restart all of the daqd services (daqd_dc, daqd_fw, daqd_rcv).
The front end mx_stream processes should all auto-restart after the daqd_dc comes back up. If they don't (models show "0x2bad" on DC0_*_STATUS) then you can execute the following to restart the mx_stream process on the front end:
$ ssh c1<host> sudo systemctl restart mx_stream
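If something still doesn't look right, checking the unit states directly is a reasonable first step (a sketch; same unit names as above):
$ ssh fb1 sudo systemctl status daqd_*
$ ssh c1<host> sudo systemctl status mx_stream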
|
13165 | Thu Aug 3 20:15:11 2017 | Jamie | Update | CDS | dataviewer can not raise test points | For some reason dataviewer is not able to raise test points with the new daqd setup, even though dtt can. If you raise a test point with dtt then dataviewer can show the data fine.
It's unclear to me why this would be the case. It might be that all the versions of dataviewer on the workstations are too old?? I'll look into it tomorrow to see if I can figure out what's going on. |
13198 | Fri Aug 11 19:34:49 2017 | Jamie | Update | CDS | CDS final bits status update | So it appears we now have full frames and second, minute, and minute_raw trends.
We are still not able to raise test points with daqd_rcv (e.g. the NDS1 server), which is why dataviewer and nds2-client can't get test points on their own.
We were not able to add the EDCU (EPICS client) channels without daqd_fw crashing.
We have a new kernel image that's supposed to solve the module unload instability issue. In order to try it we'll need to restart the entire system, though, so I'll do that on Monday morning.
I've got the CDS guys investigating the test point and EDCU issues, but we won't get any action on that until next week.
Quote: |
Remaining unresolved issues:
- IFO needs to be fully locked to make sure ALL components of all models are working.
- The remaining red status lights are from the "FB NET" diagnostics, which are reflecting a missing status bit from the front end processes due to the fact that they were compiled with an earlier RCG version (3.0.3) than the mx_streams were (3.3+/trunk). There will be a new release of the RTS soon, at which point we'll compile everything from the same version, which should get us all green again.
- The entire system has been fully modernized, to the target CDS reference OS (Debian jessie) and more recent RCG versions. The management of the various RTS components, both on the front ends and on fb, have as much as possible been updated to use the modern management tools (e.g. systemd, udev, etc.). These changes need to be documented. In particular...
- The fb daqd process has been split into three separate components, a configuration that mirrors what is done at the sites and appears to be more stable:
- daqd_dc: data concentrator (receives data from front ends)
- daqd_fw: receives frames from dc and writes out full frames and second/minute trends
- daqd_rcv: NDS1 server (raises test points and receives archive data from frames from 'nds' process)
The "target" directory for all of these new components is:
- /opt/rtcds/caltech/c1/target/daqd
All of these processes are now managed under systemd supervision on fb, meaning the daqd restart procedure has changed. This needs to be simplified and clarified.
- Second trend frames are being written, but for some reason they're not accessible over NDS.
- Have not had a chance to verify minute trend and raw minute trend writing yet. Needs to be confirmed.
- Get wiper script working on new fb.
- Front end RTS kernel will occaissionally crash when the RTS modules are unloaded. Keith Thorne apparently has a kernel version with a different set of patches from Gerrit Kuhn that does not have this problem. Keith's kernel needs to be packaged and installed in the front end diskless root.
- The models accessing the dolphin shared memory will ALL crash when one of the front end hosts on the dolphin network goes away. This results in a boot fest of all the dolphin-enabled hosts. Need to figure out what's going on there.
- The RCG settings snapshotting has changed significantly in later RCG versions. We need to make sure that all burt backup type stuff is still working correctly.
- Restoration of /frames from old fb SCSI RAID?
- Backup of entirety of fb1, including fb1 root (/) and front end diskless root (/diskless)
- Full documentation of rebuild procedure from Jamie's notes.
|
|
13205 | Mon Aug 14 19:41:46 2017 | Jamie | Update | CDS | front-end/DAQ network down for kernel upgrade, and timing errors | I'm upgrading the linux kernel for all the front ends to one that is supposedly more stable and won't freeze when we unload RTS models (linux-image-3.2.88-csp). Since it's a different kernel version it requires rebuilds of all kernel-related support stuff (mbuf, symmetricom, mx, open-mx, dolphin) and all the front end models. All the support stuff has been upgraded, but we're now waiting on the front end rebuilds, which takes a while.
Initial testing indicates that the kernel is more stable; we're mostly able to unload/reload RTS modules without the kernel freezing. However, the c1iscey host seems to be oddly problematic and has frozen twice so far on module unloads. None of the other hosts have frozen on unload (yet), though, so still not clear.
We're now seeing some timing errors between the front ends and daqd, resulting in a "0x4000" status message in the 'C1:DAQ-DC0_*_STATUS' channels. Part of the problem was an issue with the IRIG-B/GPS receiver timing unit, which I'll log in a separate post. Another part of the problem was a bug in the symmetricom driver, which has been resolved. That wasn't the whole problem, though, since we're still seeing timing errors. Working with Jonathan to resolve. |
13215 | Wed Aug 16 17:05:53 2017 | Jamie | Update | CDS | front-end/DAQ network down for kernel upgrade, and timing errors | The CDS system has now been moved to a supposedly more stable real-time-patched linux kernel (3.2.88-csp) and RCG r4447 (roughly the head of trunk, intended to be release 3.4). With one major and one minor exception, everything seems to be working:

The remaining issues are:
- RFM network down. The IOP models on all hosts on the RFM network are not detecting their RFM cards. Keith Thorne thinks that this is because of changes in trunk to support the new long-range PCIe that will be used at the sites, and that we just need to add a new parameter to the cdsParameters block in models that use RFM. He and Rolf are looking into it for us.
- The 3.2.88-csp kernel is still not totally stable. On most hosts (c1sus, c1ioo, c1iscex) it seems totally fine and we're able to load/unload models without issue. c1iscey is definitely problematic, frequently freezing on module unload. There must be a hardware/bios issue involved here. c1lsc has also shown some problems. A better kernel is supposedly in the works.
- NDS clients other than DTT are still unable to raise test points. This appears to be an issue with the daqd_rcv component (i.e. NDS server) not properly resolving the front ends in the GDS network. Still looking into this with Keith, Rolf, and Jonathan.
Issues that have been fixed:
- "EDCU" channels, i.e. non-front-end EPICS channels, are now being acquired properly by the DAQ. The front-ends now send all slow channels to the daq over the MX network stream. This means that front end channels should no longer be specified in the EDCU ini file. There were a couple in there that I removed, and that seemed to fix that issue.
- Data should now be recorded in all formats: full frames, as well as second, minute, and raw_minute trends
- All FE and DAQD diagnostics are green (other than the ones indicating the problems with the RFM network). This was fixed by getting the front ends models, mx_stream processes, and daqd processes all compiled against the same version of the advLigoRTS, and adding the appropriate command line parameters to the mx_stream processes.
|
13217 | Wed Aug 16 18:01:28 2017 | Jamie | Update | CDS | front-end/DAQ network down for kernel upgrade, and timing errors |
Quote: |
What's the current backup situation?
|
Good question. We need to figure something out. fb1 root is on a RAID1, so there is one layer of safety. But we absolutely need a full backup of the fb1 root filesystem. I don't have any great suggestions, other than just getting an external disk, 1T or so, and just copying all of root (minus NFS mounts). |
13219 | Wed Aug 16 18:50:58 2017 | Jamie | Update | CDS | front-end/DAQ network down for kernel upgrade, and timing errors |
Quote: |
The remaining issues are:
- RFM network down. The IOP models on all hosts on the RFM network are not detecting their RFM cards. Keith Thorne thinks that this is because of changes in trunk to support the new long-range PCIe that will be used at the sites, and that we just need to add a new parameter to the cdsParameters block in models that use RFM. Him and Rolf are looking into for us.
|
RFM network is back! Everything green again.

Use of RFM has been turned off in advLigoRTS trunk in favor of the new long-range PCIe networking being developed for the sites. Rolf provided a single-line patch that re-enables it:
controls@c1sus:/opt/rtcds/rtscore/trunk 0$ svn diff
Index: src/epics/util/feCodeGen.pl
===================================================================
--- src/epics/util/feCodeGen.pl (revision 4447)
+++ src/epics/util/feCodeGen.pl (working copy)
@@ -122,7 +122,7 @@
$diagTest = -1;
$flipSignals = 0;
$virtualiop = 0;
-$rfm_via_pcie = 1;
+$rfm_via_pcie = 0;
$edcu = 0;
$casdf = 0;
$globalsdf = 0;
controls@c1sus:/opt/rtcds/rtscore/trunk 0$
This patch was applied to the RTS source checkout we're using for the FE builds (/opt/rtcds/rtscore/trunk, which is r4447, and is linked to /opt/rtcds/rtscore/release). The following models that use RFM were re-compiled, re-installed, and re-started (the per-model commands are sketched after the list):
- c1x02
- c1rfm
- c1x03
- c1als
- c1x01
- c1scx
- c1asx
- c1x05
- c1scy
- c1tst
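For each of these the procedure was the usual build/install/restart cycle, roughly as follows (shown for c1rfm; exact invocation from memory, so treat it as a sketch):
  rtcds build c1rfm
  rtcds install c1rfm
  rtcds restart c1rfm
Restarting an IOP (c1x0*) also takes down the user models on that host, which then have to be restarted as well.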
The re-compiled models now see the RFM cards (dmesg log from c1ioo):
[24052.203469] c1x03: Total of 4 I/O modules found and mapped
[24052.203471] c1x03: ***************************************************************************
[24052.203473] c1x03: 1 RFM cards found
[24052.203474] c1x03: RFM 0 is a VMIC_5565 module with Node ID 180
[24052.203476] c1x03: address is 0xffffc90021000000
[24052.203478] c1x03: ***************************************************************************
This cleared up all RFM transmission error messages.
CDS upstream are working to make this RFM usage switchable in a reasonable way. |
13258
|
Mon Aug 28 08:47:32 2017 |
Jamie | Summary | LSC | First cavity length reconstruction with a neural network |
truly. |
15742
|
Mon Dec 21 09:28:50 2020 |
Jamie | Configuration | CDS | Updated CDS upgrade plan |
Quote: |
Attached is the layout for the "intermediate" CDS upgrade option, as was discussed on Wednesday. Under this plan:
- Existing FEs stay where they are (they are not moved to a single rack)
- Dolphin IPC remains PCIe Gen 1
- RFM network is entirely replaced with Dolphin IPC
Please send me any omissions or corrections to the layout.
|
I just want to point out that if you move all the FEs to the same rack they can all be connected to the Dolphin switch via copper, and you would only have to string a single fiber to each IO rack, rather than the multiple fibers needed now (for network, Dolphin, timing, etc.). |
16299
|
Wed Aug 25 18:20:21 2021 |
Jamie | Update | CDS | GPS time on fb1 fixed, daqd writing correct frames again | I have no idea what happened to the GPS timing on fb1, but it seems like the issue was coincident with the power glitch on Monday.
As was noted by Koji above, the GPS time kernel interface was off by a year, which was causing the frame builder to write out files with the wrong names. fb1 was using DAQD components from the advligorts 3.3 release, which used the old "symmetricom" kernel module for the GPS time. This old module was also known to have issues with time offsets. This issue is reminiscent of previous timing issues with the DAQ on fb1.
I noted that a newer version of advligorts, version 3.4, was available for Debian jessie, the system running on fb1. advligorts 3.4 includes a newer version of the GPS time module, renamed gpstime. I checked with Jonathan Hanks that the interfaces did not change between 3.3 and 3.4, and that 3.4 was mostly a bug-fix and packaging release, so I decided to upgrade the DAQ to get the new components. I therefore did the following:
- updated the archive info in /etc/apt/sources.list.d/cdssoft.list, and added the "jessie-restricted" archive which includes the mx packages: https://git.ligo.org/cds-packaging/docs/-/wikis/home
- removed the symmetricom module from the kernel:
  sudo rmmod symmetricom
- upgraded the advligorts-daqd components (NOTE I did not upgrade the rest of the system, although there are outstanding security upgrades needed):
  sudo apt install advligorts-daqd advligorts-daqd-dc-mx
- loaded the new gpstime module and checked that the GPS time was correct:
  sudo modprobe gpstime
- restarted all the daqd processes:
  sudo systemctl restart daqd_*
Everything came up fine at that point, and I checked that the correct frames were being written out. |
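For the record, the sanity checks were along these lines (paths and frame-file naming from memory, so treat this as a sketch):
  cat /proc/gps                          # GPS time as reported by the new gpstime module
  ls -lt /frames/full/ | head            # newest frame directory should match the current GPS time
  ls -lt /frames/full/<newest>/ | head   # frame files (C-R-<gps>-16.gwf) should carry current GPS times in their names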
16302
|
Thu Aug 26 10:30:14 2021 |
Jamie | Configuration | CDS | front end time synchronization fixed? | I've been looking at why the front end NTP time synchronization did not seem to be working. I think it might not have been working because the NTP server the front ends were pointing to, fb1, was not actually responding to synchronization requests.
I cleaned up some things on fb1 and the front ends, which I think unstuck things.
On fb1:
- stopped/disabled the default client (systemd-timesyncd), and properly installed the full NTP server (ntp)
- the ntp server package for Debian jessie is old-style SysVinit, not systemd. In order to make it more integrated I copied the auto-generated service file to /etc/systemd/system/ntp.service, and added an "[Install]" section that specifies that it should be available during the default "multi-user.target".
- "enabled" the new service to auto-start at boot ("sudo systemctl enable ntp.service")
- made sure ntp was configured to serve the front end network ('broadcast 192.168.123.255') and then restarted the server ("sudo systemctl restart ntp.service")
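For reference, the relevant lines in /etc/ntp.conf end up looking roughly like this (not a verbatim copy of the file; the upstream server line in particular is just an example):
  server 0.debian.pool.ntp.org iburst          # upstream source, whatever is reachable from fb1
  broadcast 192.168.123.255                    # serve time to the front-end subnet by broadcast
  restrict 192.168.123.0 mask 255.255.255.0 nomodify notrap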
For the front ends:
- on fb1 I chroot'd into the front-end diskless root (/diskless/root) and manually specified that systemd-timesyncd should start on boot by creating a symlink to the timesyncd service in the multi-user.target directory:
$ sudo chroot /diskless/root
$ cd /etc/systemd/system/multi-user.target.wants
$ ln -s /lib/systemd/system/systemd-timesyncd.service
- on the front end itself (c1iscex as a test) I did a "systemctl daemon-reload" to force it to reload the systemd config, and then restarted the client ("systemctl restart systemd-timesyncd")
- checked the NTP synchronization with timedatectl:
controls@c1iscex:~ 0$ timedatectl
Local time: Thu 2021-08-26 11:35:10 PDT
Universal time: Thu 2021-08-26 18:35:10 UTC
RTC time: Thu 2021-08-26 18:35:10
Time zone: America/Los_Angeles (PDT, -0700)
NTP enabled: yes
NTP synchronized: yes
RTC in local TZ: no
DST active: yes
Last DST change: DST began at
Sun 2021-03-14 01:59:59 PST
Sun 2021-03-14 03:00:00 PDT
Next DST change: DST ends (the clock jumps one hour backwards) at
Sun 2021-11-07 01:59:59 PDT
Sun 2021-11-07 01:00:00 PST
controls@c1iscex:~ 0$
Note that it is now reporting "NTP enabled: yes" (the service is enabled to start at boot) and "NTP synchronized: yes" (synchronization is happening), neither of which it was reporting previously. I also note that the systemd-timesyncd client service is now loaded and enabled, is no longer reporting that it is in an "Idle" state and is in fact reporting that it synchronized to the proper server, and it is logging updates:
controls@c1iscex:~ 0$ sudo systemctl status systemd-timesyncd
● systemd-timesyncd.service - Network Time Synchronization
Loaded: loaded (/lib/systemd/system/systemd-timesyncd.service; enabled)
Active: active (running) since Thu 2021-08-26 10:20:11 PDT; 1h 22min ago
Docs: man:systemd-timesyncd.service(8)
Main PID: 2918 (systemd-timesyn)
Status: "Using Time Server 192.168.113.201:123 (ntpserver)."
CGroup: /system.slice/systemd-timesyncd.service
└─2918 /lib/systemd/systemd-timesyncd
Aug 26 10:20:11 c1iscex systemd[1]: Started Network Time Synchronization.
Aug 26 10:20:11 c1iscex systemd-timesyncd[2918]: Using NTP server 192.168.113.201:123 (ntpserver).
Aug 26 10:20:11 c1iscex systemd-timesyncd[2918]: interval/delta/delay/jitter/drift 64s/+0.000s/0.000s/0.000s/+26ppm
Aug 26 10:21:15 c1iscex systemd-timesyncd[2918]: interval/delta/delay/jitter/drift 128s/-0.000s/0.000s/0.000s/+25ppm
Aug 26 10:23:23 c1iscex systemd-timesyncd[2918]: interval/delta/delay/jitter/drift 256s/+0.001s/0.000s/0.000s/+26ppm
Aug 26 10:27:40 c1iscex systemd-timesyncd[2918]: interval/delta/delay/jitter/drift 512s/+0.003s/0.000s/0.001s/+29ppm
Aug 26 10:36:12 c1iscex systemd-timesyncd[2918]: interval/delta/delay/jitter/drift 1024s/+0.008s/0.000s/0.003s/+33ppm
Aug 26 10:53:16 c1iscex systemd-timesyncd[2918]: interval/delta/delay/jitter/drift 2048s/-0.026s/0.000s/0.010s/+27ppm
Aug 26 11:27:24 c1iscex systemd-timesyncd[2918]: interval/delta/delay/jitter/drift 2048s/+0.009s/0.000s/0.011s/+29ppm
controls@c1iscex:~ 0$
So I think this means everything is working.
I then went ahead and reloaded and restarted the timesyncd services on the rest of the front ends.
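i.e. something like the following from a workstation (host list and passwordless ssh/sudo assumed, so adjust to taste):
  for fe in c1sus c1ioo c1lsc c1iscex c1iscey; do
      ssh $fe "sudo systemctl daemon-reload && sudo systemctl restart systemd-timesyncd"
  done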
We still need to confirm that everything comes up properly the next time we have an opportunity to reboot fb1 and the front ends (or the opportunity is forced upon us).
There was speculation that the NTP clients on the front ends (systemd-timesyncd) would not work on a read-only filesystem, but this doesn't seem to be true. You can't trust everything you read on the internet. |
17109
|
Sun Aug 28 23:14:22 2022 |
Jamie | Update | Computers | rack reshuffle proposal for CDS upgrade | @tega This looks great, thank you for putting this together. The rack drawing in particular is great. Two notes:
- In "1X6 - proposed" I would move the "PEM AA + ADC Adapter" down lower in the rack, maybe where "Old FB + JetStor" are, after removing those units since they're no longer needed. That would keep all the timing stuff together at the top without any other random stuff in between them. If we can't yet remove Old FB and the JetStor then I would move the VME GPS/Timing chassis up a couple units to make room for the PEM module between the VME chassis and FB1.
- We'll eventually want to move FB1 and Megatron into 1X7, since it seems like there will be room there. That will put all the computers into one rack, which will be very nice. FB1 should be on the KVM switch as well.
I think most of this work can be done with very little downtime. |
17379
|
Tue Jan 3 12:10:46 2023 |
Jamie | Update | CDS | Yearly DAQD fix 2023, did not work :( | This whole procedure is no longer needed after the CDS upgrade. We don't do things in the old hacky way anymore. The gpstime kernel module does not need to be hacked anymore, and it *should* keep up with leap seconds properly...
That said, there was an issue that @christopher.wipf id'd properly:
- The front end models were not getting the correct GPS time because the offset on their timing cards was off by a year. This is because the timing cards do not get the year info from IRIG-B, so the offset they had when the models were loaded in 2022 was no longer valid as soon as the year rolled over to 2023.
- This meant that the data sent from the front ends to the DAQ was timestamped a year in the past, and the DAQ receiver process (cps_recv) was silently dropping the packets because the time stamps were too far off.
So to resolve the issue the /usr/share/advligorts/calc_gps_offset.py program needed to be run on the front ends, and all the models needed to be restarted (or all the front ends rebooted). Once that happens the data should start flowing properly.
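In other words, something like this on each front end (a sketch only: I'm not sure off-hand whether the script needs root, I may be misremembering the exact rtcds syntax, and rebooting the machine is the blunt equivalent of the restart step):
  ssh c1iscex
  sudo /usr/share/advligorts/calc_gps_offset.py   # recompute the timing-card GPS offset for the new year
  rtcds restart --all                             # restart every model on the host (or just reboot)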
NOTE the DAQD no longer uses the gpstime kernel module, so the fact that the gpstime module was reporting the wrong time on fb1 did not affect anything.
Anchal restarted the front ends, which fixed the issue.
We got new timing hardware from Keith at LLO which will hopefully resolve the year issue (if it's ever installed) so that we won't have to go through these shenanigans again when the year rolls over. The sites don't have to deal with this anymore.
Issues reported upstream:
- better logging from the cps_recv process to report why there is no data coming through
- fix packaging for advligorts-daqd to not depend on the gpstime kernel module
Also note that the code in /opt/rtcds/rtscore is completely obsolete and not used by anything; that whole directory should be purged. |
7173
|
Tue Aug 14 11:33:14 2012 |
Jamie Alex Den | Update | CDS | AI and AA filters | When signals are transmitted between the models running at different rates, no AI or AA filters are automatically applied. We need to fix our models.

|
10188
|
Fri Jul 11 22:02:52 2014 |
Jamie, Chris | Omnistructure | CDS | cdsutils: multifarious upgrades | To make the latest cdsutils available in the control room, we've done the following:
Upgrade pianosa to Ubuntu 12 (cdsutils depends on python2.7, not found in the previous release)
- Enable distribution upgrades in the Ubuntu Software Center prefs
- Check for updates in the Update Manager and click the big "Upgrade" button
Note that rossa remains on Ubuntu 10 for now.
Upgrade cdsutils to r260
- Instructions here
- cdsutils-238 was left as the default pointed to by the cdsutils symlink, for rossa's sake
Built and installed the nds2-client (a cdsutils dependency)
- Checked out the source tree from svn into /ligo/svncommon/nds2
- Built tags/nds2_client_0_10_5 (install instructions are here; build dependencies were installed by apt-get on chiara)
- ./configure --prefix=/ligo/apps/ubuntu12/nds2-client-0.10.5; make; make install
- In /ligo/apps/ubuntu12: ln -s nds2-client-0.10.5 nds2-client
nds2-client was apparently installed locally as a deb in the past, but the version in lscsoft seems broken currently (unknown symbols?). We should revisit this.
Built and installed pyepics (a cdsutils dependency)
- Download link to ~/src on chiara
- python setup.py build; python setup.py install --prefix=/ligo/apps/ubuntu12/pyepics-3.2.3
- In /ligo/apps/ubuntu12: ln -s pyepics-3.2.3 pyepics
pyepics was also installed as a deb before; we should revisit this when Jamie gets back.
Added the gqrx ppa and installed gnuradio (dependency of the waterfall plotter)
Added a test in /ligo/apps/ligoapps-user-env.sh to load the new cdsutils only on Ubuntu 12.
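The test is just a release check in the shared login script, roughly like the following (paths and the version number are illustrative, not copied from the actual file):
  # in /ligo/apps/ligoapps-user-env.sh
  if [ "$(lsb_release -rs 2>/dev/null)" = "12.04" ]; then
      export PATH=/ligo/apps/ubuntu12/cdsutils-260/bin:${PATH}
      export PYTHONPATH=/ligo/apps/ubuntu12/cdsutils-260/lib/python2.7/site-packages:${PYTHONPATH}
  fi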
The end result:
controls@chiara|~ > z
usage: cdsutils
Advanced LIGO Control Room Utilites
Available commands:
read read EPICS channel value
write write EPICS channel value
switch switch buttons in standard LIGO filter module
avg average one or more NDS channels for a specified amount of seconds
servo servos channel with a simple integrator (pole at zero)
trigservo servos channel with a simple integrator (pole at zero)
audio Play channel as audio stream.
dv Plot time series of channels from NDS.
water Live waterfall plotter for LIGO data
version print version info and exit
help this help
Add '-h' after individual commands for command help.
|
13207
|
Mon Aug 14 20:12:09 2017 |
Jamie, Gautam | Update | CDS | Weird problem with GPS receiver | Today we saw a weird issue with the GPS receiver (EndRun Technologies Tempus LX). GPS timing on fb1 (which is handled via an IRIG-B connection to the receiver with a Spectracom card) was off by +18 seconds, which is exactly the current GPS-UTC leap-second offset. We tried resetting the GPS receiver and it still came up with the +18 second offset. To be clear, the GPS receiver unit itself was showing a time on its front panel that looked close enough to 24-hour UTC, but was off by +18s. The time also said "GPS" vertically to the right of the time.
We started exploring the settings on the GPS receiver and found this menu item:
Clock -> "Time Mode" -> "UTC"/"GPS"/"Local"
The setting when we found it was "GPS", which seems logical enough. However, when we switched it to "UTC" the time as shown on the front panel was correct, now with "UTC" vertically to the right of the time, and fb1 was then showing the correct GPS time.
From the manual:
Time Mode
Time mode defines the time format used for the front-panel time display and, if installed, the optional
time code or Serial Time output. The time mode does not affect the NTP output, which is always
UTC. Possible values for the time mode are GPS, UTC, and local time. GPS time is derived from
the GPS satellite system. UTC is GPS time minus the current leap second correction. Local time is
UTC plus local offset and Daylight Savings Time. The local offset and daylight savings time displays
are described below.
The fact that moving to "UTC" fixed the problem, even though that is supposed to remove the leap second correction, might indicate that there's another bug in the symmetricom driver... |
|