ID | Date | Author | Type | Category | Subject |
8656
|
Thu May 30 11:28:34 2013 |
Jamie | Configuration | CDS | c1als model cleanup | The c1als model was pulling out some ADC0 connections that were no longer used for anything:
- ADC_0_1 --> sfm "FD" --> IPC "C1:ALS-SCX_FD"
- ADC_0_5 --> sfm "OCX" --> term
- ADC_0_6 --> sfm "ADC" --> term
The channels would have shown up as C1:ALS-FD, C1:ALS-OCX, C1:ALS-ADC. The IPC connection that presumably was meant to go to c1scx is not connected on the other end.
I removed all this stuff from the model and rebuilt/restarted. |
8657
|
Thu May 30 11:33:26 2013 |
Jamie | Configuration | Computer Scripts / Programs | ASS medm/model changes need to be committed to SVN | There are a lot of changes to the ASS stuff that have not been committed to the SVN:
controls@rossa:/opt/rtcds/userapps/release/isc/c1 0$ svn status | grep -v '?'
M medm/c1als/C1ALS_X_SLOW.adl
D medm/c1ass/C1ASS_TRY_YAW_LOCKIN.adl
D medm/c1ass/ASS_SERVOS.adl
D medm/c1ass/ctrl_yaw_mtrx.adl
D medm/c1ass/C1ASS_QPDS.adl
D medm/c1ass/C1ASS_SEN_YAW_MTRX.adl
M medm/c1ass/C1ASS_XARM_SEN_MTRX.adl
D medm/c1ass/SITEMODEL_LOCKINNAME.adl
D medm/c1ass/C1ASS_TRX_YAW_LOCKIN.adl
D medm/c1ass/C1ASS_LOCKIN1.adl
D medm/c1ass/C1ASS_LOCKIN2.adl
D medm/c1ass/C1ASS_LOCKIN3.adl
D medm/c1ass/C1ASS_LOCKIN4.adl
D medm/c1ass/C1ASS_LOCKIN5.adl
D medm/c1ass/C1ASS_LOCKIN6.adl
D medm/c1ass/C1ASS_LOCKIN7.adl
D medm/c1ass/C1ASS_LOCKIN8.adl
D medm/c1ass/C1ASS_LOCKIN9.adl
D medm/c1ass/C1ASS_REFL11I_PIT_LOCKIN.adl
M medm/c1ass/C1ASS.adl
D medm/c1ass/C1ASS_LOCKIN10.adl
D medm/c1ass/C1ASS_LOCKIN11.adl
D medm/c1ass/C1ASS_LOCKIN12.adl
D medm/c1ass/C1ASS_LOCKIN13.adl
D medm/c1ass/C1ASS_LOCKIN14.adl
D medm/c1ass/C1ASS_LOCKIN15.adl
D medm/c1ass/sen_yaw_mtrx.adl
D medm/c1ass/C1ASS_LOCKIN16.adl
D medm/c1ass/C1ASS_LOCKIN17.adl
D medm/c1ass/C1ASS_DOF_YAW.adl
D medm/c1ass/C1ASS_LOCKIN18.adl
D medm/c1ass/C1ASS_LOCKIN19.adl
D medm/c1ass/C1ASS_TRY_PIT_LOCKIN.adl
D medm/c1ass/ctrl_pit_mtrx.adl
D medm/c1ass/C1ASS_SEN_PIT_MTRX.adl
D medm/c1ass/C1ASS_LOCKIN20.adl
D medm/c1ass/C1ASS_LOCKIN21.adl
D medm/c1ass/C1ASS_LOCKIN22.adl
D medm/c1ass/C1ASS_LOCKIN23.adl
D medm/c1ass/C1ASS_LOCKIN24.adl
D medm/c1ass/C1ASS_LOCKIN25.adl
D medm/c1ass/C1ASS_LOCKIN26.adl
D medm/c1ass/C1ASS_LOCKIN27.adl
D medm/c1ass/C1ASS_TRX_PIT_LOCKIN.adl
D medm/c1ass/C1ASS_LOCKIN28.adl
D medm/c1ass/C1ASS_LOCKIN29.adl
D medm/c1ass/C1ASS_XARM_QPDS.adl
D medm/c1ass/C1ASS_YARM_QPDS.adl
M medm/c1ass/C1ASS_XARM_OUT_MTRX.adl
D medm/c1ass/ASS_SEN_MTRX.adl
D medm/c1ass/ASS_LOCKINS.adl
D medm/c1ass/sen_pit_mtrx.adl
D medm/c1ass/C1ASS_REFL11I_YAW_LOCKIN.adl
D medm/c1ass/C1ASS_LOCKIN30.adl
D medm/c1ass/C1ASS_DOF_PIT.adl
M models/c1ass.mdl
controls@rossa:/opt/rtcds/userapps/release/isc/c1 0$
|
8725
|
Wed Jun 19 16:04:56 2013 |
Jamie | Configuration | Computer Scripts / Programs | conlog startup fixed, and restarted | I cleaned up a bunch of conlog stuff to make it all a little more sane and simple. I also fixed the messy startup shenanigans, so that it should now start up sanely and on its own (using Ubuntu's native upstart system). The conlog wiki page was updated with all the new info. |
8726
|
Wed Jun 19 16:47:34 2013 |
Jamie | Configuration | Computer Scripts / Programs | conlog startup fixed, and restarted |
Quote: |
I cleaned up a bunch of conlog stuff to make it all a little more sane and simple. I also fixed the messy startup shenanigans, so that it should now start up sanely and on its own (using Ubuntu's native upstart system). The conlog wiki page was updated with all the new info.
|
By the way, I also did confirm that it is running and registering EPICS changes. |
8868
|
Thu Jul 18 10:47:21 2013 |
Jamie | Update | LSC | PRMI+Y arm ALS success! | AWESOME! You guys rock. |
8919
|
Wed Jul 24 19:21:56 2013 |
Jamie | HowTo | SUS | SUS MEDM screen modernization | I started poking around at what we want for new SUS MEDM screens. Rana and I decided we'd start with the ASC TIPTILT screens:

It's missing some things (like SIDE OSEMS) but it should provide a good starting point.
I copied the entire <userapps>/asc/common/medm/asctt directory to a new directory in our sus area:
controls@rossa:/opt/rtcds/userapps/release 0$ cp -a asc/common/medm/asctt sus/c1/medm/new
I then removed all the useless file name prefixes. We still need to go through and sed out all the ASC stuff in the MEDM files themselves.
It makes heavy use of macro substitution, which is good (it's what we're using now). So once we clean up all the channel names, we should just be able to swap out the pointers in our overview screens to the new screens (or rename things). In the meantime, during development, you can run:
controls@rossa:/opt/rtcds/userapps/release 0$ medm -x -macro "IFO=C1,ifo=c1,OPTIC=ITMX" sus/c1/medm/new/OVERVIEW.adl
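For anyone unfamiliar with the macro mechanism, here is a minimal Python sketch of what the "-macro" string does: each $(NAME) placeholder in a screen file is replaced by the value given on the command line. The function names and the channel template are hypothetical and only for illustration; this is not MEDM's actual code.

```python
import re


def parse_macros(macro_string):
    """Parse an MEDM-style macro string such as "IFO=C1,ifo=c1" into a dict."""
    macros = {}
    for pair in macro_string.split(","):
        name, _, value = pair.partition("=")
        macros[name.strip()] = value.strip()
    return macros


def expand(template, macros):
    """Replace each $(NAME) with its value; leave unknown macros untouched."""
    return re.sub(
        r"\$\((\w+)\)",
        lambda match: macros.get(match.group(1), match.group(0)),
        template,
    )


macros = parse_macros("IFO=C1,ifo=c1,OPTIC=ITMX")
# Hypothetical channel template, not taken from the real screens.
print(expand("$(IFO):SUS-$(OPTIC)_EXAMPLE", macros))
```

Leaving unknown macros untouched makes it easy to spot any placeholder that the sitemap link forgot to define.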
|
8996
|
Mon Aug 12 13:30:33 2013 |
Jamie | Update | CDS | X-End Green ASS - Roundup |
Quote: |
I'm not really sure why the ASS was involved in this. I feel like it might have been simpler to just do everything in the ASX model, to keep things cleaner. Also, the IPC blocks for this stuff (in both ASS and ASX) are not on the top level of the model. I had thought that this was expressly forbidden (although I'm not sure why). I'm emailing Jamie, to see if he remembers what, if anything, is breakable if the IPC blocks are down a level.
|
I'm not sure if it's forbidden by the RCG, but you should definitely NOT do it. All IO, whether it be ADC/DACs or IPCs, should always be at the model top level. That's what keeps things portable, and makes it easier to keep track of where our signals are going to/coming from. |
9074
|
Tue Aug 27 19:34:36 2013 |
Jamie | Configuration | CDS | front end IPC configuration | So the IPC situation on the front end network is not so great right now. For various no-longer-valid reasons, c1lsc had no RFM card, so all the IPC connections were routed through the c1rfm model on c1sus, and routed to c1lsc via dolphin PCIe as needed. As things grew, c1rfm became overloaded. Koji tried to fix the situation by breaking things out of c1rfm to make direct connections where we could. This cleared up c1rfm a bit, but now c1mcs is overloaded.
Reminder: PCIe (dolphin) is faster and higher bandwidth than RFM. The more things we can put on PCIe the better.
Attached is a graph of my rough accounting of the intended direct IPC connections between the front ends. By "intended direct" I mean what should be direct connections if we had all the appropriate hardware. Right now the actual connection graph is more convoluted than this since things are passing through c1rfm. I note this graph was NOT particularly easy to make, which is very unfortunate. I had to manually look through every model and determine the ultimate source of every incoming IPC. Kind of a pain in the butt. It would be nice if there were a simple way to represent this.
Here are some various solutions to the problem as I see it:
a) put c1lsc on the RFM network
This would allow c1lsc to talk to c1ioo, c1iscex, and c1iscey without having to go through c1sus, thereby eliminating c1rfm altogether. I'm not sure why we didn't just do this originally.
Requires:
b) put c1ioo on the PCIe network (and move c1sus's RFM card to c1lsc)
This is probably the most robust solution.
b1) There are roughly 8 IPCs going from c1ioo to c1sus, and 4 going the other way, and 3 IPCs from c1ioo to c1lsc. If we put c1ioo on PCIe, all of these connections, which currently go over RFM, would become direct PCIe connections, which would be a big win.
At this point only the end station front ends would be on RFM, and most of the connections to those come from c1lsc, so it would make sense to give c1lsc the RFM card, thereby eliminating a lot of stuff from c1rfm.
Requires:
- dolphin card for c1ioo (do the old sun machines support these? if they don't we could swap the old sun machine with one of the spare aLIGO-approved supermicro machines that we have)
- dolphin fibre to go to dolphin switch in 1X3 rack
b2) OR, we could move c1ioo to 1X4 with c1lsc and c1sus, and get a OneStop fibre cable to connect to its IO chassis. We would still need a dolphin card, but we could use copper instead of fibre. This is my preferred solution, since it moves c1ioo out of 1X1, where it's really in the way and making a lot of noise. It would also be easier to manage all the machines if they're together in one rack.
Requires:
- dolphin card for c1ioo
- dolphin copper cable for c1ioo
- OneStop fibre for c1ioo
c) put another cpu in c1sus
c1sus is (I believe) able to support another 6-core cpu. If we added more cores to c1sus, we could break up c1rfm into c1rfm0, c1rfm1, etc. This is a less elegant solution imho, but it would probably do the job.
Requires:
|
Attachment 1: hosts.png
|
|
9075
|
Tue Aug 27 19:50:06 2013 |
Jamie | Configuration | Computer Scripts / Programs | cdsutils checked out into /opt/rtcds | I have checked out the new cdsutils repository at:
/opt/rtcds/cdsutils/release
This is a new repository that is intended to hold all of our python libraries and command-line utilities for interacting with the IFO, things like:
- read/write values of EPICS channels
- interact with filter module switches
- average a test point for some amount of time
- etc.
Basically everything that used to be ez* or tds*.
There's not much in there at the moment, but hopefully it will start to get filled in soon.
WARNING:
The code in here will be used by the sites to interact with the real aLIGO IFOs. Please be careful as you develop things in here, and do so conscientiously. If you do bad things here and it messes things up at the sites people will be angry. Particularly me, since I have to support everything in here for Guardian use.
Usage
<cdsutils>/lib/cdsutils is the primary python library. For each function you want to add, put it in a new file named after the function. So for instance function "foo" should be in a file called <cdsutils>/lib/cdsutils/foo.py.
There is a command line utility at <cdsutils>/bin/cdsutils. It will automatically find anything you add to the library and expose it as a sub command (e.g. "cdsutils foo")
We'll try to put together a wiki page describing development and usage of this soon. |
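To illustrate the "one function per file, exposed as a sub-command" layout described above, here is a rough Python sketch of how such discovery could work. It is only an illustration of the idea, not the actual cdsutils implementation; the names discover and dispatch and the throwaway foo.py are hypothetical.

```python
import importlib.util
import pathlib
import tempfile


def discover(lib_dir):
    """Map each module file foo.py in lib_dir to the function foo it defines."""
    commands = {}
    for path in sorted(pathlib.Path(lib_dir).glob("*.py")):
        name = path.stem
        if name.startswith("_"):
            continue  # skip __init__.py and private helpers
        spec = importlib.util.spec_from_file_location(name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        func = getattr(module, name, None)
        if callable(func):
            commands[name] = func
    return commands


def dispatch(commands, argv):
    """Run the sub-command named by argv[0] with the remaining arguments."""
    if not argv or argv[0] not in commands:
        return "available commands: " + ", ".join(sorted(commands))
    return commands[argv[0]](*argv[1:])


# Demonstration with a throwaway library directory containing one function.
with tempfile.TemporaryDirectory() as lib_dir:
    pathlib.Path(lib_dir, "foo.py").write_text(
        "def foo(*args):\n    return 'foo called with ' + ' '.join(args)\n"
    )
    commands = discover(lib_dir)
    result = dispatch(commands, ["foo", "a", "b"])
    print(result)
```

The point of the convention is that adding a new utility requires no changes to the command line front end: drop in a file, and the sub-command appears.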
9088
|
Thu Aug 29 17:25:50 2013 |
Jamie | Update | SUS | SUS medm screen upgrade | Rana asked me to look at the SUS MEDM screen upgrade situation, and provide an upgrade prescription. Unfortunately there is not really a simple prescription that can be used, since our configuration diverges quite a bit from what's at the sites. But here's what I can say:
It looks like we already have the beginnings of an upgrade in place, so I say we just run with that. The new screens are in:
/opt/rtcds/userapps/release/sus/c1/medm/new
The primary screen is:
/opt/rtcds/userapps/release/sus/c1/medm/new/OVERVIEW.adl
This seems to be a copy of the site ASC_TIPTILT screens. (In fact I think I remember putting this here). I went ahead and did some ground work to make it easier to get these new screens into place.
- I cleaned up all the channel name prefixes so that at least the channel prefixes will resolve to our SUS channels.
- I made a link from the sitemap with some of the correct macros to fill some things in appropriately: "IFO SUS/NEW ETMX"
- I fixed the names of the sub-screens, so that it opens the correct sub-screens (although the macros seem to not be passed correctly)
At this point someone needs to just go through and fix all the channel names to match ours, and tweak the screen to our needs (there's no side OSEM, for instance). The subscreens need to be cleaned up as well.
sed replace string
If there is a specific string you want to replace every instance of in the screen, you can do that easily from the command line like this:
sed -i 's/OLD/NEW/g' path/to/file
This will replace every instance of the string OLD with the string NEW in the file path/to/file. Be careful: this will replace EVERY instance of OLD. If OLD matches things you don't want, they will be replaced as well.
The OLD pattern is actually a "regular expression", so if you want to get fancy you can match against more complicated strings. But just be careful.
If you leave out the "-i" the string-replaced text will go to stdout, instead of being replaced in the file "in place", so you can check it first.
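The same check-before-you-write idea can be sketched in Python, for cases where you want to see exactly which lines a pattern would touch. This is a minimal sketch; preview_replace and the sample screen text are hypothetical, and Python's regular expression syntax differs in some details from sed's.

```python
import re


def preview_replace(text, old, new):
    """Return (new_text, changed_lines) without touching any file."""
    pattern = re.compile(old)
    changed = []
    out_lines = []
    for number, line in enumerate(text.splitlines(), start=1):
        replaced = pattern.sub(new, line)
        if replaced != line:
            changed.append((number, line, replaced))
        out_lines.append(replaced)
    return "\n".join(out_lines), changed


# Hypothetical two-line excerpt of a screen file.
sample = 'chan="C1:ASC-TT1_PIT"\nlabel="ASC overview"'

# A broad pattern also rewrites the label, which may not be wanted.
_, broad = preview_replace(sample, "ASC", "SUS")

# A tighter pattern only touches the channel name.
_, narrow = preview_replace(sample, r"C1:ASC-", "C1:SUS-")

for number, before, after in narrow:
    print(f"line {number}: {before} -> {after}")
```

Comparing the broad and narrow results shows why it pays to make OLD as specific as possible before running the in-place version.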
query replace in emacs
If you want more fine-grained control of text replace, so that you can see what's being replaced, try using "query-replace" in emacs:
M-x query-replace
You can then type in the original string, followed by the replacement string. When it starts to run it will highlight the string that will be replaced. Hit "space" to accept or "n" to skip and go to the next.
|
9132
|
Mon Sep 16 15:29:50 2013 |
Jamie | Configuration | Computer Scripts / Programs | cdsutils checked out into /opt/rtcds | We now have a proper install of cdsutils:
controls@rossa:~ 0$ cdsutils
usage: cdsutils <cmd> <args>
Advanced LIGO Control Room Utilites
Available commands:
read read EPICS channel value
write write EPICS channel value
switch switch buttons in standard LIGO filter module
avg average NDS channels for some amount of time
servo simple integrator (pole at zero)
Add '-h' after individual commands for command help.
controls@rossa:~ 0$
It is installed in /ligo/apps/cdsutils, and should be in the path on all workstations.
The "development" source working directory is currently checked out at /opt/rtcds/cdsutils/trunk.
|
9138
|
Wed Sep 18 11:52:53 2013 |
Jamie | Update | CDS | Dataviewer cannot connect to fb |
Quote: |
Masayuki pointed out that dataviewer wasn't connecting to the fb this morning.
When I started dataviewer from the terminal I obtained the following error:
controls@pianosa:~ 0$ dataviewer
Can't find hostname `fb:8088'
Can't find hostname `fb:8088'; gethostbyname(); error=1
Warning: Not all children have same parent in XtManageChildren
Warning: Not all children have same parent in XtManageChildren
Warning: Not all children have same parent in XtManageChildren
Warning: Not all children have same parent in XtManageChildren
Warning: Not all children have same parent in XtManageChildren
Error in obtaining chan info.
Can't find hostname `fb:8088'
Can't find hostname `fb:8088'; gethostbyname(); error=1
I checked the CDS FE status screen and it looks normal. I could ping the fb and ssh to it as well.
I restarted fb to see if it made any difference. telnet fb 8088
It hasn't helped. Anything else that can be done??
|
I've fixed the problem. This was due to a change I made in the NDSSERVER environment variable so that it would work with cdsutils. I didn't realize there was an incompatibility with how dataviewer parses NDSSERVER. Joe and I will have to figure it out.
In the meantime I've changed things back so that dataviewer should now work as expected. You might have to log out and back in for it to work (or at least open a new terminal). |
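The error in the quote shows the whole string `fb:8088' being looked up as a hostname, which is what happens when a "host:port" value reaches a resolver unsplit. As a rough sketch of a tolerant parser that accepts either form (parse_ndsserver is a hypothetical helper, and the default port of 8088 is an assumption taken from the error message, not from either tool's documentation):

```python
def parse_ndsserver(value, default_port=8088):
    """Split a "host" or "host:port" string into (host, port)."""
    host, sep, port = value.strip().partition(":")
    if not host:
        raise ValueError("empty host in NDSSERVER value: %r" % value)
    return host, int(port) if sep else default_port


print(parse_ndsserver("fb:8088"))
print(parse_ndsserver("fb"))
```

Only the host part should ever be handed to name resolution; the port is kept separately.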
9309
|
Tue Oct 29 18:14:52 2013 |
Jamie | Configuration | Computer Scripts / Programs | fixing python-matplotlib from LSCSOFT | Jenne just discovered an issue with the python-matplotlib package that I knew was coming but forgot about.
We pull packages from the LSCSOFT Debian "squeeze" archive, which is a convenient way for us to install LIGO data analysis software. There are no LSCSOFT archives for Ubuntu, and Debian "squeeze" is the closest supported distribution to Ubuntu 10.04 "lucid", which is what we are using.
DASWG recently added python-matplotlib to the LSCSOFT squeeze archive. The version they added (1.0.1-3) supersedes the version in lucid, so by default apt wants to install it. However, the LSCSOFT version is compiled against a newer version of some standard libraries, so it won't function on our system and seg faults.
The solution (a solution) is to use apt "pinning" to pin the package to the lucid version that works. I've added the following file on all the 10.04 workstations to prevent the package from upgrading to the LSCSOFT version:
controls@pianosa:~ 0$ cat /etc/apt/preferences.d/pin_python-matplotlib
Package: python-matplotlib
Pin: release a=lucid
Pin-Priority: 1000
|
9310
|
Tue Oct 29 18:54:36 2013 |
Jamie | Configuration | Computer Scripts / Programs | fixing python-matplotlib from LSCSOFT |
Quote: |
controls@pianosa:~ 0$ cat /etc/apt/preferences.d/pin_python-matplotlib
Package: python-matplotlib
Pin: release a=lucid
Pin-Priority: 1000
|
I forgot that there were a couple of different matplotlib packages that all needed to be pinned. To be safe I decided to just pin all packages to the lucid versions. This will still allow us to install lscsoft packages that are not in Ubuntu, but it will always prefer packages from lucid instead. Here's the new pinning file:
controls@pianosa:~ 0$ cat /etc/apt/preferences.d/pinning
Package: *
Pin: release a=lucid
Pin-Priority: 1000
controls@pianosa:~ 0$
|
9423
|
Fri Nov 22 14:21:43 2013 |
Jamie | Update | Computer Scripts / Programs | DAQ? |
Quote: |
Jamie, I think the computers know that you are away. c1lsc keeps going down.
The short time plots are correct.
|
Is there some indication from the attached image that there is a problem with c1lsc? I see some drop outs in the channels you're plotting, but those are not c1lsc channels.
The channels with the drop outs are I think derived channels, as opposed to ones that are generated on the front end. Therefore they could have been affected by the c1auxey outages from earlier in the week. |
9426
|
Mon Nov 25 12:57:54 2013 |
Jamie | Update | CDS | timing problem at c1iscex IO chassis | There is definitely a timing distribution malfunction at the c1iscex IO chassis. There is no timing link between the "Master Timer Sequencer D050239" at the 1X6 and the c1iscex IO chassis. Link lights at both ends are dead. No timing, no running models.
It does not appear to be a problem with the Master Timer Sequencer. I moved the c1iscey link to the J15 port on the sequencer and it worked fine. This means it's either a problem with the fiber or the timing card in the IO chassis. The IO timing card is powered and does have what appear to be normal status lights on (except for the fiber link lights). It's getting what I think is the nominal 4V power. The connection to the IO chassis backplane board looks ok. So maybe it's just a dead fiber issue?
I do not know what could have been the problem with c1auxex, or if it's related to the fast timing issue. |
9433
|
Mon Dec 2 16:04:47 2013 |
Jamie | Update | CDS | c1iscex timing problem mysteriously disappears??? (thanksgiving miracle???) |
Quote: |
There is definitely a timing distribution malfunction at the c1iscex IO chassis. There is no timing link between the "Master Timer Sequencer D050239" at the 1X6 and the c1iscex IO chassis. Link lights at both ends are dead. No timing, no running models.
It does not appear to be a problem with the Master Timer Sequencer. I moved the c1iscey link to the J15 port on the sequencer and it worked fine. This means it's either a problem with the fiber or the timing card in the IO chassis. The IO timing card is powered and does have what appear to be normal status lights on (except for the fiber link lights). It's getting what I think is the nominal 4V power. The connection to the IO chassis backplane board looks ok. So maybe it's just a dead fiber issue?
I do not know what could have been the problem with c1auxex, or if it's related to the fast timing issue.
|
I just got over here from Downs, where I managed to convince Todd to let me borrow one of their three remaining timing slave boards for c1iscex. I walked down to the X end to replace the board only to discover that the link light on the existing timing board was back! c1iscex was not responding, so I hard rebooted the machine, and everything came up rosy (all green!):

To repeat, I DID NOTHING. The thing was working when I got here. I have no idea when it came back, or how, but it's at least working for the moment. I re-enabled the watchdog for ETMX SUS and it's now damped normally.
I'm going to hold on to the timing card for a couple of days, in case the failure comes back, but we'll need to return it to Downs soon, and probably think about getting some spare backups from Columbia. |
9502
|
Fri Dec 20 10:08:43 2013 |
Jamie | Configuration | General | netgpibdata is working again now |
Quote: |
Now netgpibdata is working again.
Usage:
cd /cvs/cds/rtcds/caltech/c1/scripts/general/netgpibdata
./netgpibdata -i 192.168.113.108 -d AG4395A -a 10 -f meas01
./netgpibdata -i 192.168.113.105 -d SR785 -a 6 -f meas01
|
Just wanted to point out that the correct "modern" path to this stuff is:
/opt/rtcds/caltech/c1/scripts/general/netgpibdata
This is, of course, the same directory, but under the correct "/opt/rtcds", instead of the old, incorrect "/cvs/cds". |
9503
|
Fri Dec 20 11:40:13 2013 |
Jamie | Summary | CDS | RCG parsing bug? | I submitted a bug report for this:
https://bugzilla.ligo-wa.caltech.edu/bugzilla3/show_bug.cgi?id=553
However, given how old our RCG version is (2.5 vs. 2.8 currently deployed at the sites) I don't think we're going to see any traction on this. Even if this is still a bug in 2.8, they'll only fix it in 2.8. There's no way they're going to make a bug fix release for 2.5.
We need to upgrade. |
9513
|
Thu Jan 2 10:15:20 2014 |
Jamie | Summary | General | linux1 RAID crash & recovery | Well done Koji! I'm very impressed with the sysadmin skillz. |
9536
|
Tue Jan 7 23:53:35 2014 |
Jamie | Update | CDS | daqd can't connect to c1vac1, c1vac2 | daqd is logging the following error messages to its log related to the fact that it can't connect to c1vac1 and c1vac2:
CAC: Unable to connect because "Connection timed out"
CA.Client.Exception...............................................
Warning: "Virtual circuit disconnect"
Context: "c1vac2.martian:5064"
Source File: ../cac.cpp line 1127
Current Time: Tue Jan 07 2014 23:50:53.355609430
..................................................................
CAC: Unable to connect because "Connection timed out"
CA.Client.Exception...............................................
Warning: "Virtual circuit disconnect"
Context: "c1vac1.martian:5064"
Source File: ../cac.cpp line 1127
Current Time: Tue Jan 07 2014 23:50:53.356568469
..................................................................
Not sure if this is related to the full /frames issue that we've been seeing. |
9574
|
Fri Jan 24 13:10:12 2014 |
Jamie | HowTo | LSC | Procedure to measure PRC length |
Quote: |
I wrote a MATLAB script that takes as input the measured distances and produces the optical path lengths. The script also produces a drawing of the setup as reconstructed, showing the measurement points, the mirrors, the reference base plates, and the beam path. Here is an example output, which can be used to understand which five distances are to be measured. I used dummy measured distances to produce it.

|
This path does not look correct to me. Maybe it's because this is supposed to represent "optical path lengths" as opposed to actual physical location of optics, but I think locations should be checked. For instance, PRM looks like it's floating in mid-air between the BS and ITMX chambers, and PR2 is not located behind ITMX. Actually, come to think of it, it might just be that ITMX (or the ITMs in general) is in the wrong place?
Here is a similar diagram I produced when building a Finesse model of the 40m, based on the CAD drawing that Manasa is maintaining:

|
9966
|
Fri May 16 20:55:18 2014 |
Jamie | Frogs | lore | un-full-screening Ubuntu windows with F11 | Last week Rana and I struggled to figure out how to un-full-screen windows on the Ubuntu workstations that appeared to be stuck in some sort of full screen mode such that the "Titlebar" was not on the screen. Nothing seemed to work. We were in despair.
Well, there is now hope: it appears that this really is a "fullscreen" mode that can be activated by hitting F11. It can therefore easily be undone by hitting F11 again. |
10018
|
Tue Jun 10 09:25:29 2014 |
Jamie | Update | CDS | Computer status: should not be changing names | I really think it's a bad idea to be making all these name changes. You're making things much much harder for yourselves.
Instead of repointing everything to a new host, you should have just changed the DNS to point the name "linux1" to the IP address of the new server. That way you wouldn't need to reconfigure all of the clients. That's the whole point of name service: use a name so that you don't need to point to a number.
Also, pointing to an IP address for this stuff is not a good idea. If the IP address of the server changes, everything will break again.
Just point everything to linux1, and make the DNS entries for linux1 point to the IP address of chiara. You're doing all this work for nothing!
RXA: Of course, I understand what DNS means. I wanted to make the changes to the startup to remove any misconfigurations or spaghetti mount situations (of which we found many). The way the VME162 are designed, changing the name doesn't make the fix - it uses the number instead. And, of course, the main issue was not the DNS, but just that we had to setup RSH on the new machine. This is all detailed in the ELOG entries we've made, but it might be difficult to understand remotely if you are not familiar with the 40m CDS system. |
10033
|
Thu Jun 12 15:31:47 2014 |
Jamie | Update | CDS | Note on cables for talking to slow computers |
Quote: |
We have (now) in the lab 2 cables that are RJ45-DB9. The gray one is LIGO-made, while the blue one is store-bought.
The gray LIGO-made one works, but the blue store-bought one does not. I checked their pinouts, and they are completely different. On the sketch below, the pictures of the connectors is me looking at them face-on, with the cables going out the back of the page. The DB9 is female.
|
There are RJ45-DB9 adapters in the big spinny rack next to the linux1 rack that are for this exact purpose. Just use a standard ethernet cable with them. |
10040
|
Sun Jun 15 14:26:30 2014 |
Jamie | Omnistructure | CDS | cdsutils re-installed |
Quote: |
CDSUTILS is also gone from the path on all the workstations, so we need Jamie to tell us by ELOG how to set it up, or else we have to use ezcaread / ezcawrite forever.
|
It's in the elog already: http://nodus.ligo.caltech.edu:8080/40m/9922
But it seems like things still haven't fully recovered, or have recovered to an old state? Why is the cdsutils install I previously did in /ligo/apps now missing? It seems like other directories are missing as well.
There's also a user:group issue with the /home/cds mounts. Everything in those mount points is owned nobody:nogroup.
I also can't log into pianosa and rosalba. |
10041
|
Sun Jun 15 14:41:08 2014 |
Jamie | Omnistructure | CDS | cdsutils re-installed |
Quote: |
Quote: |
CDSUTILS is also gone from the path on all the workstations, so we need Jamie to tell us by ELOG how to set it up, or else we have to use ezcaread / ezcawrite forever.
|
It's in the elog already: http://nodus.ligo.caltech.edu:8080/40m/9922
But it seems like things still haven't fully recovered, or have recovered to an old state? Why is the cdsutils install I previously did in /ligo/apps now missing? It seems like other directories are missing as well.
There's also a user:group issue with the /home/cds mounts. Everything in those mount points is owned nobody:nogroup.
I also can't log into pianosa and rosalba.
|
I also still think it's a bad idea for everything to be mounting /home/cds from an IP address. Just make a new DNS entry for linux1 and leave everything as it was. |
10190
|
Sun Jul 13 11:37:36 2014 |
Jamie | Update | Electronics | New Prologix GPIB-Ethernet controller |
Quote: |
I have configured a NEW Prologix GPIB-Ethernet controller to use with HP8591E Spectrum analyzer that sits right next to the control room computers.
Static IP: 192.168.113.109
Mask: 255.255.255.0
Gateway: 192.168.113.2
I have no clue how to give it a name like "something.martian" and to update the martian host table (Somebody please help!!)
|
The instructions for adding a name to the martian DNS table are in the wiki page that I pointed you to:
https://wiki-40m.ligo.caltech.edu/Martian_Host_Table |
10276
|
Sat Jul 26 13:38:34 2014 |
Jamie | Update | General | Data Acquisition from FC into EPICS Channels |
Quote: |
I succeeded in creating a new channel access server hosted on domenica ( R Pi) for continuous data acquisition from the FC into accessible channels. For this I have written a ctypes interface between EPICS and the C interface code to write data into the channels. The channels which I created are:
C1:ALS-X-BEAT-NOTE-FREQ
C1:ALS-Y-BEAT-NOTE-FREQ
The scripts I have written for this can be found in:
db script in: /users/akhil/fcreadoutIoc/fcreadoutApp/Db/fcreadout.db
Python code: /users/akhil/fcreadoutIoc/pycall
C code: /users/akhil/fcreadoutIoc/FCinterfaceCcode.c
I will give the standard channel names(similar to the names on the channel root)once the testing is completed and confirm that data from FC is consistent with the C code readout. Once ready I will run the code forever so that both the server and data acquisition are in process always.
Yesterday, when I set out to test the channel, I faced few serious issues in booting the raspberry pi. However, I have backed up the files on the Pi and will try to debug the issue very soon( I will test with Eric Q's R Pi).
To run these codes one must be root ( sudo python pycall, sudo ./FCinterfaceCcode) because the HID- devices can be written to only by the root(should look into solving this issue).
Instructions for Installation of EPICS, and how to create channel server on Pi will be described in detail in 40m Wiki ( FOLL page).
|
controls@rossa|~ 2> ls /users/akhil/fcreadoutIoc
ls: cannot access /users/akhil/fcreadoutIoc: No such file or directory
controls@rossa|~ 2>
This code should be in the 40m SVN somewhere, not just stored on the RPi.
I'm still confused why python is in the mix here at all. It doesn't make any sense at all that a C program (EPICS IOC) would be calling out to a python program (pycall) that then calls out to a C program (FCinterfaceCcode). That's bad programming. Streamline the program and get rid of python.
You also definitely need to fix whatever the issue is that requires running the program as root. We can't have programs like this run as root. |
11384
|
Tue Jun 30 11:33:00 2015 |
Jamie | Summary | CDS | prepping for CDS upgrade | This is going to be a big one. We're at version 2.5 and we're going to go to 2.9.3.
RCG components that need to be updated:
- mbuf kernel module
- mx_stream driver
- iniChk.pl script
- daqd
- nds
Supporting software:
- EPICS 3.14.12.2_long
- ldas-tools (framecpp) 1.19.32-p1
- libframe 8.17.2
- gds 2.16.3.2
- fftw 3.3.2
Things to watch out for:
- RTS 2.6:
  - raw minute trend frame location has changed (CRC-based subdirectory)
  - new kernel patch
- RTS 2.7:
  - supports "commissioning frames", which we will probably not utilize. need to make sure that we're not writing extra frames somewhere
- RTS 2.8:
  - "slow" (EPICS) data from the front-end processes is acquired via DAQ network, and not through EPICS. This will increase traffic on the DAQ lan. Hopefully this will not be an issue, and the existing network infrastructure can handle it, but it should be monitored.
|
11390
|
Wed Jul 1 19:16:21 2015 |
Jamie | Summary | CDS | CDS upgrade in progress | The CDS upgrade is now underway
Here's what's happened so far:
/opt/rtcds/rtscore/tags/advLigoRTS-2.9.4
That's it for today. Will pick up again first thing tomorrow |
11393
|
Tue Jul 7 18:27:54 2015 |
Jamie | Summary | CDS | CDS upgrade: progress! | After a couple of days of struggle, I made some progress on the CDS upgrade today:

Front end status:
- RTS upgraded to 2.9.4, and linked in as "release":
/opt/rtcds/rtscore/release -> tags/advLigoRTS-2.9.4
- mbuf kernel module built and installed
- All front ends have been rebooted with the latest patched kernel (from 2.6 upgrade)
- All models have been rebuilt, installed, restarted. Only minor model issues had to be corrected (unterminated unused inputs mostly).
- awgtpman rebuilt, and installed/running on all front-ends
- open-mx upgraded to 1.5.2:
/opt/open-mx -> open-mx-1.5.2
- All front ends running latest version of mx_stream, built against 2.9.4 and open-mx-1.5.2.
We have new GDS overview screens for the front end models:

It's possible that our current lack of IRIG-B GPS distribution means that the 'TIM' status bit will always be red on the IOP models. Will consult with Rolf.
There are other new features in the front ends that I can get into later.
DAQ (fb) status:
- daqd and nds rebuilt against 2.9.4, both now running on fb
40m daqd compile flags:
cd src/daqd
./configure --enable-debug --disable-broadcast --without-myrinet --with-mx --enable-local-timing --with-epics=/opt/rtapps/epics/base --with-framecpp=/opt/rtapps/framecpp
make clean
make
install daqd /opt/rtcds/caltech/c1/target/fb/
However, daqd has unfortunately been very unstable, and I've been trying to figure out why. I originally thought it was some sort of timing issue, but now I'm not so sure.
I had to make the following changes to the daqdrc:
set gps_leaps = 820108813 914803214 1119744016;
That enumerates the leap seconds since some reference time. I'm not sure it actually does anything, but I added the latest leap second anyway. I also set:
set symm_gps_offset=315964803;
This updates the silly, arbitrary GPS offset, that is required to be correct when not using external GPS reference.
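The three gps_leaps values above look like the GPS second occupied by each inserted leap second. A minimal sketch to check that reading, assuming the standard GPS epoch (1980-01-06 00:00:00 UTC) and the published GPS-UTC offsets; how daqd actually interprets the list is not confirmed here, and the names below are illustrative only:

```python
from datetime import datetime, timezone

# GPS epoch (1980-01-06 00:00:00 UTC) expressed as a Unix timestamp.
GPS_EPOCH_UNIX = int(datetime(1980, 1, 6, tzinfo=timezone.utc).timestamp())

# UTC midnight immediately following each inserted leap second, paired with
# the GPS-UTC offset (in seconds) that applied *before* that leap second.
LEAPS = [
    (datetime(2006, 1, 1, tzinfo=timezone.utc), 13),
    (datetime(2009, 1, 1, tzinfo=timezone.utc), 14),
    (datetime(2015, 7, 1, tzinfo=timezone.utc), 16),
]


def leap_second_gps(midnight_utc, offset_before):
    """GPS second occupied by the leap second (23:59:60 UTC) before midnight_utc."""
    unix_midnight = int(midnight_utc.timestamp())
    return unix_midnight - GPS_EPOCH_UNIX + offset_before


gps_leaps = [leap_second_gps(t, off) for t, off in LEAPS]
print(GPS_EPOCH_UNIX, gps_leaps)
```

The GPS epoch itself comes out as a Unix time of 315964800, three seconds less than the symm_gps_offset value set above.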
Finally, the last thing I did that finally got it running stably was to turn off all trend frame writing:
# start trender;
# start trend-frame-saver;
# sync trend-frame-saver;
# start minute-trend-frame-saver;
# sync minute-trend-frame-saver;
# start raw_minute_trend_saver;
For whatever reason, it's the trend frame writing that was causing daqd to fall over after a short amount of time. I'll continue investigating tomorrow.
We still have a lot of cleanup, burt restores, testing, etc. to do, but we're getting there. |
11396
|
Wed Jul 8 20:37:02 2015 |
Jamie | Summary | CDS | CDS upgrade: one step forward, two steps back | After determining yesterday that all the daqd issues were coming from the frame writing, I started to dig into it more today. I also spoke to Keith Thorne, and got some good suggestions from Gerrit Kuhn at GEO.
I realized that it probably wasn't the trend writing per se, but that turning on more writing to disk was causing increased load on daqd, and consequently on the system itself. With more frame writing turned on, the memory consumption increased to the point of maxing out the physical RAM. The system then probably started swapping, which certainly would have choked daqd.
I noticed that fb only had 4G of RAM, which Keith suggested was just not enough. Even if the memory consumption of daqd has increased significantly, it still seems like 4G would not be enough. I opened up fb only to find that fb actually had 8G of RAM installed! Not sure what happened to the other 4G, but somehow they were not visible to the system. Koji and I eventually determined, via some frankenstein operations with megatron, that the RAM was just dead. We then pulled 4G of RAM from megatron and replaced the bad RAM in fb, so that fb now has a full 8G of RAM.
Unfortunately, when we got fb fully back up and running we found that fb is not able to see any of the other hosts on the data concentrator network. mx_info, which displays the card and network status for the myricom myrinet fiber card, shows:
MX Version: 1.2.16
MX Build: controls@fb:/opt/src/mx-1.2.16 Tue May 21 10:58:40 PDT 2013
1 Myrinet board installed.
The MX driver is configured to support a maximum of:
8 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
===================================================================
Instance #0: 299.8 MHz LANai, PCI-E x8, 2 MB SRAM, on NUMA node 0
Status: Running, P0: Wrong Network
Network: Myrinet 10G
MAC Address: 00:60:dd:46:ea:ec
Product code: 10G-PCIE-8AL-S
Part number: 09-03916
Serial number: 352143
Mapper: 00:60:dd:46:ea:ec, version = 0x63e745ee, configured
Mapped hosts: 1
ROUTE COUNT
INDEX MAC ADDRESS HOST NAME P0
----- ----------- --------- ---
0) 00:60:dd:46:ea:ec fb:0 D 0,0
Note that all front end machines should be listed in the table at the bottom, and they're not. Also note the "Wrong Network" note in the Status line above. It appears that the card has maybe been initialized in a bad state? Or Koji and I somehow disturbed the network when we were cleaning up things in the rack. "sudo /etc/init.d/mx restart" on fb doesn't solve the problem. We even rebooted fb and it didn't seem to help.
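A check like "are all the front ends in the mx_info peer table?" is easy to script. The following is only a sketch: the row format is inferred from the output pasted in this log, the expected host list is taken from the front ends named in these entries, and the function names are made up for illustration.

```python
import re

# Matches rows of the mx_info peer table, e.g. "0) 00:60:dd:46:ea:ec fb:0 D 0,0".
ROW = re.compile(r"^\s*\d+\)\s+((?:[0-9a-f]{2}:){5}[0-9a-f]{2})\s+(\S+):\d+\b", re.I)


def mapped_hosts(mx_info_text):
    """Return the set of host names listed in the mx_info peer table."""
    hosts = set()
    for line in mx_info_text.splitlines():
        m = ROW.match(line)
        if m:
            hosts.add(m.group(2))
    return hosts


def missing_hosts(mx_info_text, expected):
    """Return the expected hosts that do not appear in the peer table."""
    return sorted(set(expected) - mapped_hosts(mx_info_text))


# Assumed host list: fb plus the five front end machines.
EXPECTED = ["fb", "c1sus", "c1iscex", "c1iscey", "c1lsc", "c1ioo"]

sample = """\
Mapped hosts: 1
ROUTE COUNT
INDEX MAC ADDRESS HOST NAME P0
----- ----------- --------- ---
0) 00:60:dd:46:ea:ec fb:0 D 0,0
"""
print(missing_hosts(sample, EXPECTED))
```

Fed the table above, this reports every front end as missing, which is the symptom described here.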
In any event, we're back to no data flow. I'll pick up again tomorrow. |
11397
|
Wed Jul 8 21:02:02 2015 |
Jamie | Summary | CDS | CDS upgrade: another step forward, so we're back to where we started (plus a bit?) | Koji did a bit of googling to determine that 'Wrong Network' status message could be explained by the fb myrinet operating in the wrong mode:
(This was the useful link to track down the issue (KA))
Network: Myrinet 10G
I didn't notice it before, but we should in fact be operating in "Ethernet" mode, since that's the fabric we're using for the DC network. Digging a bit deeper we found that the new version of mx (1.2.16) had indeed been configured with a different compile option than the 1.2.15 version had:
controls@fb ~ 0$ grep '$ ./configure' /opt/src/mx-1.2.15/config.log
$ ./configure --enable-ether-mode --prefix=/opt/mx
controls@fb ~ 0$ grep '$ ./configure' /opt/src/mx-1.2.16/config.log
$ ./configure --enable-mx-wire --prefix=/opt/mx-1.2.16
controls@fb ~ 0$
So that would entirely explain the problem. I re-linked mx to the older version (1.2.15), reloaded the mx drivers, and everything showed up correctly:
controls@fb ~ 0$ /opt/mx/bin/mx_info
MX Version: 1.2.12
MX Build: root@fb:/root/mx-1.2.12 Mon Nov 1 13:34:38 PDT 2010
1 Myrinet board installed.
The MX driver is configured to support a maximum of:
8 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
===================================================================
Instance #0: 299.8 MHz LANai, PCI-E x8, 2 MB SRAM, on NUMA node 0
Status: Running, P0: Link Up
Network: Ethernet 10G
MAC Address: 00:60:dd:46:ea:ec
Product code: 10G-PCIE-8AL-S
Part number: 09-03916
Serial number: 352143
Mapper: 00:60:dd:46:ea:ec, version = 0x00000000, configured
Mapped hosts: 6
ROUTE COUNT
INDEX MAC ADDRESS HOST NAME P0
----- ----------- --------- ---
0) 00:60:dd:46:ea:ec fb:0 1,0
1) 00:25:90:0d:75:bb c1sus:0 1,0
2) 00:30:48:be:11:5d c1iscex:0 1,0
3) 00:30:48:d6:11:17 c1iscey:0 1,0
4) 00:30:48:bf:69:4f c1lsc:0 1,0
5) 00:14:4f:40:64:25 c1ioo:0 1,0
controls@fb ~ 0$
The front end hosts are also showing good omx info (even though they had been previously as well):
controls@c1lsc ~ 0$ /opt/open-mx/bin/omx_info
Open-MX version 1.5.2
build: controls@fb:/opt/src/open-mx-1.5.2 Tue May 21 11:03:54 PDT 2013
Found 1 boards (32 max) supporting 32 endpoints each:
c1lsc:0 (board #0 name eth1 addr 00:30:48:bf:69:4f)
managed by driver 'igb'
Peer table is ready, mapper is 00:30:48:d6:11:17
================================================
0) 00:30:48:bf:69:4f c1lsc:0
1) 00:60:dd:46:ea:ec fb:0
2) 00:25:90:0d:75:bb c1sus:0
3) 00:30:48:be:11:5d c1iscex:0
4) 00:30:48:d6:11:17 c1iscey:0
5) 00:14:4f:40:64:25 c1ioo:0
controls@c1lsc ~ 0$
This got all the mx_stream connections back up and running.
Unfortunately, daqd is back to being a bit flaky. With all frame writing enabled we saw daqd crash again. I then shut off all trend frame writing and we're back to a marginally stable state: we have data flowing from all front ends, and full frames are being written, but not trends.
I'll pick up on this again tomorrow, and maybe try to rebuild the new version of mx with the proper flags. |
11398
|
Thu Jul 9 13:26:47 2015 |
Jamie | Summary | CDS | CDS upgrade: new mx 1.2.16 installed | I rebuilt/installed mx 1.2.16 to use "ether-mode", instead of the default MX-10G:
controls@fb /opt/src/mx-1.2.16 0$ ./configure --enable-ether-mode --prefix=/opt/mx-1.2.16
...
controls@fb /opt/src/mx-1.2.16 0$ make
..
controls@fb /opt/src/mx-1.2.16 0$ make install
...
I then rebuilt/installed daqd so that it properly linked against the updated mx install:
controls@fb /opt/rtcds/rtscore/release/src/daqd 0$ ./configure --enable-debug --disable-broadcast --without-myrinet --with-mx --with-epics=/opt/rtapps/epics/base --with-framecpp=/opt/rtapps/framecpp --enable-local-timing
...
controls@fb /opt/rtcds/rtscore/release/src/daqd 0$ make
...
controls@fb /opt/rtcds/rtscore/release/src/daqd 0$ install daqd /opt/rtcds/caltech/c1/target/fb/
It's now back to running and receiving data from the front ends (still not stable yet, though). |
11400
|
Thu Jul 9 16:50:13 2015 |
Jamie | Summary | CDS | CDS upgrade: if all else fails try throwing metal at the problem | I roped Rolf into coming over and adding his eyes to the problem. After much discussion we couldn't come up with any reasonable explanation for the problems we've been seeing other than daqd just needing a lot more resources than it did before. He said he had some old Sun SunFire X4600s from which we could pilfer memory. I went over to Downs and ripped all the CPU/memory cards out of one of his machines and stuffed them into fb:
fb now has 8 CPU and 16G of RAM
Unfortunately, this is still not enough. Or at least it didn't solve the problem; daqd is showing the same instabilities, falling over a couple of minutes after I turn on trend frame writing. As always, before daqd fails it starts spitting out the following to the logs:
[Thu Jul 9 16:37:09 2015] main profiler warning: 0 empty blocks in the buffer
followed by lines like:
[Thu Jul 9 16:37:27 2015] GPS MISS dcu 44 (ASX); dcu_gps=1120520264 gps=1120519812
right before it dies.
I'm no longer convinced that this is a resource issue, though, judging by the resource usage right before the crash:
top - 16:47:32 up 48 min, 5 users, load average: 0.91, 0.62, 0.61
Tasks: 2 total, 0 running, 2 sleeping, 0 stopped, 0 zombie
Cpu(s): 8.9%us, 0.9%sy, 0.0%ni, 89.1%id, 0.9%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 15952104k total, 13063468k used, 2888636k free, 138648k buffers
Swap: 1023996k total, 0k used, 1023996k free, 7672292k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12016 controls 20 0 8098m 4.4g 104m S 106 29.1 6:45.79 daqd
4953 controls 20 0 53580 6092 5096 S 0 0.0 0:00.04 nds
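The memory figures in the top snapshot above can be cross-checked with a few lines of Python. This is a rough sketch that parses only the two summary lines as pasted here; the helper name is made up, and counting buffers and page cache as reclaimable is a simplification.

```python
import re


def parse_kb_fields(line):
    """Parse a top summary line like 'Mem: 15952104k total, ...' into {label: kB}."""
    return {label: int(value) for value, label in re.findall(r"(\d+)k (\w+)", line)}


mem = parse_kb_fields("Mem: 15952104k total, 13063468k used, 2888636k free, 138648k buffers")
swap = parse_kb_fields("Swap: 1023996k total, 0k used, 1023996k free, 7672292k cached")

KB_PER_GIB = 1024 * 1024
free_gib = mem["free"] / KB_PER_GIB
# Buffers and page cache can be reclaimed by the kernel, so they are counted
# here as memory that a process could still obtain.
reclaimable_gib = (mem["free"] + mem["buffers"] + swap["cached"]) / KB_PER_GIB

print(f"free: {free_gib:.1f} GiB")
print(f"free + buffers + cached: {reclaimable_gib:.1f} GiB")
print(f"swap used: {swap['used']} kB")
```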
Load average less than 1 per CPU, plenty of free memory (~3G free, 0 swap), no waiting for IO (0.9%wa), etc. daqd is utilizing lots of threads, which should be spread across many cpus, so even the >100%CPU should be ok. I'm at a loss... |
11402
|
Mon Jul 13 01:11:14 2015 |
Jamie | Summary | CDS | CDS upgrade: current assessment | daqd is still behaving unstably. It's still unclear what the issue is.
The current failures look like disk IO contention. However, it's hard to see any evidence that daqd is suffering from large IO wait while it's failing.
The frame size itself is currently smaller than it was before the upgrade:
controls@fb /frames/full 0$ ls -alth 11190 | head
total 369G
drwxr-xr-x 321 controls controls 36K Jul 12 22:20 ..
drwxr-xr-x 2 controls controls 268K Jun 23 06:06 .
-rw-r--r-- 1 controls controls 67M Jun 23 06:06 C-R-1119099984-16.gwf
-rw-r--r-- 1 controls controls 68M Jun 23 06:06 C-R-1119099968-16.gwf
-rw-r--r-- 1 controls controls 69M Jun 23 06:05 C-R-1119099952-16.gwf
-rw-r--r-- 1 controls controls 69M Jun 23 06:05 C-R-1119099936-16.gwf
-rw-r--r-- 1 controls controls 67M Jun 23 06:05 C-R-1119099920-16.gwf
-rw-r--r-- 1 controls controls 68M Jun 23 06:05 C-R-1119099904-16.gwf
-rw-r--r-- 1 controls controls 68M Jun 23 06:04 C-R-1119099888-16.gwf
controls@fb /frames/full 0$ ls -alth 11208 | head
total 17G
drwxr-xr-x 2 controls controls 20K Jul 13 01:00 .
-rw-r--r-- 1 controls controls 45M Jul 13 01:00 C-R-1120809632-16.gwf
-rw-r--r-- 1 controls controls 50M Jul 13 01:00 C-R-1120809408-16.gwf
-rw-r--r-- 1 controls controls 50M Jul 13 00:56 C-R-1120809392-16.gwf
-rw-r--r-- 1 controls controls 50M Jul 13 00:56 C-R-1120809376-16.gwf
-rw-r--r-- 1 controls controls 50M Jul 13 00:56 C-R-1120809360-16.gwf
-rw-r--r-- 1 controls controls 50M Jul 13 00:55 C-R-1120809344-16.gwf
-rw-r--r-- 1 controls controls 50M Jul 13 00:55 C-R-1120809328-16.gwf
controls@fb /frames/full 0$
This would seem to indicate that it's not an increase in frame size that's to blame.
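The same listings can also be used to spot holes in the frame record. A small sketch, assuming the usual <site>-<type>-<GPS start>-<duration>.gwf file naming; the function name is made up for illustration:

```python
import re

# Assumed naming convention: <site>-<type>-<GPS start>-<duration>.gwf
FRAME = re.compile(r"^[A-Z]-[A-Z]-(\d+)-(\d+)\.gwf$")


def frame_gaps(filenames):
    """Return (gap_start_gps, missing_seconds) for each hole between consecutive frames."""
    spans = sorted(
        (int(m.group(1)), int(m.group(2)))
        for m in map(FRAME.match, filenames)
        if m
    )
    gaps = []
    for (start, dur), (next_start, _) in zip(spans, spans[1:]):
        end = start + dur
        if next_start > end:
            gaps.append((end, next_start - end))
    return gaps


# File names copied from the 11208 listing above.
names = [
    "C-R-1120809632-16.gwf", "C-R-1120809408-16.gwf", "C-R-1120809392-16.gwf",
    "C-R-1120809376-16.gwf", "C-R-1120809360-16.gwf", "C-R-1120809344-16.gwf",
    "C-R-1120809328-16.gwf",
]
print(frame_gaps(names))
```

On the 11208 listing this finds one hole of 208 seconds before the newest frame, which matches the jump in modification times from 00:56 to 01:00.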
Because slow data is now transported to daqd over the MX data concentrator network rather than via EPICS (RTS 2.8), there is more traffic on the MX network. I note also that the channel lists have increased in size:
controls@fb /opt/rtcds/caltech/c1/chans/daq 0$ ls -alt archive/C1LSC* | head -20
-rw-r--r-- 1 4294967294 4294967294 262554 Jul 6 18:21 archive/C1LSC_150706_182146.ini
-rw-r--r-- 1 4294967294 4294967294 262554 Jul 6 18:16 archive/C1LSC_150706_181603.ini
-rw-r--r-- 1 4294967294 4294967294 262554 Jul 6 16:09 archive/C1LSC_150706_160946.ini
-rw-r--r-- 1 4294967294 4294967294 43366 Jul 1 16:05 archive/C1LSC_150701_160519.ini
-rw-r--r-- 1 4294967294 4294967294 43366 Jun 25 15:47 archive/C1LSC_150625_154739.ini
...
I would have thought, though, that data transmission errors would show up in the daqd status bits. |
11404
|
Mon Jul 13 18:12:50 2015 |
Jamie | Summary | CDS | CDS upgrade: left running in semi-stable configuration | I have been watching daqd all day and I don't feel particularly closer to understanding what the issues are. However, things are
Interestingly, though, the stability appears highly variable at the moment. This morning, daqd was very unstable and was crashing within a couple of minutes of starting. However, this afternoon things seemed much more stable. As of this moment, daqd has been running for 25 minutes now, writing full frames as well as minute and second trends (no minute_raw), without any issues. What has changed?
To reiterate, I have been closely watching disk IO to /frames. I see no indication that there is any disk contention while daqd is failing. It's still possible, though, that there are disk IO issues affecting daqd at a level that is not readily visible. From dstat, the frame writes are visible, but nothing else.
I have made one change that could be positively affecting things right now: I un-exported /frames from NFS. This eliminates anything external from reading /frames over the network. In particular, it also shuts off the transfer of frames to LDAS. Since I've done this, daqd has appeared to be more stable. It's NOT totally stable, though, as the instance that I described above did eventually just die after 43 minutes, as I was writing this.
In any event, as things are currently as stable as I've seen them, I'm leaving it running in this configuration for the moment, with the following relevant daqdrc parameters:
start main 16;
start frame-saver;
sync frame-saver;
start trender 60 60;
start trend-frame-saver;
sync trend-frame-saver;
start minute-trend-frame-saver;
sync minute-trend-frame-saver;
start profiler;
start trend profiler; |
11406
|
Tue Jul 14 09:08:37 2015 |
Jamie | Summary | CDS | CDS upgrade: left running in semi-stable configuration | Overnight daqd restarted itself only about twice an hour, which is an improvement:
controls@fb /opt/rtcds/caltech/c1/target/fb 0$ tail logs/restart.log
daqd: Tue Jul 14 03:13:50 PDT 2015
daqd: Tue Jul 14 04:01:39 PDT 2015
daqd: Tue Jul 14 04:09:57 PDT 2015
daqd: Tue Jul 14 05:02:46 PDT 2015
daqd: Tue Jul 14 06:01:57 PDT 2015
daqd: Tue Jul 14 06:43:18 PDT 2015
daqd: Tue Jul 14 07:02:19 PDT 2015
daqd: Tue Jul 14 07:58:16 PDT 2015
daqd: Tue Jul 14 08:02:44 PDT 2015
daqd: Tue Jul 14 09:02:24 PDT 2015
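One quick way to look for a pattern in restart.log is to tabulate the minute-of-hour of each restart. This is only a sketch: it assumes every line has the "daqd: <weekday> <month> <day> <HH:MM:SS> <tz> <year>" layout shown above, and the five-minute window is an arbitrary choice.

```python
def restart_minutes(log_text):
    """Return the minute-of-hour of each 'daqd: <date>' line in restart.log."""
    minutes = []
    for line in log_text.splitlines():
        fields = line.split()
        # Expected layout: daqd: Tue Jul 14 03:13:50 PDT 2015
        if len(fields) == 7 and fields[0] == "daqd:":
            _hh, mm, _ss = fields[4].split(":")
            minutes.append(int(mm))
    return minutes


log = """\
daqd: Tue Jul 14 03:13:50 PDT 2015
daqd: Tue Jul 14 04:01:39 PDT 2015
daqd: Tue Jul 14 04:09:57 PDT 2015
daqd: Tue Jul 14 05:02:46 PDT 2015
daqd: Tue Jul 14 06:01:57 PDT 2015
daqd: Tue Jul 14 06:43:18 PDT 2015
daqd: Tue Jul 14 07:02:19 PDT 2015
daqd: Tue Jul 14 07:58:16 PDT 2015
daqd: Tue Jul 14 08:02:44 PDT 2015
daqd: Tue Jul 14 09:02:24 PDT 2015
"""

mins = restart_minutes(log)
near_hour = [m for m in mins if m < 5]
print(f"{len(near_hour)} of {len(mins)} restarts fall in the first 5 minutes of an hour")
```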
Un-exporting /frames might have helped a bit. However, the problem is obviously still not fixed. |
11412
|
Tue Jul 14 16:51:01 2015 |
Jamie | Summary | CDS | CDS upgrade: problem is not disk access | I think I have now determined once and for all that the daqd problems are NOT due to disk IO contention.
I have mounted a tmpfs at /frames/tmp and have told daqd to write frames there. The tmpfs exists entirely in RAM. There is essentially zero IO wait for such a filesystem, so daqd should never have trouble writing out the frames.
But yet daqd continues to fail with the "0 empty blocks in the buffer" warnings. I've been down a rabbit hole. |
11415
|
Wed Jul 15 13:19:14 2015 |
Jamie | Summary | CDS | CDS upgrade: reducing mx end-points as last ditch effort | I tried one last thing, suggested by Keith and Gerrit. I tried reducing the number of mx end-points on fb to one, which should reduce the total number of fb threads, in the hope that the extra threads were causing the chokes.
On Tue, Jul 14 2015, Keith Thorne <kthorne@ligo-la.caltech.edu> wrote:
> Assumptions
> 1) Before the upgrade (from RCG 2.6?), the DAQ had been working, reading out front-ends, writing frames trends
> 2) In upgrading to RCG 2.9, the mx start-up on the frame builder was modified to use multiple end-points
> (i.e. /etc/init.d/mx has a line like
> # 1 10G card - X2
> MX_MODULE_PARAMS="mx_max_instance=1 mx_max_endpoints=16 $MX_MODULE_PARAMS"
> (This can be confirmed by the daqd log file with lines at the top like
> 263596
> MX has 16 maximum end-points configured
> 2 MX NICs available
> [Fri Jul 10 16:12:50 2015] ->4: set thread_stack_size=10240
> [Fri Jul 10 16:12:50 2015] new threads will be created with the stack of size 10240K
>
> If this is the case, the problem may be that the additional thread on the frame-builder (one per end-point) take up so many slots on the 8-core
> frame-builder that they interrupt the frame-writing thread, thus preventing the main buffer from being emptied.
>
> One could go back to a single end-point. This only helps keep restart of front-end A from hiccuping DAQ for front-end B.
>
> You would have to remove code on front-ends (/etc/init.d/mx_stream) that chooses endpoints. i.e.
> # find line number in rtsystab. Use that to mx_stream slot on card (0-15)
> line_num=`grep -v ^# /etc/rtsystab | grep --perl-regexp -n "^${hostname}\s" | sed 's/^\([0-9]*\):.*/\1/g'`
> line_off=$(expr $line_num - 1)
> epnum=$(expr $line_off % 2)
> cnum=$(expr $line_off / 2)
>
> start-stop-daemon --start --quiet -b -m --pidfile /var/log/mx_stream0.pid --exec /opt/rtcds/tst/x2/target/x2daqdc0/mx_stream -- -e 0 -r "$epnum" -W 0 -w 0 -s "$sys" -d x2daqdc0:$cnum -l /opt/rtcds/tst/x2/target/x2daqdc0/mx_stream_logs/$hostname.log
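For reference, the endpoint selection in the quoted mx_stream snippet is plain integer arithmetic on the host's line number in /etc/rtsystab. A sketch of the same mapping in Python; the function and parameter names are hypothetical, and the two-endpoints-per-card figure is just the divisor used in the quoted script:

```python
def mx_slot(line_num, endpoints_per_card=2):
    """Map a 1-based rtsystab line number to (card number, endpoint number)."""
    line_off = line_num - 1  # same as: line_off=$(expr $line_num - 1)
    cnum = line_off // endpoints_per_card  # same as: expr $line_off / 2
    epnum = line_off % endpoints_per_card  # same as: expr $line_off % 2
    return cnum, epnum


for n in range(1, 6):
    card, endpoint = mx_slot(n)
    print(f"rtsystab line {n}: card {card}, endpoint {endpoint}")
```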
As per Keith's suggestion, I modified the mx startup script to only initialize a single endpoint, and I modified the mx_stream startup to point them all to endpoint 0. I verified that daqd was indeed using a single MX end-point:
MX has 1 maximum end-points configured
It didn't help. After 5-10 minutes daqd crashes with the same "0 empty blocks" messages.
I should also mention that I'm pretty sure the start of these messages does not seem coincident with any frame writing to disk; further evidence that it's not a disk IO issue.
Keith is looking at the system now, so we'll see if he can spot anything obvious. If not, I will start reverting to 2.5. |
11417
|
Wed Jul 15 18:19:12 2015 |
Jamie | Summary | CDS | CDS upgrade: tentative stability? | Keith Thorne provided his eyes on the situation today and had some suggestions that might have helped things:
Reorder ini file list in master file. Apparently the EDCU.ini file (C0EDCU.ini in our case), which describes EPICS subscriptions to be recorded by the daq, now has to be specified *after* all other front end ini files. It's unclear why, but it has something to do with RTS 2.8 which changed all slow channels to be transported over the mx network. This alone did not fix the problem, though.
Increase second trend frame size. Interestingly, this might have been the key. The second trend frame size was increased to 600 seconds:
start trender 600 60;
The two numbers are the lengths in seconds for the second and minute trends respectively. They had been set to "60 60", but Keith suggested that longer second trend frames are better, for whatever reason. It seems he may be right, given that daqd has been running and writing full and trend frames for 1.5 hours now without issue.
As I'm writing this, though, daqd just crashed again. I note, though, that it's right after the hour, and immediately following writing out a one hour minute trend file. We've been seeing these hourly, on-the-hour crashes of daqd for quite a while now, so maybe this is nothing new. I've actually been wondering if the hourly daqd crashes were associated with writing out the minute trend frames, and I think we might have more evidence to point to that.
If increasing the size of the second trend frames from 60 seconds (35M) to 600 seconds (70M) made a difference in stability, could there be an issue with writing out files that are smaller than some value? The full frames are 60M, and the minute trends are 35M. |
11427
|
Sat Jul 18 15:37:19 2015 |
Jamie | Summary | CDS | CDS upgrade: current status | So it appears we have found a semi-stable configuration for the DAQ system post upgrade:

Here are the issues:
daqd
daqd is running mostly stably for the moment, although it still crashes at the top of every hour (see below). Here are some relevant points about the current configuration:
- recording data from only a subset of front-ends, to reduce the overall load:
- c1x01
- c1scx
- c1x02
- c1sus
- c1mcs
- c1pem
- c1x04
- c1lsc
- c1ass
- c1x05
- c1scy
- 16 second main buffer:
start main 16;
- trend lengths: second: 600, minute: 60
start trender 600 60;
- writing to frames:
- full
- second
- minute
- (NOT raw minute trends)
- frame compression ON
This eliminates most of the random daqd crashing. However, daqd still crashes at the top of every hour after writing out the minute trend frame. Still unclear what the issue is, but Keith is investigating. In some sense this is no worse than where we were before the upgrade, since daqd was also crashing hourly then. It's still crappy, though, so hopefully we'll figure something out.
The inittab on fb automatically restarts daqd after it crashes, and monit on all of the front ends automatically restarts the mx_stream processes.
front ends
The front end modules are mostly running fine.
One issue is that the execution times seem to have increased a bit, which is problematic for models that were already on the hairy edge. For instance, the rough average for c1sus has gone from ~48us to 50us. This is most problematic for c1cal, which is now running at ~66us out of 60, which is obviously untenable. We'll need to reduce the load in c1cal somehow.
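To put the execution times above in terms of the cycle budget, here is a back-of-the-envelope sketch. The 60 us budget is the round figure quoted above; the 16384 Hz model rate in the last line is an assumption, not something stated in this entry.

```python
def cycle_budget_us(rate_hz):
    """Time available per cycle, in microseconds, for a model running at rate_hz."""
    return 1e6 / rate_hz


def utilization(exec_time_us, budget_us):
    """Fraction of the per-cycle budget consumed by a model's execution time."""
    return exec_time_us / budget_us


# Execution times quoted above, against the quoted 60 us budget.
for model, exec_us in [("c1sus", 50.0), ("c1cal", 66.0)]:
    print(f"{model}: {utilization(exec_us, 60.0):.0%} of the cycle budget")

# Assumption: a 16384 Hz model rate, which would give a budget close to 61 us.
print(f"{cycle_budget_us(16384):.1f} us per cycle at 16384 Hz")
```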
All other front end models seem to be working fine, but a full test is still needed.
There was an issue with the DACs on c1sus, but I rebooted and everything came up fine, optics are now damped:

|
11455
|
Tue Jul 28 17:07:45 2015 |
Jamie | Update | General | Data missing |
Quote: |
For the past couple of days, the summary pages have shown minute trend data disappear at 12:00 UTC (05:00 AM local time). This seems to be the case for all channels that we plot, see e.g. https://nodus.ligo.caltech.edu:30889/detcharsummary/day/20150724/ioo/. Using Dataviewer, Koji has checked that indeed the frames seem to have disappeared from disk. The data come back at 24 UTC (5pm local). Any ideas why this might be?

|
Possible explanations:
- The data transfers to LDAS had been shut off while we were doing the DAQ debugging. I don't know if they have been turned back on. Unlikely this is the problem since you would probably see no data at all if this were the case.
- wiper script parameters might have been changed to store less of the trend data for some reason.
- Frame size is different and therefore wiper script parameters need to be adjusted.
- Steve deleted it all.
- ...
|
13125
|
Wed Jul 19 08:37:21 2017 |
Jamie | Update | CDS | Update on front-end/DAQ rebuild | After the catastrophic fb disk failure last week we lost essentially the entire front end system (not any of the userapp code, but the front end boot server, operating system, and DAQ). The fb disk was entirely unrecoverable, so we've been trying to rebuild everything from the bits and pieces lying around, and some disks that Keith Thorne sent from LLO. We're trying to get the front ends working first, and will work on recovering daqd after.
Luckily, fb1, which was being configured as an fb replacement, is mostly fully configured, including having a copy of the front end diskless root image. We set up fb1 as the new boot server, and were able to get front ends booting again. Unfortunately, we've been having trouble running and building models, so something is still amiss. We've been taking a three-pronged approach to getting the front ends running:
- /diskless/root.fb: This involves booting the front ends from the backup of the diskless root from fb. Runs gentoo kernel 2.6.34.1. This should correspond to the environment that all models were built and running against. But something is missing in the configuration. The front ends were also mounting /opt from fb, which included the dolphin drivers, and we don't have a copy of that, so models aren't loading or recompiling.
- /diskless/root.x1boot: Keith sent a disk image of the entire x1boot server from LLO. It uses gentoo kernel 3.0.8. This ostensibly includes everything we should need to run the front ends, but it's unfortunately configured with newer versions of some of the software and also isn't loading our existing models or building new ones. This also seems to be having issues with the dolphin drivers.
- /diskless/root.jessie: This is an entirely new boot image built from scratch with Debian jessie, using an RTS-patched 3.2 kernel. This would use the latest versions of everything. It's mostly working, we just need to rebuild the dolphin driver and source.
It seems that in all cases we need to rebuild the dolphin drivers from source. |
13127
|
Wed Jul 19 14:26:50 2017 |
Jamie | Update | CDS | Update on front-end/DAQ rebuild |
Quote: |
After the catastrophic fb disk failure last week we lost essentially the entire front end system (not any of the userapp code, but the front end boot server, operating system, and DAQ). The fb disk was entirely unrecoverable, so we've been trying to rebuild everything from the bits and pieces lying around, and some disks that Keith Thorne sent from LLO. We're trying to get the front ends working first, and will work on recovering daqd after.
Luckily, fb1, which was being configured as an fb replacement, is mostly fully configured, including having a copy of the front end diskless root image. We set up fb1 as the new boot server, and were able to get front ends booting again. Unfortunately, we've been having trouble running and building models, so something is still amiss. We've been taking a three-pronged approach to getting the front ends running:
- /diskless/root.fb: This involves booting the front ends from the backup of the diskless root from fb. Runs gentoo kernel 2.6.34.1. This should correspond to the environment that all models were built and running against. But something is missing in the configuration. The front ends were also mounting /opt from fb, which included the dolphin drivers, and we don't have a copy of that, so models aren't loading or recompiling.
- /diskless/root.x1boot: Keith sent a disk image of the entire x1boot server from LLO. It uses gentoo kernel 3.0.8. This ostensibly includes everything we should need to run the front ends, but it's unfortunately configured with newer versions of some of the software and also isn't loading our existing models or building new ones. This also seems to be having issues with the dolphin drivers.
- /diskless/root.jessie: This is an entirely new boot image built from scratch with Debian jessie, using an RTS-patched 3.2 kernel. This would use the latest versions of everything. It's mostly working, we just need to rebuild the dolphin driver and source.
It seems that in all cases we need to rebuild the dolphin drivers from source.
|
To clarify, we're able to boot the x1boot image with the existing 2.6.25 kernel that we have from fb. The issue with the root.x1boot image is not the kernel version but some of the other support libraries, such as dolphin. |
13130
|
Fri Jul 21 18:03:17 2017 |
Jamie | Update | CDS | Update on front-end/DAQ rebuild | Update:
- front ends booting with the new Debian jessie diskless root image and a linux 3.2 version of the RTS-patched kernel
- dolphin is configured correctly and running on c1lsc and c1sus
- models building and running with RCG 3.0.3
Up next:
- add c1ioo to the dolphin network
- recompile/restart all front end models
- daqd
I'll try to get the first two of those done tomorrow, although it's unclear what model updates we'll have to do to get things working with the newer RCG.
|
13132
|
Sun Jul 23 15:00:28 2017 |
Jamie | Omnistructure | VAC | strange sound around X arm vacuum pumps | While walking down to the X end to reset c1iscex I heard what I would call a "rhythmic squnching" sound coming from under the turbo pump. I would have said the sound was coming from a roughing pump, but none of them are on (as far as I can tell).
Steve maybe look into this?? |
13136
|
Mon Jul 24 10:59:08 2017 |
Jamie | Update | CDS | c1iscex models died |
Quote: |
This morning, all the c1iscex models were dead. Attachment #1 shows the state of the cds overview screen when I came in. The machine itself was ssh-able, so I just restarted all the models and they came back online without fuss.
|
This was me. I had rebooted that machine and hadn't restarted the models. Sorry for the confusion. |
13138
|
Mon Jul 24 19:28:55 2017 |
Jamie | Update | CDS | front end MX stream network working, glitches in c1ioo fixed | MX/OpenMX network running
Today I got the mx/open-mx networking working for the front ends. This required some tweaking to the network interface configuration for the diskless front ends, and recompiling mx and open-mx for the newer kernel. Again, this will all be documented.
controls@fb1:~ 0$ /opt/mx/bin/mx_info
MX Version: 1.2.16
MX Build: root@fb1:/opt/src/mx-1.2.16 Mon Jul 24 11:33:57 PDT 2017
1 Myrinet board installed.
The MX driver is configured to support a maximum of:
8 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
===================================================================
Instance #0: 364.4 MHz LANai, PCI-E x8, 2 MB SRAM, on NUMA node 0
Status: Running, P0: Link Up
Network: Ethernet 10G
MAC Address: 00:60:dd:43:74:62
Product code: 10G-PCIE-8B-S
Part number: 09-04228
Serial number: 485052
Mapper: 00:60:dd:43:74:62, version = 0x00000000, configured
Mapped hosts: 6
ROUTE COUNT
INDEX MAC ADDRESS HOST NAME P0
----- ----------- --------- ---
0) 00:60:dd:43:74:62 fb1:0 1,0
1) 00:30:48:be:11:5d c1iscex:0 1,0
2) 00:30:48:bf:69:4f c1lsc:0 1,0
3) 00:25:90:0d:75:bb c1sus:0 1,0
4) 00:30:48:d6:11:17 c1iscey:0 1,0
5) 00:14:4f:40:64:25 c1ioo:0 1,0
controls@fb1:~ 0$
c1ioo timing glitches fixed
I also checked the BIOS on c1ioo and found that the serial port was enabled, which is known to cause timing glitches. I turned off the serial port (and some power management stuff), and rebooted, and all the c1ioo timing glitches seem to have gone away.
It's unclear why this is a problem that's just showing up now. Serial ports have always been a problem, so it seems unlikely this is just a problem with the newer kernel. Could the BIOS have somehow been reset during the power glitch?
In any event, all the front ends are now booting cleanly, with all dolphin and mx networking coming up automatically, and all models running stably:

Now for daqd... |
|