ID | Date | Author | Type | Category | Subject
8656 | Thu May 30 11:28:34 2013 | Jamie | Configuration | CDS | c1als model cleanup

The c1als model was pulling out some ADC0 connections that were no longer used for anything:

  • ADC_0_1 --> sfm "FD" --> IPC "C1:ALS-SCX_FD"
  • ADC_0_5 --> sfm "OCX" --> term
  • ADC_0_6 --> sfm "ADC" --> term

The channels would have shown up as C1:ALS-FD, C1:ALS-OCX, C1:ALS-ADC.  The IPC connection that presumably was meant to go to c1scx is not connected on the other end.

I removed all this stuff from the model and rebuilt/restarted.
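
For reference, the rebuild/restart follows the standard RCG cycle; a sketch, using our usual build directory (the start script is the RCG-generated one):

cd /opt/rtcds/caltech/c1/rtbuild
make c1als
make install-c1als
# then restart the model on its front end machine:
startc1als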

8657 | Thu May 30 11:33:26 2013 | Jamie | Configuration | Computer Scripts / Programs | ASS medm/model changes need to be committed to SVN

There are a lot of changes to the ASS stuff that have not been committed to the SVN:

controls@rossa:/opt/rtcds/userapps/release/isc/c1 0$ svn status | grep -v '?'
M       medm/c1als/C1ALS_X_SLOW.adl
D       medm/c1ass/C1ASS_TRY_YAW_LOCKIN.adl
D       medm/c1ass/ASS_SERVOS.adl
D       medm/c1ass/ctrl_yaw_mtrx.adl
D       medm/c1ass/C1ASS_QPDS.adl
D       medm/c1ass/C1ASS_SEN_YAW_MTRX.adl
M       medm/c1ass/C1ASS_XARM_SEN_MTRX.adl
D       medm/c1ass/SITEMODEL_LOCKINNAME.adl
D       medm/c1ass/C1ASS_TRX_YAW_LOCKIN.adl
D       medm/c1ass/C1ASS_LOCKIN1.adl
D       medm/c1ass/C1ASS_LOCKIN2.adl
D       medm/c1ass/C1ASS_LOCKIN3.adl
D       medm/c1ass/C1ASS_LOCKIN4.adl
D       medm/c1ass/C1ASS_LOCKIN5.adl
D       medm/c1ass/C1ASS_LOCKIN6.adl
D       medm/c1ass/C1ASS_LOCKIN7.adl
D       medm/c1ass/C1ASS_LOCKIN8.adl
D       medm/c1ass/C1ASS_LOCKIN9.adl
D       medm/c1ass/C1ASS_REFL11I_PIT_LOCKIN.adl
M       medm/c1ass/C1ASS.adl
D       medm/c1ass/C1ASS_LOCKIN10.adl
D       medm/c1ass/C1ASS_LOCKIN11.adl
D       medm/c1ass/C1ASS_LOCKIN12.adl
D       medm/c1ass/C1ASS_LOCKIN13.adl
D       medm/c1ass/C1ASS_LOCKIN14.adl
D       medm/c1ass/C1ASS_LOCKIN15.adl
D       medm/c1ass/sen_yaw_mtrx.adl
D       medm/c1ass/C1ASS_LOCKIN16.adl
D       medm/c1ass/C1ASS_LOCKIN17.adl
D       medm/c1ass/C1ASS_DOF_YAW.adl
D       medm/c1ass/C1ASS_LOCKIN18.adl
D       medm/c1ass/C1ASS_LOCKIN19.adl
D       medm/c1ass/C1ASS_TRY_PIT_LOCKIN.adl
D       medm/c1ass/ctrl_pit_mtrx.adl
D       medm/c1ass/C1ASS_SEN_PIT_MTRX.adl
D       medm/c1ass/C1ASS_LOCKIN20.adl
D       medm/c1ass/C1ASS_LOCKIN21.adl
D       medm/c1ass/C1ASS_LOCKIN22.adl
D       medm/c1ass/C1ASS_LOCKIN23.adl
D       medm/c1ass/C1ASS_LOCKIN24.adl
D       medm/c1ass/C1ASS_LOCKIN25.adl
D       medm/c1ass/C1ASS_LOCKIN26.adl
D       medm/c1ass/C1ASS_LOCKIN27.adl
D       medm/c1ass/C1ASS_TRX_PIT_LOCKIN.adl
D       medm/c1ass/C1ASS_LOCKIN28.adl
D       medm/c1ass/C1ASS_LOCKIN29.adl
D       medm/c1ass/C1ASS_XARM_QPDS.adl
D       medm/c1ass/C1ASS_YARM_QPDS.adl
M       medm/c1ass/C1ASS_XARM_OUT_MTRX.adl
D       medm/c1ass/ASS_SEN_MTRX.adl
D       medm/c1ass/ASS_LOCKINS.adl
D       medm/c1ass/sen_pit_mtrx.adl
D       medm/c1ass/C1ASS_REFL11I_YAW_LOCKIN.adl
D       medm/c1ass/C1ASS_LOCKIN30.adl
D       medm/c1ass/C1ASS_DOF_PIT.adl
M       models/c1ass.mdl
controls@rossa:/opt/rtcds/userapps/release/isc/c1 0$
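
Whoever made these changes can commit from that same directory; a sketch (the log message is a placeholder, and the paths are the modified ones from the status output above):

controls@rossa:/opt/rtcds/userapps/release/isc/c1 0$ svn commit -m "<describe the ASS medm/model changes>" medm/c1als medm/c1ass models/c1ass.mdl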
8725 | Wed Jun 19 16:04:56 2013 | Jamie | Configuration | Computer Scripts / Programs | conlog startup fixed, and restarted

I cleaned up a bunch of conlog stuff to make it all a little more sane and simple.  I also fixed the messy startup shenanigans, so that it should now start up sanely and on its own (using Ubuntu's native upstart system).  The conlog wiki page was updated with all the new info.

8726 | Wed Jun 19 16:47:34 2013 | Jamie | Configuration | Computer Scripts / Programs | conlog startup fixed, and restarted

Quote:

I cleaned up a bunch of conlog stuff to make it all a little more sane and simple.  I also fixed the messy startup shenanigans, so that it should now start up sanely and on its own (using Ubuntu's native upstart system).  The conlog wiki page was updated with all the new info.

 By the way, I also did confirm that it is running and registering EPICS changes.

8868 | Thu Jul 18 10:47:21 2013 | Jamie | Update | LSC | PRMI+Y arm ALS success!

AWESOME!  You guys rock.

8919 | Wed Jul 24 19:21:56 2013 | Jamie | HowTo | SUS | SUS MEDM screen modernization

I started poking around at what we want for new SUS MEDM screens.  Rana and I decided we'd start with the ASC TIPTILT screens:

[Attachment: newsusmedm.png]

It's missing some things (like SIDE OSEMS) but it should provide a good starting point.

I copied the entire <userapps>/asc/common/medm/asctt directory to a new directory in our sus area:

controls@rossa:/opt/rtcds/userapps/release 0$ cp -a asc/common/medm/asctt sus/c1/medm/new

I then removed all the useless file name prefixes.  We still need to go through and sed out all the ASC stuff in the MEDM files themselves.

It makes heavy use of macro substitution, which is good (it's what we're using now).  So once we clean up all the channel names, we should just be able to swap out the pointers in our overview screens to the new screens (or rename things).  In the meantime, during development, you can run:

controls@rossa:/opt/rtcds/userapps/release 0$ medm -x -macro "IFO=C1,ifo=c1,OPTIC=ITMX" sus/c1/medm/new/OVERVIEW.adl 

8996 | Mon Aug 12 13:30:33 2013 | Jamie | Update | CDS | X-End Green ASS - Roundup

Quote:

I'm not really sure why the ASS was involved in this.  I feel like it might have been simpler to just do everything in the ASX model, to keep things cleaner.  Also, the IPC blocks for this stuff (in both ASS and ASX) are not on the top level of the model.  I had thought that this was expressly forbidden (although I'm not sure why).  I'm emailing Jamie, to see if he remembers what, if anything, is breakable if the IPC blocks are down a level.

I'm not sure if it's forbidden by the RCG, but you should definitely NOT do it.  All IO, whether between ADC/DACs or IPCs, should always be at the model top level.  That's what keeps things portable, and makes it easier to keep track of where all signals are going/coming from.

9074 | Tue Aug 27 19:34:36 2013 | Jamie | Configuration | CDS | front end IPC configuration

So the IPC situation on the front end network is not so great right now.  For various no-longer-valid reasons, c1lsc had no RFM card, all the IPC connections were routed through the c1rfm model on c1sus, and routed to c1lsc via dolphin PCIe as needed.  As things grew, c1rfm became overloaded.  Koji tried to fix the situation by breaking things out of c1rfm to make direct connections where we could.  This cleared up c1rfm a bit, but now c1mcs is overloading.

Reminder: PCIe (dolphin) is faster and higher bandwidth than RFM.  The more things we can put on PCIe the better.

Attached is a graph of my rough accounting of the intended direct IPC connections between the front ends.  By "intended direct" I mean what should be direct connections if we had all the appropriate hardware.  Right now the actual connection graph is more convoluted than this since things are passing through c1rfm.  I note this graph was NOT particularly easy to make, which is very unfortunate.  I had to manually look through every model and determine the ultimate source of every incoming IPC.  Kind of a pain in the butt.  It would be nice if there was a simple way to represent this.

Here are some various solutions to the problem as I see it:

a) put c1lsc on the RFM network

This would allow c1lsc to talk to c1ioo, c1iscex, and c1iscey without having to go through c1sus, thereby eliminating c1rfm altogether.  I'm not sure why we didn't just do this originally.

Requires:

  • One RFM card for c1lsc

b) put c1ioo on the PCIe network (and move c1sus's RFM card to c1lsc)

This is probably the most robust solution.

b1) There are roughly 8 IPCs going from c1ioo to c1sus, and 4 going the other way, and 3 IPCs from c1ioo to c1lsc.  If we put c1ioo on PCIe all of these now RFM connections would become direct PCIe connections, which would be a big win.

At this point only the end station front ends would be on RFM, and most of the connections to those come from c1lsc, so it would make sense to give c1lsc the RFM card, thereby eliminating a lot of stuff from c1rfm.

Requires:

  • dolphin card for c1ioo (do the old Sun machines support these?  if not, we could swap the old Sun machine for one of the spare aLIGO-approved Supermicro machines we have)
  • dolphin fibre to go to dolphin switch in 1X3 rack

b2) OR, we could move c1ioo to 1X4 with c1lsc and c1sus, and get a OneStop fibre cable to connect to its IO chassis.  We would still need a dolphin card, but we could use copper instead of fibre.  This is my preferred solution, since it moves c1ioo out of 1X1, where it's really in the way and making a lot of noise.  It would also be easier to manage all the machines if they're together in one rack.

Requires:

  • dolphin card for c1ioo
  • dolphin copper cable for c1ioo
  • OneStop fibre for c1ioo

c) put another cpu in c1sus

c1sus is (I believe) able to support another 6-core cpu.  If we added more cores to c1sus, we could break up c1rfm into c1rfm0, c1rfm1, etc.  This is a less elegant solution imho, but it would probably do the job.

Requires:

  • one new CPU for c1sus
Attachment 1: hosts.png
9075 | Tue Aug 27 19:50:06 2013 | Jamie | Configuration | Computer Scripts / Programs | cdsutils checked out into /opt/rtcds

I have checked out the new cdsutils repository at:

/opt/rtcds/cdsutils/release

This is a new repository that is intended to hold all of our python libraries and command-line utilities for interacting with the IFO, things like:

  • read/write EPICS channel values
  • interact with filter module switches
  • average a test point for some amount of time
  • etc.

Basically everything that used to be ez* or tds*.

There's not much in there at the moment, but hopefully it will start to get filled in soon.

WARNING:

The code in here will be used at the sites to interact with the real aLIGO IFOs.  Please be careful as you develop things in here, and do so conscientiously.  If you do bad things here and it messes things up at the sites, people will be angry.  Particularly me, since I have to support everything in here for Guardian use.

Usage

<cdsutils>/lib/cdsutils is the primary python library.  For each function you want to add, put it in a new file named after the function.  So for instance function "foo" should be in a file called <cdsutils>/lib/cdsutils/foo.py.

There is a command-line utility at <cdsutils>/bin/cdsutils.  It will automatically find anything you add to the library and expose it as a sub-command (e.g. "cdsutils foo").
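
As a concrete sketch, a hypothetical new function "foo" would live in its own file (shown here via cat; the function name and body are just illustrative):

controls@rossa:~ 0$ cat /opt/rtcds/cdsutils/release/lib/cdsutils/foo.py
def foo(channel):
    # hypothetical example: just echo back the channel name it was given
    print(channel)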

We'll try to put together a wiki page describing development and usage of this soon.

9088 | Thu Aug 29 17:25:50 2013 | Jamie | Update | SUS | SUS medm screen upgrade

Rana asked me to look at the SUS MEDM screen upgrade situation, and provide an upgrade prescription.  Unfortunately there's not really a simple prescription that can be used, since our configuration diverges quite a bit from what's at the sites.  But here's what I can say:

It looks like we already have the beginnings of an upgrade in place, so I say we just run with that.  The new screens are in:

/opt/rtcds/userapps/release/sus/c1/medm/new

The primary screen is:

/opt/rtcds/userapps/release/sus/c1/medm/new/OVERVIEW.adl

This seems to be a copy of the site ASC_TIPTILT screens.  (In fact I think I remember putting this here).  I went ahead and did some ground work to make it easier to get these new screens into place.

  • I cleaned up all the channel name prefixes so that at least the channel prefixes will resolve to our SUS channels.
  • I made a link from the sitemap with some of the correct macros to fill some things in appropriately: "IFO SUS/NEW ETMX"
  • I fixed the links to the sub-screens, so that they open the correct sub-screens (although the macros don't seem to be passed correctly)

At this point someone needs to just go through and fix all the channel names to match ours, and tweak the screen to our needs (there's no side OSEM, for instance).  The subscreens need to be cleaned up as well.

sed replace string

If there is a specific string you want to replace every instance of in the screen, you can do that easily from the command line like this:

sed -i 's/OLD/NEW/g' path/to/file

This will replace every instance of the string OLD with the string NEW in the file path/to/file.  Be careful: this will replace EVERY instance of OLD.  If OLD matches things you don't want, those will be replaced as well.

The search string is actually a regular expression, so if you want to get fancy you can match against more complicated strings.  But just be careful.

If you leave out the "-i" the string-replaced text will go to stdout, instead of being replaced in the file "in place", so you can check it first.
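
For example, to preview and then apply a prefix swap on the new overview screen (the strings here are just illustrative):

controls@rossa:/opt/rtcds/userapps/release/sus/c1/medm/new 0$ sed 's/ASC/SUS/g' OVERVIEW.adl | less
controls@rossa:/opt/rtcds/userapps/release/sus/c1/medm/new 0$ sed -i 's/ASC/SUS/g' OVERVIEW.adl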

query replace in emacs

If you want more fine-grained control of text replace, so that you can see what's being replaced, try using "query-replace" in emacs:

M-x query-replace

You can then type in the original string, followed by the replacement string.  When it starts to run it will highlight the string that will be replaced.  Hit "space" to accept or "n" to skip and go to the next.

9132 | Mon Sep 16 15:29:50 2013 | Jamie | Configuration | Computer Scripts / Programs | cdsutils checked out into /opt/rtcds

We now have a proper install of cdsutils:

 controls@rossa:~ 0$ cdsutils
 usage: cdsutils <cmd> <args>

 Advanced LIGO Control Room Utilites

 Available commands:

   read         read EPICS channel value
   write        write EPICS channel value
   switch       switch buttons in standard LIGO filter module
   avg          average NDS channels for some amount of time
   servo        simple integrator (pole at zero)

 Add '-h' after individual commands for command help.
 controls@rossa:~ 0$ 

It is installed in /ligo/apps/cdsutils, and should be in the path on all workstations.

The "development" source working directory is currently checked out at /opt/rtcds/cdsutils/trunk.

9138 | Wed Sep 18 11:52:53 2013 | Jamie | Update | CDS | Dataviewer cannot connect to fb

Quote:

Masayuki pointed out that dataviewer wasn't connecting to the fb this morning.

When I started dataviewer from the terminal I obtained the following error:

controls@pianosa:~ 0$ dataviewer
Can't find hostname `fb:8088'
Can't find hostname `fb:8088'; gethostbyname(); error=1
Warning: Not all children have same parent in XtManageChildren
Warning: Not all children have same parent in XtManageChildren
Warning: Not all children have same parent in XtManageChildren
Warning: Not all children have same parent in XtManageChildren
Warning: Not all children have same parent in XtManageChildren
Error in obtaining chan info.
Can't find hostname `fb:8088'
Can't find hostname `fb:8088'; gethostbyname(); error=1

I checked the CDS FE status screen and it looks normal. I could ping the fb and ssh to it as well.

I restarted fb to see if it made any difference.

telnet fb 8088

It hasn't helped. Anything else that can be done??

I've fixed the problem.  This was due to a change I made in the NDSSERVER environment variable so that it would work with cdsutils.  I didn't realize there was an incompatibility with how dataviewer parses NDSSERVER.  Joe and I will have to figure it out.

In the meantime I've changed things back so that dataviewer should now work as expected.  You might have to log out and back in for it to work (or at least open a new terminal).

9309 | Tue Oct 29 18:14:52 2013 | Jamie | Configuration | Computer Scripts / Programs | fixing python-matplotlib from LSCSOFT

Jenne just discovered an issue with the python-matplotlib package that I knew was coming but forgot about.

We pull packages from the LSCSOFT Debian "squeeze" archive, which is a convenient way for us to install LIGO data analysis software.  There are no LSCSOFT archives for Ubuntu, and Debian "squeeze" is the closest supported distribution to Ubuntu 10.04 "lucid", which is what we are using.

DASWG recently added python-matplotlib to the LSCSOFT squeeze archive.  The version they added (1.0.1-3) supersedes the version in lucid, so by default apt wants to install it.  However, the LSCSOFT version is compiled against newer versions of some standard libraries, so it doesn't function on our system (it seg faults).

The solution (a solution) is to use apt "pinning" to pin the package to the lucid version that works.  I've added the following file on all the 10.04 workstations to prevent the package from upgrading to the LSCSOFT version:

controls@pianosa:~ 0$ cat /etc/apt/preferences.d/pin_python-matplotlib
Package: python-matplotlib
Pin: release a=lucid
Pin-Priority: 1000

9310 | Tue Oct 29 18:54:36 2013 | Jamie | Configuration | Computer Scripts / Programs | fixing python-matplotlib from LSCSOFT

Quote:
controls@pianosa:~ 0$ cat /etc/apt/preferences.d/pin_python-matplotlib
Package: python-matplotlib
Pin: release a=lucid
Pin-Priority: 1000 

I forgot that there were a couple of different matplotlib packages that all needed to be pinned.  To be safe I decided to just pin all packages to the lucid versions.  This will still allow us to install lscsoft packages that are not in Ubuntu, but it will always prefer packages from lucid.  Here's the new pinning file:

controls@pianosa:~ 0$ cat /etc/apt/preferences.d/pinning 
Package: *
Pin: release a=lucid
Pin-Priority: 1000
controls@pianosa:~ 0$ 
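
To verify that the pinning is winning, apt-cache policy should now report the lucid version as the install candidate, with priority 1000:

controls@pianosa:~ 0$ apt-cache policy python-matplotlib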

9423 | Fri Nov 22 14:21:43 2013 | Jamie | Update | Computer Scripts / Programs | DAQ?

Quote:

Jamie, I think the computers know that you are away. c1lsc keeps going down.

The short time plots are correct.

Is there some indication from the attached image that there is a problem with c1lsc?  I see some drop outs in the channels you're plotting, but those are not c1lsc channels.

The channels with the dropouts are, I think, derived channels, as opposed to ones that are generated on the front end.  Therefore they could have been affected by the c1auxey outages from earlier in the week.

9426 | Mon Nov 25 12:57:54 2013 | Jamie | Update | CDS | timing problem at c1iscex IO chassis

There is definitely a timing distribution malfunction at the c1iscex IO chassis.  There is no timing link between the "Master Timer Sequencer D050239" at the 1X6 and the c1iscex IO chassis.  Link lights at both ends are dead.  No timing, no running models.

It does not appear to be a problem with the Master Timer Sequencer.  I moved the c1iscey link to the J15 port on the sequencer and it worked fine.  This means it's either a problem with the fiber or the timing card in the IO chassis.  The IO timing card is powered and does have what appear to be normal status lights on (except for the fiber link lights).  It's getting what I think is the nominal 4V power.  The connections to the IO chassis backplane board look ok.  So maybe it's just a dead fiber issue?

I do not know what could have been the problem with c1auxex, or if it's related to the fast timing issue.

9433 | Mon Dec 2 16:04:47 2013 | Jamie | Update | CDS | c1iscex timing problem mysteriously disappears??? (thanksgiving miracle???)

Quote:

There is definitely a timing distribution malfunction at the c1iscex IO chassis.  There is no timing link between the "Master Timer Sequencer D050239" at the 1X6 and the c1iscex IO chassis.  Link lights at both ends are dead.  No timing, no running models.

It does not appear to be a problem with the Master Timer Sequencer.  I moved the c1iscey link to the J15 port on the sequencer and it worked fine.  This means it's either a problem with the fiber or the timing card in the IO chassis.  The IO timing card is powered and does have what appear to be normal status lights on (except for the fiber link lights).  It's getting what I think is the nominal 4V power.  The connections to the IO chassis backplane board look ok.  So maybe it's just a dead fiber issue?

I do not know what could have been the problem with c1auxex, or if it's related to the fast timing issue.

I just got over here from Downs, where I managed to convince Todd to let me borrow one of their three remaining timing slave boards for c1iscex.  I walked down to the X end to replace the board only to discover that the link light on the existing timing board was back!  c1iscex was not responding, so I hard rebooted the machine, and everything came up rosy (all green!):

[Attachment: festatus.png]

To repeat, I DID NOTHING.  The thing was working when I got here.  I have no idea when it came back, or how, but it's at least working for the moment.  I re-enabled the watchdog for ETMX SUS and it's now damped normally.

I'm going to hold on to the timing card for a couple of days, in case the failure comes back, but we'll need to return it to Downs soon, and probably think about getting some spare backups from Columbia.

9502 | Fri Dec 20 10:08:43 2013 | Jamie | Configuration | General | netgpibdata is working again now

Quote:

Now netgpibdata is working again.

Usage:

cd /cvs/cds/rtcds/caltech/c1/scripts/general/netgpibdata   
./netgpibdata -i 192.168.113.108 -d AG4395A -a 10 -f meas01
./netgpibdata -i 192.168.113.105 -d SR785 -a 6 -f meas01   

Just wanted to point out that the correct "modern" path to this stuff is:

/opt/rtcds/caltech/c1/scripts/general/netgpibdata

This is, of course, the same directory, but under the correct "/opt/rtcds", instead of the old, incorrect "/cvs/cds".

9503 | Fri Dec 20 11:40:13 2013 | Jamie | Summary | CDS | RCG parsing bug?

I submitted a bug report for this:

https://bugzilla.ligo-wa.caltech.edu/bugzilla3/show_bug.cgi?id=553

However, given how old our RCG version is (2.5 vs. 2.8 currently deployed at the sites) I don't think we're going to see any traction on this.  Even if this is still a bug in 2.8, they'll only fix it in 2.8.  There's no way they're going to make a bug fix release for 2.5.

We need to upgrade.

9513 | Thu Jan 2 10:15:20 2014 | Jamie | Summary | General | linux1 RAID crash & recovery

Well done Koji!  I'm very impressed with the sysadmin skillz.

9536 | Tue Jan 7 23:53:35 2014 | Jamie | Update | CDS | daqd can't connect to c1vac1, c1vac2

daqd is logging the following error messages to its log, related to the fact that it can't connect to c1vac1 and c1vac2:

CAC: Unable to connect because "Connection timed out"
CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "c1vac2.martian:5064"
    Source File: ../cac.cpp line 1127
    Current Time: Tue Jan 07 2014 23:50:53.355609430
..................................................................
CAC: Unable to connect because "Connection timed out"
CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "c1vac1.martian:5064"
    Source File: ../cac.cpp line 1127
    Current Time: Tue Jan 07 2014 23:50:53.356568469
..................................................................

 Not sure if this is related to the full /frames issue that we've been seeing.

9574 | Fri Jan 24 13:10:12 2014 | Jamie | HowTo | LSC | Procedure to measure PRC length

Quote:

I wrote a MATLAB script that takes as input the measured distances and produces the optical path lengths.  The script also produces a drawing of the setup as reconstructed, showing the measurement points, the mirrors, the reference base plates, and the beam path.  Here is an example output that can be used to understand which five distances need to be measured.  I used dummy measured distances to produce it.

[Attachment: map.pdf]

This path does not look correct to me.  Maybe it's because this is supposed to represent "optical path lengths" as opposed to actual physical location of optics, but I think locations should be checked.  For instance, PRM looks like it's floating in mid-air between the BS and ITMX chambers, and PR2 is not located behind ITMX.  Actually, come to think of it, it might just be that ITMX (or the ITMs in general) is in the wrong place?

Here is a similar diagram I produced when building a Finesse model of the 40m, based on the CAD drawing that Manasa is maintaining:

[Attachment: path.pdf]

9966 | Fri May 16 20:55:18 2014 | Jamie | Frogs | lore | un-full-screening Ubuntu windows with F11

Last week Rana and I struggled to figure out how to un-full-screen windows on the Ubuntu workstations that appeared to be stuck in some sort of full screen mode such that the "Titlebar" was not on the screen.  Nothing seemed to work.  We were in despair.

Well, there is now hope: it appears that this really is a "fullscreen" mode that can be activated by hitting F11.  It can therefore easily be undone by hitting F11 again.

10018 | Tue Jun 10 09:25:29 2014 | Jamie | Update | CDS | Computer status: should not be changing names

I really think it's a bad idea to be making all these names changes.  You're making things much much harder for yourselves.

Instead of repointing everything to a new host, you should have just changed the DNS to point the name "linux1" to the IP address of the new server.  That way you wouldn't need to reconfigure all of the clients.  That's the whole point of a name service: use a name so that you don't need to point to a number.

Also, pointing to an IP address for this stuff is not a good idea.  If the IP address of the server changes, everything will break again.

Just point everything to linux1, and make the DNS entries for linux1 point to the IP address of chiara.  You're doing all this work for nothing!
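
To be explicit, this is a one-line change in the martian zone file, in BIND syntax (the address is a placeholder for chiara's actual IP):

linux1    IN  A    <chiara's IP address>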

RXA: Of course, I understand what DNS means. I wanted to make the changes to the startup to remove any misconfigurations or spaghetti mount situations (of which we found many). The way the VME162 are designed, changing the name doesn't make the fix - it uses the number instead. And, of course, the main issue was not the DNS, but just that we had to setup RSH on the new machine. This is all detailed in the ELOG entries we've made, but it might be difficult to understand remotely if you are not familiar with the 40m CDS system.

10033 | Thu Jun 12 15:31:47 2014 | Jamie | Update | CDS | Note on cables for talking to slow computers

Quote:

We have (now) in the lab 2 cables that are RJ45-DB9.  The gray one is LIGO-made, while the blue one is store-bought.  

The gray LIGO-made one works, but the blue store-bought one does not.  I checked their pinouts, and they are completely different.  In the sketch below, the connectors are drawn as I look at them face-on, with the cables going out the back of the page.  The DB9 is female.

 There are RJ45-DB9 adapters in the big spinny rack next to the linux1 rack that are for this exact purpose.  Just use a standard ethernet cable with them.

10040 | Sun Jun 15 14:26:30 2014 | Jamie | Omnistructure | CDS | cdsutils re-installed

Quote:

 CDSUTILS is also gone from the path on all the workstations, so we need Jamie to tell us by ELOG how to set it up, or else we have to use ezcaread / ezcawrite forever.

It's in the elog already: http://nodus.ligo.caltech.edu:8080/40m/9922

But it seems like things still haven't fully recovered, or have recovered to an old state?  Why is the cdsutils install I previously did in /ligo/apps now missing?  It seems like other directories are missing as well.

There's also a user:group issue with the /home/cds mounts.  Everything in those mount points is owned by nobody:nogroup.

I also can't log into pianosa and rosalba.

10041 | Sun Jun 15 14:41:08 2014 | Jamie | Omnistructure | CDS | cdsutils re-installed

Quote:

Quote:

 CDSUTILS is also gone from the path on all the workstations, so we need Jamie to tell us by ELOG how to set it up, or else we have to use ezcaread / ezcawrite forever.

It's in the elog already: http://nodus.ligo.caltech.edu:8080/40m/9922

But it seems like things still haven't fully recovered, or have recovered to an old state?  Why is the cdsutils install I previously did in /ligo/apps now missing?  It seems like other directories are missing as well.

There's also a user:group issue with the /home/cds mounts.  Everything in those mount points is owned by nobody:nogroup.

I also can't log into pianosa and rosalba.

 I also still think it's a bad idea for everything to be mounting /home/cds from an IP address.  Just make a new DNS entry for linux1 and leave everything as it was.

10190 | Sun Jul 13 11:37:36 2014 | Jamie | Update | Electronics | New Prologix GPIB-Ethernet controller

Quote:

I have configured a NEW Prologix GPIB-Ethernet controller to use with HP8591E Spectrum analyzer that sits right next to the control room computers.

Static IP: 192.168.113.109

Mask: 255.255.255.0

Gateway: 192.168.113.2

I have no clue how to give it a name like "something.martian" and to update the martian host table (Somebody please help!!) 

The instructions for adding a name to the martian DNS table are in the wiki page that I pointed you to:

https://wiki-40m.ligo.caltech.edu/Martian_Host_Table

10276 | Sat Jul 26 13:38:34 2014 | Jamie | Update | General | Data Acquisition from FC into EPICS Channels

Quote:

 I succeeded in creating a new channel access server hosted on domenica (R Pi) for continuous data acquisition from the FC into accessible channels.  For this I have written a ctypes interface between EPICS and the C interface code to write data into the channels.  The channels which I created are:

C1:ALS-X-BEAT-NOTE-FREQ

C1:ALS-Y-BEAT-NOTE-FREQ

The scripts I have written for this can be found in:

db script in:     /users/akhil/fcreadoutIoc/fcreadoutApp/Db/fcreadout.db

 Python code:  /users/akhil/fcreadoutIoc/pycall

C code:          /users/akhil/fcreadoutIoc/FCinterfaceCcode.c

I will give the standard channel names(similar to the names on the channel root)once the testing is completed and confirm that data from FC is consistent with the C code readout. Once ready I will run the code forever so that both the server and data acquisition are in process always.

Yesterday, when I set out to test the channel, I faced a few serious issues in booting the Raspberry Pi.  However, I have backed up the files on the Pi and will try to debug the issue very soon (I will test with Eric Q's R Pi).

To run these codes one must be root (sudo python pycall, sudo ./FCinterfaceCcode) because the HID devices can be written to only by root (should look into solving this issue).

Instructions for installation of EPICS, and how to create a channel server on the Pi, will be described in detail in the 40m Wiki (FOLL page).

controls@rossa|~ 2> ls /users/akhil/fcreadoutIoc
ls: cannot access /users/akhil/fcreadoutIoc: No such file or directory
controls@rossa|~ 2> 

This code should be in the 40m SVN somewhere, not just stored on the RPi.

I'm still confused why python is in the mix here at all.  It doesn't make any sense at all that a C program (EPICS IOC) would be calling out to a python program (pycall) that then calls out to a C program (FCinterfaceCcode).  That's bad programming.  Streamline the program and get rid of python.

You also definitely need to fix whatever the issue is that requires running the program as root.  We can't have programs like this run as root.
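
On the root issue: the standard fix is a udev rule that grants non-root access to the HID device; a sketch (the file name and the blanket hidraw match are assumptions, and matching on the FC's specific vendor/product IDs would be better):

controls@domenica ~ 0$ cat /etc/udev/rules.d/99-fcreadout.rules
# hypothetical rule: make hidraw devices accessible without root
SUBSYSTEM=="hidraw", MODE="0666"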

11384 | Tue Jun 30 11:33:00 2015 | Jamie | Summary | CDS | prepping for CDS upgrade

This is going to be a big one.  We're at version 2.5 and we're going to go to 2.9.3.

RCG components that need to be updated:

  • mbuf kernel module
  • mx_stream driver
  • iniChk.pl script
  • daqd
  • nds

Supporting software:

  • EPICS 3.14.12.2_long
  • ldas-tools (framecpp) 1.19.32-p1
  • libframe 8.17.2
  • gds 2.16.3.2
  • fftw 3.3.2

Things to watch out for:

  • RTS 2.6:
    • raw minute trend frame location has changed (CRC-based subdirectory)
    • new kernel patch
  • RTS 2.7:
    • supports "commissioning frames", which we will probably not utilize.  need to make sure that we're not writing extra frames somewhere
  • RTS 2.8:
    • "slow" (EPICS) data from the front-end processes is acquired via DAQ network, and not through EPICS.  This will increase traffic on the DAQ lan.  Hopefully this will not be an issue, and the existing network infrastructure can handle it, but it should be monitored.
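
A simple way to keep an eye on that DAQ network traffic would be dstat's network columns on fb, e.g.:

controls@fb ~ 0$ dstat -n 1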
11390 | Wed Jul 1 19:16:21 2015 | Jamie | Summary | CDS | CDS upgrade in progress

The CDS upgrade is now underway

Here's what's happened so far:

  • Installed and linked in all the RTS supporting software packages in /opt/rtapps (only on front end machines and fb):
    controls@c1lsc ~ 2$ find /opt/rtapps/ -mindepth 1 -maxdepth 1 -type l -ls
    12582916    0 lrwxrwxrwx   1 controls 1001           12 Jul  1 13:16 /opt/rtapps/gds -> gds-2.16.3.2
    12603452    0 lrwxrwxrwx   1 controls 1001           10 Jul  1 13:17 /opt/rtapps/fftw -> fftw-3.3.2
    12603451    0 lrwxrwxrwx   1 controls 1001           15 Jul  1 13:16 /opt/rtapps/libframe -> libframe-8.17.2
    12603450    0 lrwxrwxrwx   1 controls 1001           13 Jul  1 13:16 /opt/rtapps/libmetaio -> libmetaio-8.2
    12582915    0 lrwxrwxrwx   1 controls 1001           34 Jul  1 15:24 /opt/rtapps/framecpp -> ldas-tools-1.19.32-p1/linux-x86_64
    12582914    0 lrwxrwxrwx   1 controls 1001           20 Jul  1 13:15 /opt/rtapps/epics -> epics-3.14.12.2_long
  • Checked out the RTS source for the version we'll be using: 2.9.4

/opt/rtcds/rtscore/tags/advLigoRTS-2.9.4

  • built and installed all of the RTS components:
    • mbuf
    • mx_stream
    • daqd
    • nds
    • awgtpman
       
  • mx_stream is not working. Unknown why. It won't start on the front end machines (only tested on c1lsc so far) with the following error:
    controls@c1lsc ~ 1$ /opt/rtcds/caltech/c1/target/fb/mx_stream -s c1x04 c1lsc c1ass c1oaf c1cal -d fb:0
    mmapped address is 0x7ff7b71a0000
    send len = 263596
    mx_connect failed Remote Endpoint is Closed
    controls@c1lsc ~ 1$
    
    Have contacted Keith T. and Rolf B. for backup.  This is a blocker, since this is what ferries the data from the front ends.
     
  • Rebuilt almost all models.  This was good.  Initially nothing would compile because of IPC creation errors, so I moved the old chans/ipc/C1.ipc file out of the way and generated a new one and then everything compiled (of course senders have to be compiled before receivers).
    I only had to fix a couple of things in the models themselves:
    • c1ioo - unterminated FiltCtrl inputs
    • C1_SUS_SINGLE_CONTROL - unterminated FiltCtrl inputs
    • c1oaf - bad part named "STATIC". There is some hacky namespace stuff going on in the RCG. I was able to just explode that part and it now works.
    • c1lsc - unterminated FiltCtrl inputs
    Haven't installed or tried to run anything yet, but the fact they compile is good.
    Some models are not compiling because they have C code in src blocks that are throwing errors:
    • c1lsc
    • c1cal
    It shouldn't be too hard to fix whatever is causing those compile errors.

That's it for today.  Will pick up again first thing tomorrow.

11393 | Tue Jul 7 18:27:54 2015 | Jamie | Summary | CDS | CDS upgrade: progress!

After a couple of days of struggle, I made some progress on the CDS upgrade today:

Front end status:

  • RTS upgraded to 2.9.4, and linked in as "release":

/opt/rtcds/rtscore/release -> tags/advLigoRTS-2.9.4

  • mbuf kernel module built installed
  • All front ends have been rebooted with the latest patched kernel (from 2.6 upgrade)
  • All models have been rebuilt, installed, restarted.  Only minor model issues had to be corrected (unterminated unused inputs mostly).
  • awgtpman rebuilt, and installed/running on all front-ends
  • open-mx upgraded to 1.5.2:

/opt/open-mx -> open-mx-1.5.2

  • All front ends running latest version of mx_stream, built against 2.9.4 and open-mx-1.5.2.

We have new GDS overview screens for the front end models:

It's possible that our current lack of IRIG-B GPS distribution means that the 'TIM' status bit will always be red on the IOP models.  Will consult with Rolf.

There are other new features in the front ends that I can get into later.

DAQ (fb) status:

  • daqd and nds rebuilt against 2.9.4, both now running on fb

40m daqd compile flags:

cd src/daqd
make clean
./configure --enable-debug --disable-broadcast --without-myrinet --with-mx --enable-local-timing --with-epics=/opt/rtapps/epics/base --with-framecpp=/opt/rtapps/framecpp
make
install daqd /opt/rtcds/caltech/c1/target/fb/

However, daqd has unfortunately been very unstable, and I've been trying to figure out why.  I originally thought it was some sort of timing issue, but now I'm not so sure.

I had to make the following changes to the daqdrc:

set gps_leaps = 820108813 914803214 1119744016;

That enumerates some list of leap seconds since some time.  Not sure if that actually does anything, but I added the latest leap seconds anyway:

set symm_gps_offset=315964803;

This updates the silly, arbitrary GPS offset, that is required to be correct when not using external GPS reference.

Finally, the last thing I did that finally got it running stably was to turn off all trend frame writing:

# start trender;
# start trend-frame-saver;
# sync trend-frame-saver;
# start minute-trend-frame-saver;
# sync minute-trend-frame-saver;
# start raw_minute_trend_saver;

For whatever reason, it's the trend frame writing that was causing daqd to fall over after a short amount of time.  I'll continue investigating tomorrow.

We still have a lot of cleanup, burt restores, testing, etc. to do, but we're getting there.

11396 | Wed Jul 8 20:37:02 2015 | Jamie | Summary | CDS | CDS upgrade: one step forward, two steps back

After determining yesterday that all the daqd issues were coming from the frame writing, I started to dig into it more today.  I also spoke to Keith Thorne, and got some good suggestions from Gerrit Kuhn at GEO.

I realized that it probably wasn't the trend writing per se, but that turning on more writing to disk was causing increased load on daqd, and consequently on the system itself.  With more frame writing turned on, the memory consumption increased to the point of maxing out the physical RAM.  The system then probably started swapping, which certainly would have choked daqd.

I noticed that fb only had 4G of RAM, which Keith suggested was just not enough.  Even if the memory consumption of daqd has increased significantly, it still seems like 4G would not be enough.  I opened up fb only to find that fb actually had 8G of RAM installed!  Not sure what happened to the other 4G, but somehow it was not visible to the system.  Koji and I eventually determined, via some frankenstein operations with megatron, that the RAM was just dead.  We then pulled 4G of RAM from megatron and replaced the bad RAM in fb, so that fb now has a full 8G of RAM.

Unfortunately, when we got fb fully back up and running we found that fb is not able to see any of the other hosts on the data concentrator network.  mx_info, which displays the card and network status for the Myricom Myrinet fiber card, shows:

MX Version: 1.2.16
MX Build: controls@fb:/opt/src/mx-1.2.16 Tue May 21 10:58:40 PDT 2013
1 Myrinet board installed.
The MX driver is configured to support a maximum of:
    8 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
===================================================================
Instance #0:  299.8 MHz LANai, PCI-E x8, 2 MB SRAM, on NUMA node 0
    Status:        Running, P0: Wrong Network
    Network:    Myrinet 10G

    MAC Address:    00:60:dd:46:ea:ec
    Product code:    10G-PCIE-8AL-S
    Part number:    09-03916
    Serial number:    352143
    Mapper:        00:60:dd:46:ea:ec, version = 0x63e745ee, configured
    Mapped hosts:    1

                                                        ROUTE COUNT
INDEX    MAC ADDRESS     HOST NAME                        P0
-----    -----------     ---------                        ---
   0) 00:60:dd:46:ea:ec fb:0                            D 0,0

Note that all front end machines should be listed in the table at the bottom, and they're not.   Also note the "Wrong Network" note in the Status line above.  It appears that the card has maybe been initialized in a bad state?  Or Koji and I somehow disturbed the network when we were cleaning up things in the rack.  "sudo /etc/init.d/mx restart" on fb doesn't solve the problem.  We even rebooted fb and it didn't seem to help.

In any event, we're back to no data flow.  I'll pick up again tomorrow.

11397 | Wed Jul 8 21:02:02 2015 | Jamie | Summary | CDS | CDS upgrade: another step forward, so we're back to where we started (plus a bit?)

Koji did a bit of googling to determine that the 'Wrong Network' status message could be explained by the fb Myrinet card operating in the wrong mode:
(This was the useful link to track down the issue (KA))
 

    Network:    Myrinet 10G

I didn't notice it before, but we should in fact be operating in "Ethernet" mode, since that's the fabric we're using for the DC network.  Digging a bit deeper we found that the new version of mx (1.2.16) had indeed been configured with a different compile option than the 1.2.15 version had:

controls@fb ~ 0$ grep '$ ./configure' /opt/src/mx-1.2.15/config.log          
  $ ./configure --enable-ether-mode --prefix=/opt/mx
controls@fb ~ 0$ grep '$ ./configure' /opt/src/mx-1.2.16/config.log
  $ ./configure --enable-mx-wire --prefix=/opt/mx-1.2.16
controls@fb ~ 0$

So that would entirely explain the problem.  I re-linked mx to the older version (1.2.15), reloaded the mx drivers, and everything showed up correctly:

controls@fb ~ 0$ /opt/mx/bin/mx_info
MX Version: 1.2.12
MX Build: root@fb:/root/mx-1.2.12 Mon Nov  1 13:34:38 PDT 2010
1 Myrinet board installed.
The MX driver is configured to support a maximum of:
    8 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
===================================================================
Instance #0:  299.8 MHz LANai, PCI-E x8, 2 MB SRAM, on NUMA node 0
    Status:        Running, P0: Link Up
    Network:    Ethernet 10G

    MAC Address:    00:60:dd:46:ea:ec
    Product code:    10G-PCIE-8AL-S
    Part number:    09-03916
    Serial number:    352143
    Mapper:        00:60:dd:46:ea:ec, version = 0x00000000, configured
    Mapped hosts:    6

                                                        ROUTE COUNT
INDEX    MAC ADDRESS     HOST NAME                        P0
-----    -----------     ---------                        ---
   0) 00:60:dd:46:ea:ec fb:0                              1,0
   1) 00:25:90:0d:75:bb c1sus:0                           1,0
   2) 00:30:48:be:11:5d c1iscex:0                         1,0
   3) 00:30:48:d6:11:17 c1iscey:0                         1,0
   4) 00:30:48:bf:69:4f c1lsc:0                           1,0
   5) 00:14:4f:40:64:25 c1ioo:0                           1,0
controls@fb ~ 0$

The front end hosts are also showing good omx info (as they had been previously):

controls@c1lsc ~ 0$ /opt/open-mx/bin/omx_info
Open-MX version 1.5.2
 build: controls@fb:/opt/src/open-mx-1.5.2 Tue May 21 11:03:54 PDT 2013

Found 1 boards (32 max) supporting 32 endpoints each:
 c1lsc:0 (board #0 name eth1 addr 00:30:48:bf:69:4f)
   managed by driver 'igb'

Peer table is ready, mapper is 00:30:48:d6:11:17
================================================
  0) 00:30:48:bf:69:4f c1lsc:0
  1) 00:60:dd:46:ea:ec fb:0
  2) 00:25:90:0d:75:bb c1sus:0
  3) 00:30:48:be:11:5d c1iscex:0
  4) 00:30:48:d6:11:17 c1iscey:0
  5) 00:14:4f:40:64:25 c1ioo:0
controls@c1lsc ~ 0$

This got all the mx_stream connections back up and running.

Unfortunately, daqd is back to being a bit flaky.  With all frame writing enabled we saw daqd crash again.  I then shut off all trend frame writing and we're back to a marginally stable state: we have data flowing from all front ends, and full frames are being written, but not trends.

I'll pick up on this again tomorrow, and maybe try to rebuild the new version of mx with the proper flags.

11398 | Thu Jul 9 13:26:47 2015 | Jamie | Summary | CDS | CDS upgrade: new mx 1.2.16 installed

I rebuilt/installed mx 1.2.16 to use "ether-mode", instead of the default MX-10G:

controls@fb /opt/src/mx-1.2.16 0$ ./configure --enable-ether-mode --prefix=/opt/mx-1.2.16
...
controls@fb /opt/src/mx-1.2.16 0$ make
..
controls@fb /opt/src/mx-1.2.16 0$ make install
...

I then rebuilt/installed daqd so that it properly linked against the updated mx install:

controls@fb /opt/rtcds/rtscore/release/src/daqd 0$ ./configure --enable-debug --disable-broadcast --without-myrinet --with-mx --with-epics=/opt/rtapps/epics/base --with-framecpp=/opt/rtapps/framecpp --enable-local-timing
...
controls@fb /opt/rtcds/rtscore/release/src/daqd 0$ make
...
controls@fb /opt/rtcds/rtscore/release/src/daqd 0$ install daqd /opt/rtcds/caltech/c1/target/fb/

It's now back to running and receiving data from the front ends (still not stable yet, though).

11400 | Thu Jul 9 16:50:13 2015 | Jamie | Summary | CDS | CDS upgrade: if all else fails try throwing metal at the problem

I roped Rolf into coming over and adding his eyes to the problem.  After much discussion we couldn't come up with any reasonable explanation for the problems we've been seeing, other than daqd just needing a lot more resources than it did before.  He said he had some old Sun SunFire X4600s from which we could pilfer memory.  I went over to Downs and ripped all the CPU/memory cards out of one of his machines and stuffed them into fb:

fb now has 8 CPU and 16G of RAM

Unfortunately, this is still not enough.  Or at least it didn't solve the problem; daqd is showing the same instabilities, falling over a couple of minutes after I turn on trend frame writing.  As always, before daqd fails it starts spitting out the following to the logs:

[Thu Jul  9 16:37:09 2015] main profiler warning: 0 empty blocks in the buffer

followed by lines like:

[Thu Jul  9 16:37:27 2015] GPS MISS dcu 44 (ASX); dcu_gps=1120520264 gps=1120519812

right before it dies.

I'm no longer convinced that this is a resource issue, though, judging by the resource usage right before the crash:

top - 16:47:32 up 48 min,  5 users,  load average: 0.91, 0.62, 0.61
Tasks:   2 total,   0 running,   2 sleeping,   0 stopped,   0 zombie
Cpu(s):  8.9%us,  0.9%sy,  0.0%ni, 89.1%id,  0.9%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:  15952104k total, 13063468k used,  2888636k free,   138648k buffers
Swap:  1023996k total,        0k used,  1023996k free,  7672292k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12016 controls  20   0 8098m 4.4g 104m S  106 29.1   6:45.79 daqd
 4953 controls  20   0 53580 6092 5096 S    0  0.0   0:00.04 nds

Load average less than 1 per CPU, plenty of free memory (~3G free, 0 swap), no waiting for IO (0.9%wa), etc.  daqd is utilizing lots of threads, which should be spread across many CPUs, so even the >100% CPU should be ok.  I'm at a loss...

11402 | Mon Jul 13 01:11:14 2015 | Jamie | Summary | CDS | CDS upgrade: current assessment

daqd is still behaving unstably.  It's still unclear what the issue is.

The current failures look like disk IO contention.  However, it's hard to see any evidence that daqd is suffering from large IO wait while it's failing.

The frame size itself is currently smaller than it was before the upgrade:

controls@fb /frames/full 0$ ls -alth 11190 | head
total 369G
drwxr-xr-x 321 controls controls  36K Jul 12 22:20 ..
drwxr-xr-x   2 controls controls 268K Jun 23 06:06 .
-rw-r--r--   1 controls controls  67M Jun 23 06:06 C-R-1119099984-16.gwf
-rw-r--r--   1 controls controls  68M Jun 23 06:06 C-R-1119099968-16.gwf
-rw-r--r--   1 controls controls  69M Jun 23 06:05 C-R-1119099952-16.gwf
-rw-r--r--   1 controls controls  69M Jun 23 06:05 C-R-1119099936-16.gwf
-rw-r--r--   1 controls controls  67M Jun 23 06:05 C-R-1119099920-16.gwf
-rw-r--r--   1 controls controls  68M Jun 23 06:05 C-R-1119099904-16.gwf
-rw-r--r--   1 controls controls  68M Jun 23 06:04 C-R-1119099888-16.gwf
controls@fb /frames/full 0$ ls -alth 11208 | head
total 17G
drwxr-xr-x   2 controls controls  20K Jul 13 01:00 .
-rw-r--r--   1 controls controls  45M Jul 13 01:00 C-R-1120809632-16.gwf
-rw-r--r--   1 controls controls  50M Jul 13 01:00 C-R-1120809408-16.gwf
-rw-r--r--   1 controls controls  50M Jul 13 00:56 C-R-1120809392-16.gwf
-rw-r--r--   1 controls controls  50M Jul 13 00:56 C-R-1120809376-16.gwf
-rw-r--r--   1 controls controls  50M Jul 13 00:56 C-R-1120809360-16.gwf
-rw-r--r--   1 controls controls  50M Jul 13 00:55 C-R-1120809344-16.gwf
-rw-r--r--   1 controls controls  50M Jul 13 00:55 C-R-1120809328-16.gwf
controls@fb /frames/full 0$

This would seem to indicate that it's not an increase in frame size that's to blame.

Because slow data is now transported to daqd over the MX data concentrator network rather than via EPICS (RTS 2.8), there is more traffic on the MX network.  I note also that the channel lists have increased in size:

controls@fb /opt/rtcds/caltech/c1/chans/daq 0$ ls -alt archive/C1LSC* | head -20
-rw-r--r-- 1 4294967294 4294967294 262554 Jul  6 18:21 archive/C1LSC_150706_182146.ini
-rw-r--r-- 1 4294967294 4294967294 262554 Jul  6 18:16 archive/C1LSC_150706_181603.ini
-rw-r--r-- 1 4294967294 4294967294 262554 Jul  6 16:09 archive/C1LSC_150706_160946.ini
-rw-r--r-- 1 4294967294 4294967294  43366 Jul  1 16:05 archive/C1LSC_150701_160519.ini
-rw-r--r-- 1 4294967294 4294967294  43366 Jun 25 15:47 archive/C1LSC_150625_154739.ini
...

I would have thought, though, that data transmission errors would show up in the daqd status bits.

11404 | Mon Jul 13 18:12:50 2015 | Jamie | Summary | CDS | CDS upgrade: left running in semi-stable configuration

I have been watching daqd all day and I don't feel particularly closer to understanding what the issues are.  However, things are at least semi-stable at the moment.

Interestingly, the stability appears highly variable.  This morning, daqd was very unstable and was crashing within a couple of minutes of starting.  However this afternoon, things seemed much more stable.  As of this moment, daqd has been running for 25 minutes now, writing full frames as well as minute and second trends (no minute_raw), without any issues.  What has changed?

To reiterate, I have been closely watching disk IO to /frames.  I see no indication that there is any disk contention while daqd is failing.  It's still possible, though, that there are disk IO issues affecting daqd at a level that is not readily visible.  From dstat, the frame writes are visible, but nothing else.

I have made one change that could be positively affecting things right now: I un-exported /frames from NFS.  This eliminates anything external from reading /frames over the network.  In particular, it also shuts off the transfer of frames to LDAS.  Since I've done this, daqd has appeared to be more stable.  It's NOT totally stable, though, as the instance that I described above did eventually just die after 43 minutes, as I was writing this.
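
(For the record, un-exporting /frames is just the standard NFS procedure; a sketch, assuming the /frames line gets commented out in /etc/exports first:)

controls@fb ~ 0$ sudo exportfs -r    # re-sync the active exports after editing /etc/exports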

In any event, as things are currently as stable as I've seen them, I'm leaving it running in this configuration for the moment, with the following relevant daqdrc parameters:

start main 16;
start frame-saver;
sync frame-saver;
start trender 60 60;
start trend-frame-saver;
sync trend-frame-saver;
start minute-trend-frame-saver;
sync minute-trend-frame-saver;
start profiler;
start trend profiler;
11406 | Tue Jul 14 09:08:37 2015 | Jamie | Summary | CDS | CDS upgrade: left running in semi-stable configuration

Overnight daqd restarted itself only about twice an hour, which is an improvement:

controls@fb /opt/rtcds/caltech/c1/target/fb 0$ tail logs/restart.log
daqd: Tue Jul 14 03:13:50 PDT 2015
daqd: Tue Jul 14 04:01:39 PDT 2015
daqd: Tue Jul 14 04:09:57 PDT 2015
daqd: Tue Jul 14 05:02:46 PDT 2015
daqd: Tue Jul 14 06:01:57 PDT 2015
daqd: Tue Jul 14 06:43:18 PDT 2015
daqd: Tue Jul 14 07:02:19 PDT 2015
daqd: Tue Jul 14 07:58:16 PDT 2015
daqd: Tue Jul 14 08:02:44 PDT 2015
daqd: Tue Jul 14 09:02:24 PDT 2015

Un-exporting /frames might have helped a bit.  However, the problem is obviously still not fixed.

11412 | Tue Jul 14 16:51:01 2015 | Jamie | Summary | CDS | CDS upgrade: problem is not disk access

I think I have now determined once and for all that the daqd problems are NOT due to disk IO contention.

I have mounted a tmpfs at /frames/tmp and have told daqd to write frames there.  The tmpfs exists entirely in RAM.  There is essentially zero IO wait for such a filesystem, so daqd should never have trouble writing out the frames.
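
The tmpfs mount is the standard one; a sketch (the size is an assumption):

controls@fb ~ 0$ sudo mkdir -p /frames/tmp
controls@fb ~ 0$ sudo mount -t tmpfs -o size=4g tmpfs /frames/tmp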

And yet daqd continues to fail with the "0 empty blocks in the buffer" warnings.  I've been down a rabbit hole.

11415 | Wed Jul 15 13:19:14 2015 | Jamie | Summary | CDS | CDS upgrade: reducing mx end-points as last ditch effort

I tried one last thing, suggested by Keith and Gerrit: reducing the number of mx end-points on fb to one, which should reduce the total number of fb threads, in the hope that the extra threads were causing the chokes.

On Tue, Jul 14 2015, Keith Thorne <kthorne@ligo-la.caltech.edu> wrote:
> Assumptions
>  1) Before the upgrade (from RCG 2.6?), the DAQ had been working, reading out front-ends, writing frames trends
>  2) In upgrading to RCG 2.9, the mx start-up on the frame builder was modified to use multiple end-points
> (i.e. /etc/init.d/mx has a line like
> # 1 10G card - X2
> MX_MODULE_PARAMS="mx_max_instance=1 mx_max_endpoints=16 $MX_MODULE_PARAMS"
>  (This can be confirmed by the daqd log file with lines at the top like
> 263596
> MX has 16 maximum end-points configured
> 2 MX NICs available
> [Fri Jul 10 16:12:50 2015] ->4: set thread_stack_size=10240
> [Fri Jul 10 16:12:50 2015] new threads will be created with the stack of size 10
> 240K
>
> If this is the case, the problem may be that the additional thread on the frame-builder (one per end-point) take up so many slots on the 8-core
> frame-builder that they interrupt the frame-writing thread, thus preventing the main buffer from being emptied.  
>
> One could go back to a single end-point. This only helps keep restart of front-end A from hiccuping DAQ for front-end B.
>
> You would have to remove code on front-ends (/etc/init.d/mx_stream) that chooses endpoints. i.e.
> # find line number in rtsystab. Use that to mx_stream slot on card (0-15)
> line_num=`grep -v ^# /etc/rtsystab | grep --perl-regexp -n "^${hostname}\s" | se
> d 's/^\([0-9]*\):.*/\1/g'`
> line_off=$(expr $line_num - 1)
> epnum=$(expr $line_off % 2)
> cnum=$(expr $line_off / 2)
>
>     start-stop-daemon --start --quiet -b -m --pidfile /var/log/mx_stream0.pid --exec /opt/rtcds/tst/x2/target/x2daqdc0/mx_stream -- -e 0 -r "$epnum" -W 0 -w 0 -s "$sys" -d x2daqdc0:$cnum -l /opt/rtcds/tst/x2/target/x2daqdc0/mx_stream_logs/$hostname.log

As per Keith's suggestion, I modified the mx startup script to initialize only a single endpoint, and I modified the mx_stream startup to point them all to endpoint 0.  I verified that daqd was indeed using a single MX end-point:

MX has 1 maximum end-points configured

It didn't help.  After 5-10 minutes daqd crashes with the same "0 empty blocks" messages.

I should also mention that I'm pretty sure the start of these messages does not coincide with any frame writing to disk; further evidence that it's not a disk IO issue.

Keith is looking at the system now, to see if he can spot anything obvious.  If not, I will start reverting to 2.5.

11417 | Wed Jul 15 18:19:12 2015 | Jamie | Summary | CDS | CDS upgrade: tentative stability?

Keith Thorne provided his eyes on the situation today and had some suggestions that might have helped things:

Reorder ini file list in master file.  Apparently the EDCU.ini file (C0EDCU.ini in our case), which describes EPICS subscriptions to be recorded by the daq, now has to be specified *after* all other front end ini files.  It's unclear why, but it has something to do with RTS 2.8 which changed all slow channels to be transported over the mx network.  This alone did not fix the problem, though.

Increase second trend frame size.  Interestingly, this might have been the key.  The second trend frame size was increased to 600 seconds:

start trender 600 60;

The two numbers are the lengths in seconds for the second and minute trends respectively.  They had been set to "60 60", but Keith suggested that longer second trend frames are better, for whatever reason.  It seems he may be right, given that daqd has been running and writing full and trend frames for 1.5 hours now without issue. 


As I'm writing this, though, the daqd just crashed again.  I note, though, that it's right after the hour, and immediately following writing out a one-hour minute trend file.  We've been seeing these on-the-hour crashes of daqd for quite a while now.  So maybe this is nothing new.  I've actually been wondering if the hourly daqd crashes were associated with writing out the minute trend frames, and I think we now have more evidence to point to that.

If increasing the size of the second trend frames from 60 seconds (35M) to 600 seconds (70M) made a difference in stability, could there be an issue with writing out files smaller than some size?  The full frames are 60M, and the minute trends are 35M.

  11427   Sat Jul 18 15:37:19 2015 JamieSummaryCDSCDS upgrade: current status

So it appears we have found a semi-stable configuration for the DAQ system post-upgrade.

Here are the issues:

daqd

daqd is running mostly stably for the moment, although it still crashes at the top of every hour (see below).  Here are the relevant points about the current configuration:

  • recording data from only a subset of front-ends, to reduce the overall load:
    • c1x01
    • c1scx
    • c1x02
    • c1sus
    • c1mcs
    • c1pem
    • c1x04
    • c1lsc
    • c1ass
    • c1x05
    • c1scy
  • 16 second main buffer:
    start main 16;
  • trend lengths: second: 600, minute: 60
    start trender 600 60;
  • writing to frames:
    • full
    • second
    • minute
    • (NOT raw minute trends)
  • frame compression ON (the corresponding daqdrc lines are sketched below)
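
Collected into daqdrc form, the configuration above looks roughly like this (a sketch: the "start main" and "start trender" lines are verbatim from our config, while the frame-saver lines are the standard daqd commands and the exact file contents are assumptions):

# daqdrc sketch for the configuration above
start main 16;                    # 16 second main buffer
start trender 600 60;             # 600 / 60 trend frame lengths (second / minute trends)
start frame-saver;                # write full frames
start trend-frame-saver;          # write second trend frames
start minute-trend-frame-saver;   # write minute trend frames
# (no raw minute trend saver)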

This eliminates most of the random daqd crashing.  However, daqd still crashes at the top of every hour, after writing out the minute trend frame.  It's still unclear what the issue is, but Keith is investigating.  In some sense this is no worse than where we were before the upgrade, since daqd was crashing hourly then too.  It's still crappy, though, so hopefully we'll figure something out.

The inittab on fb automatically restarts daqd after it crashes, and monit on all of the front ends automatically restarts the mx_stream processes.
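
For reference, the two watchdog pieces look roughly like this (a sketch; the inittab entry name, runlevel, and paths are assumptions):

# /etc/inittab on fb -- respawn daqd whenever it exits
daqd:3:respawn:/opt/rtcds/caltech/c1/target/fb/daqd -c /opt/rtcds/caltech/c1/target/fb/daqdrc

# monit stanza on each front end -- restart mx_stream if it dies
check process mx_stream with pidfile /var/log/mx_stream0.pid
    start program = "/etc/init.d/mx_stream start"
    stop program = "/etc/init.d/mx_stream stop"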

front ends

The front end modules are mostly running fine.

One issue is that the execution times seem to have increased a bit, which is problematic for models that were already on the hairy edge.  For instance, the rough average for c1sus has gone from ~48us to ~50us.  This is most problematic for c1cal, which is now running at ~66us against its ~60us cycle budget, which is obviously untenable.  We'll need to reduce the load in c1cal somehow.

All other front end models seem to be working fine, but a full test is still needed.

There was an issue with the DACs on c1sus, but after a reboot everything came up fine and the optics are now damped.

  11455   Tue Jul 28 17:07:45 2015 JamieUpdateGeneralData missing
Quote:

For the past couple of days, the summary pages have shown minute trend data disappear at 12:00 UTC (05:00 AM local time). This seems to be the case for all channels that we plot, see e.g. https://nodus.ligo.caltech.edu:30889/detcharsummary/day/20150724/ioo/. Using Dataviewer, Koji has checked that indeed the frames seem to have disappeared from disk. The data come back at 24 UTC (5pm local). Any ideas why this might be?

Possible explanations:

  • The data transfers to LDAS had been shut off while we were doing the DAQ debugging, and I don't know if they have been turned back on.  This is unlikely to be the problem, though, since you would probably see no data at all if it were.
  • wiper script parameters might have been changed to store less of the trend data for some reason.
  • Frame size is different and therefore wiper script parameters need to be adjusted.
  • Steve deleted it all.
  • ...
  13125   Wed Jul 19 08:37:21 2017 JamieUpdateCDSUpdate on front-end/DAQ rebuild

After the catastrophic fb disk failure last week we lost essentially the entire front end system (not any of the userapp code, but the front end boot server, operating system, and DAQ).  The fb disk was entirely unrecoverable, so we've been trying to rebuild everything from the bits and pieces lying around, and some disks that Keith Thorne sent from LLO.  We're trying to get the front ends working first, and will work on recovering daqd after.

Luckily, fb1, which was being configured as an fb replacement, is mostly fully configured, including having a copy of the front end diskless root image.  We set up fb1 as the new boot server and were able to get the front ends booting again.  Unfortunately, we've been having trouble running and building models, so something is still amiss.  We've been taking a three-pronged approach to getting the front ends running:

  • /diskless/root.fb: This involves booting the front ends from the backup of the diskless root from fb.  Runs gentoo kernel 2.6.34.1.  This should correspond to the environment that all models were built and running against.  But something is missing in the configuration.  The front ends were also mounting /opt from fb, which included the dolphin drivers, and we don't have a copy of that, so models aren't loading or recompiling.
  • /diskless/root.x1boot: Keith sent a disk image of the entire x1boot server from LLO.  It uses gentoo kernel 3.0.8.  This ostensibly includes everything we should need to run the front ends, but it's unfortunately configured with newer versions of some of the software and also isn't loading our existing models or building new ones.  This also seems to be having issues with the dolphin drivers.
  • /diskless/root.jessie: This is an entirely new boot image built from scratch with Debian jessie, using an RTS-patched 3.2 kernel.  This would use the latest versions of everything.  It's mostly working; we just need to rebuild the dolphin driver from source.

It seems that in all cases we need to rebuild the dolphin drivers from source.
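
The rebuild itself should just be a standard out-of-tree kernel module build against whichever kernel each image runs, along these lines (a generic sketch; the dolphin/DIS source location and make targets are assumptions):

# generic sketch of an out-of-tree kernel module rebuild
cd /opt/src/dis                              # hypothetical dolphin driver source tree
make KDIR=/lib/modules/$(uname -r)/build     # build against the target kernel's headers
sudo make install
sudo depmod -a                               # refresh module dependency info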

  13127   Wed Jul 19 14:26:50 2017 JamieUpdateCDSUpdate on front-end/DAQ rebuild

 

Quote:

After the catastrophic fb disk failure last week we lost essentially the entire front end system (not any of the userapp code, but the front end boot server, operating system, and DAQ).  The fb disk was entirely unrecoverable, so we've been trying to rebuild everything from the bits and pieces lying around, and some disks that Keith Thorne sent from LLO.  We're trying to get the front ends working first, and will work on recovering daqd after.

Luckily, fb1, which was being configured as an fb replacement, is mostly fully configured, including having a copy of the front end diskless root image.  We set up fb1 as the new boot server and were able to get the front ends booting again.  Unfortunately, we've been having trouble running and building models, so something is still amiss.  We've been taking a three-pronged approach to getting the front ends running:

  • /diskless/root.fb: This involves booting the front ends from the backup of the diskless root from fb.  Runs gentoo kernel 2.6.34.1.  This should correspond to the environment that all models were built and running against.  But something is missing in the configuration.  The front ends were also mounting /opt from fb, which included the dolphin drivers, and we don't have a copy of that, so models aren't loading or recompiling.
  • /diskless/root.x1boot: Keith sent a disk image of the entire x1boot server from LLO.  It uses gentoo kernel 3.0.8.  This ostensibly includes everything we should need to run the front ends, but it's unfortunately configured with newer versions of some of the software and also isn't loading our existing models or building new ones.  This also seems to be having issues with the dolphin drivers.
  • /diskless/root.jessie: This is an entirely new boot image built from scratch with Debian jessie, using an RTS-patched 3.2 kernel.  This would use the latest versions of everything.  It's mostly working; we just need to rebuild the dolphin driver from source.

It seems that in all cases we need to rebuild the dolphin drivers from source.

To clarify, we're able to boot the x1boot image with the existing 2.6.25 kernel that we have from fb.  The issue with the root.x1boot image is not the kernel version but some of the other support libraries, such as dolphin.

  13130   Fri Jul 21 18:03:17 2017 JamieUpdateCDSUpdate on front-end/DAQ rebuild

Update:

  • front ends booting with the new Debian jessie diskless root image and a linux 3.2 version of the RTS-patched kernel
  • dolphin is configured correctly and running on c1lsc and c1sus
  • models building and running with RCG 3.0.3

Up next:

  • add c1ioo to the dolphin network
  • recompile/restart all front end models
  • daqd

I'll try to get the first two of those done tomorrow, although it's unclear what model updates we'll have to do to get things working with the newer RCG.

 

  13132   Sun Jul 23 15:00:28 2017 JamieOmnistructureVACstrange sound around X arm vacuum pumps

While walking down to the X end to reset c1iscex I heard what I would call a "rhythmic squnching" sound coming from under the turbo pump.  I would have said the sound was coming from a roughing pump, but none of them are on (as far as I can tell).

Steve, maybe look into this??

  13136   Mon Jul 24 10:59:08 2017 JamieUpdateCDSc1iscex models died
Quote:

This morning, all the c1iscex models were dead. Attachment #1 shows the state of the cds overview screen when I came in. The machine itself was ssh-able, so I just restarted all the models and they came back online without fuss.

This was me.  I had rebooted that machine and hadn't restarted the models.  Sorry for the confusion.

  13138   Mon Jul 24 19:28:55 2017 JamieUpdateCDSfront end MX stream network working, glitches in c1ioo fixed

MX/OpenMX network running

Today I got the mx/open-mx networking working for the front ends.  This required some tweaking to the network interface configuration for the diskless front ends, and recompiling mx and open-mx for the newer kernel (a rough sketch of the rebuild follows the mx_info output below).  Again, this will all be documented.

controls@fb1:~ 0$ /opt/mx/bin/mx_info
MX Version: 1.2.16
MX Build: root@fb1:/opt/src/mx-1.2.16 Mon Jul 24 11:33:57 PDT 2017
1 Myrinet board installed.
The MX driver is configured to support a maximum of:
    8 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
===================================================================
Instance #0:  364.4 MHz LANai, PCI-E x8, 2 MB SRAM, on NUMA node 0
    Status:        Running, P0: Link Up
    Network:    Ethernet 10G

    MAC Address:    00:60:dd:43:74:62
    Product code:    10G-PCIE-8B-S
    Part number:    09-04228
    Serial number:    485052
    Mapper:        00:60:dd:43:74:62, version = 0x00000000, configured
    Mapped hosts:    6

                                                        ROUTE COUNT
INDEX    MAC ADDRESS     HOST NAME                        P0
-----    -----------     ---------                        ---
   0) 00:60:dd:43:74:62 fb1:0                             1,0
   1) 00:30:48:be:11:5d c1iscex:0                         1,0
   2) 00:30:48:bf:69:4f c1lsc:0                           1,0
   3) 00:25:90:0d:75:bb c1sus:0                           1,0
   4) 00:30:48:d6:11:17 c1iscey:0                         1,0
   5) 00:14:4f:40:64:25 c1ioo:0                           1,0
controls@fb1:~ 0$
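
For reference, the mx rebuild mentioned above was essentially the standard configure/make cycle against the new kernel (a sketch; the source path matches the MX build string in the output, but the exact configure options are assumptions):

# sketch of the mx rebuild (open-mx follows the same pattern)
cd /opt/src/mx-1.2.16      # path from the MX build string above
./configure                # any kernel-specific configure options are assumptions
make
sudo make install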

c1ioo timing glitches fixed

I also checked the BIOS on c1ioo and found that the serial port was enabled, which is known to cause timing glitches.  I turned off the serial port (and some power management stuff), and rebooted, and all the c1ioo timing glitches seem to have gone away.

It's unclear why this problem is only showing up now.  Serial ports have always caused timing trouble, so it seems unlikely this is specific to the newer kernel.  Could the BIOS have somehow been reset during the power glitch?

In any event, all the front ends are now booting cleanly, with all dolphin and mx networking coming up automatically, and all models running stably.

Now for daqd...
