ID   Date   Author   Type   Category   Subject
  11247   Sat Apr 25 00:20:16 2015   rana   Update   CDS   megatron python autoMC cron

Upgraded python on megatron. Added lines to the crontab to run autoMX.py. Edited crontab to have a PYTHONPATH so that it can run .py stuff.

But autoMX.py is still not working from inside cron, only from the command line.
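For context, the crontab additions described here would look something like the sketch below (the paths, interval, and autoMX.py location are illustrative only, and per the next entry this still did not get the Python script running under cron):

PYTHONPATH=/opt/rtcds/caltech/c1/scripts
PATH=/usr/local/bin:/usr/bin:/bin
# check the MX streams every 5 minutes, logging output for debugging (hypothetical paths)
*/5 * * * * /usr/bin/python /opt/rtcds/caltech/c1/scripts/cds/autoMX.py >> /opt/rtcds/caltech/c1/scripts/cds/autoMX.log 2>&1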

  11250   Sat Apr 25 22:17:49 2015   rana   Update   CDS   MXstream restart script working (beta)

Since python from crontab seemed intractable, I replaced autoMX.py with a soft link that points at autoMX.sh.

This is a simple BASH script that looks at the LSC FB stat (C1:DAQ-DC0_C1LSC_STATUS), and runs the restart mxstream script if it is non-zero.

So far it's run 5 times successfully. I guess this is good enough for now. Later on, someone ought to make it loop over the other front ends, but this ought to catch 99% of the FB issues.
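For reference, a minimal watchdog of this kind might look like the sketch below. The channel name is the one quoted above; the restart-script name and the logging are assumptions, not the actual autoMX.sh.

#!/bin/bash
# Sketch: read the LSC front-end's DAQ status word and restart mx_stream if it is non-zero.
scripts=/opt/rtcds/caltech/c1/scripts        # assumed scripts directory
date
stat=$(caget -t C1:DAQ-DC0_C1LSC_STATUS)     # 0 means the LSC - FB connection is healthy
if [ "$stat" != "0" ]; then
    echo "LSC - FB bad. Running restart:"
    "$scripts"/cds/restart_mx_streams        # hypothetical name for the mxstream restart script
fi
echo "$stat"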

  11280   Mon May 11 13:21:25 2015   manasa   Update   CDS   c1lsp and c1sup not running

I found the c1lsp and c1sup models not running anymore on c1lsc (white blocks for status lights on medm).

To fix this, I ssh'd into c1lsc. c1lsc status did not show c1lsp and c1sup models running on it.

I tried the usual rtcds restart <model name> for both and that returned error "Cannot start/stop model 'c1XXX' on host c1lsc".

I also tried rtcds restart all on c1lsc, but that has NOT brought the models back.

Does anyone know how I can fix this??

c1sup runs some of the suspension controls. So I am afraid that the drift and frequent unlocking of the arms we see might be related to this.

 

P.S. We might also want to add the FE status channels to the summary pages.

  11282   Mon May 11 14:08:19 2015   manasa   Update   CDS   c1lsp and c1sup removed?

I just found out that the c1lsp and c1sup models no longer exist on the FE status medm screens. I am assuming some changes were made to the models as well.

Earlier today, I was looking at some of the old medm screens running on Donatella that did not reflect this modification. 

Did I miss any elogs about this or was this change not elogged??

Quote:

I found the c1lsp and c1sup models not running anymore on c1lsc (white blocks for status lights on medm).

To fix this, I ssh'd into c1lsc. c1lsc status did not show c1lsp and c1sup models running on it.

I tried the usual rtcds restart <model name> for both and that returned error "Cannot start/stop model 'c1XXX' on host c1lsc".

I also tried rtcds restart all on c1lsc, but that has NOT brought back the models alive.

Does anyone know how I can fix this??

c1sup runs some the suspension controls. So I am afraid that the drift and frequent unlocking of the arms we see might be related to this.

 

P.S. We might also want to add the FE status channels to the summary pages.

 

  11285   Tue May 12 08:51:08 2015   ericq   Update   CDS   c1lsp and c1sup removed?
Quote:

was this change not elogged??

This is my sin.

Back in February (around the 25th) I modified c1sus.mdl, removing the simulated plant connections we weren't using from c1lsp and c1sup. This was included in the model's svn log, but not elogged.

The models don't start with the rtcds restart shortcut, because I removed them from the c1lsc line in FB:/diskless/root/etc/rtsystab (or c1lsc:/etc/rtsystab). There is a commented out line in there that can be uncommented to restore them to the list of models c1lsc is allowed to run. 

However, I wouldn't expect the models not running to affect the suspension drift, since the connections from them to c1sus have been removed. If we still have trends from early February, we could look and see whether the drift was happening before I made this change.

  11293   Sat May 16 20:37:09 2015   rana   HowTo   CDS   Bypassing the CDSUTILS prefix issue

The CDSUTILS package has a feature where it substitutes in a C1 or H1 or L1 prefix depending upon what site you are at. The idea is that this should make code portable between LLO and LHO.

Here at the 40m, we have no need to do that, so it's better for us to be able to copy and paste channel names directly from MEDM or whatever without having to remove the "C1:" from all over the place.

The way to do this on the command line is (in bash) to type:

export IFO=''


To make this easier on us, I have implemented this in our shared .bashrc so that it's always the case. This might break some scripts which have been adapted to use the weird CDSUTILS convention, so beware and fix appropriately.

  11302   Mon May 18 16:56:12 2015   ericq   HowTo   CDS   Bypassing the CDSUTILS prefix issue
Quote:

export IFO=''

This makes things act weird:

controls@pianosa|MC 1> z avg 1 "C1:LSC-TRY_OUT"
IFO environment variable not specified.

  11304   Mon May 18 17:44:30 2015   rana   HowTo   CDS   Bypassing the CDSUTILS prefix issue

Too weird. I undid my changes. We'll have to make the C1: stuff work inside each python script.

Quote:
Quote:

export IFO=''

This makes things act weird:

controls@pianosa|MC 1> z avg 1 "C1:LSC-TRY_OUT"
IFO environment variable not specified.

 

  11312   Tue May 19 17:03:34 2015   Koji   Update   CDS   MXstream restart script working (beta)

AutoMX is resetting mx_stream every 5 minutes. Basically every time AutoMX is called,
it resets mx_stream. Is mx_stream really stalling that often? Or is the script detecting false alarms?


> tail -200 /opt/rtcds/caltech/c1/scripts/cds/autoMX.log

Tue May 19 16:43:01 PDT 2015
LSC - FB bad. Runnning restart:
 * Stopping mx_stream ...                                                 [ ok ]
 * Starting mx_stream ...                                                 [ ok ]
Connection to c1sus closed.
 * Stopping mx_stream ...                                                 [ ok ]
 * Starting mx_stream ...                                                 [ ok ]
Connection to c1lsc closed.
 * Stopping mx_stream ...                                                 [ ok ]
 * Starting mx_stream ...                                                 [ ok ]
Connection to c1ioo closed.
 * Stopping mx_stream ...                                                 [ ok ]
 * Starting mx_stream ...                                                 [ ok ]
Connection to c1iscex closed.
 * Stopping mx_stream ...                                                 [ ok ]
 * Starting mx_stream ...                                                 [ ok ]
Connection to c1iscey closed.
0
Tue May 19 16:48:02 PDT 2015
LSC - FB bad. Runnning restart:
 * Stopping mx_stream ...                                                 [ ok ]
 * Starting mx_stream ...                                                 [ ok ]
Connection to c1sus closed.
 * Stopping mx_stream ...                                                 [ ok ]
 * Starting mx_stream ...                                                 [ ok ]
Connection to c1lsc closed.
ssh_exchange_identification: read: Connection reset by peer
 * Stopping mx_stream ...                                                 [ ok ]
 * Starting mx_stream ...                                                 [ ok ]
Connection to c1iscex closed.
 * Stopping mx_stream ...                                                 [ ok ]
 * Starting mx_stream ...                                                 [ ok ]
Connection to c1iscey closed.
0
Tue May 19 16:53:01 PDT 2015
LSC - FB bad. Runnning restart:
 * Stopping mx_stream ...                                                 [ ok ]
 * Starting mx_stream ...                                                 [ ok ]
Connection to c1sus closed.
 * Stopping mx_stream ...                                                 [ ok ]
 * Starting mx_stream ...                                                 [ ok ]
Connection to c1lsc closed.
 * Stopping mx_stream ...                                                 [ ok ]
 * Starting mx_stream ...                                                 [ ok ]
Connection to c1ioo closed.
 * Stopping mx_stream ...                                                 [ ok ]
 * Starting mx_stream ...                                                 [ ok ]
Connection to c1iscex closed.
 * Stopping mx_stream ...                                                 [ ok ]
 * Starting mx_stream ...                                                 [ ok ]
Connection to c1iscey closed.

  11314   Tue May 19 18:38:33 2015   rana   Update   CDS   MXstream restart script working (beta)

Good catch; that was some seriously bad programming on my part. I had some undeclared variable garbage going on. I fixed it and re-implemented the script in CRON on megatron. The log file shows that it has detected no problems for the last several checks. I'll check it again tomorrow to see whether it's behaving well or not.

  11340   Mon Jun 1 10:26:53 2015   ericq   Update   CDS   RCG Upgrade Imminent

We are planning to upgrade the 40m CDS system to the latest version of the LIGO RCG software in three weeks, when Jamie is back in town.

Jamie and I spoke last week to hash out a general plan for the upgrade and the preparations I can make in the meantime.


Preparation of our models for the upgrade includes:

  • Check if any of the default RCG parts (filter modules, etc.) have substantially different default behavior / ports
  • Cleaning up unterminated/hanging connections in the diagrams (Jamie tells me RCG is more strict about this now)
  • Going through all of the build logs for our models to find what custom blocks are being pulled in from the userapps svn
  • Confirm all of our running blocks and models are committed to svn 
  • In a new, isolated, folder, checkout the latest version of the userapps repo
  • See what blocks have changed, and what model changes might be necessary.

Once we think we know what needs to change in our models, we can check out the latest version of the RCG source without linking it as the active version, create a new build directory without touching the old one, and create new copies of the 40m models with the necessary modifications. This way, we can work on getting all of the 40m models compiled without touching any of the live, running systems.

Once our models are compiling successfully, we can work on building daqd, nds, mxstream, etc. 


Additionally, we want to have some set of tests and diagnostics, to make sure we have not introduced unwanted behavior. 

To this end I will create some test models and DTT templates, where a series of measurements can be run, like

  • OLTF/delay measurement of a single all-digital loop within one model
  • OLTF/delay measurements of a few all-digital loops split across two models, using IPC communication (RFM, Dolphin)
  • Hook up DAC -> resistor/amplifier/??? -> ADC, to check things like DAC output, ADC noise levels, IOP delays.

I'll run these tests before touching anything, and make sure I understand all of the results, so that an apples-to-apples comparison can be made after the upgrade is complete.


Updates will be posted as I hash things out. I'm sure we have not yet thought of everything to think about and test, so ideas and feedback are very welcome. 

  11342   Mon Jun 1 20:05:36 2015   rana   Update   CDS   RCG Upgrade Imminent

yes

Quote:

 

Additionally, we want to have some set of tests and diagnostics, to make sure we have not introduced unwanted behavior. 

To this end I will create some test models and DTT templates, where a series of measurements can be run, like

  • OLTF/delay measurement of a single all-digital loop within one model
  • OLTF/delay measurements of a few all-digital loops split across two models, using IPC communcation, RFM, dolphin
  • Hook up DAC -> resistor/amplifier/??? -> ADC, to check things like DAC output, ADC noise levels, IOP delays.

I'll run these test before touching anything, and make sure I understand all of the results, so that an apples-to-apples comparison can be made after the upgrade is complete. 

I got goosebumps just imagining this.

  11347   Mon Jun 8 15:51:31 2015   ericq   Update   CDS   RCG Diagnostics

I've started making some model changes for RCG diagnostic tests. 

I put some blocks down in C1TST and C1RFM to test the delays of all-digital loops and one loop with a direct DAC -> ADC connection (currently a janky 1-pin LEMO -> BNC -> 2-pin LEMO arrangement, which will be improved).

Here's what C1TST looks like now. 

I've taken TFs of all three loops. The all digital loops are flat on the order of microdBs. 

The delay in loop A (single loop, one model) is consistent with one 16k cycle, plus or minus 0.25 nsec. 

The delay in loop C (single loop, two models connected via RFM) is consistent with two 16k cycles, plus or minus 0.5 nsec. 

I haven't yet grabbed the whitening and AA/AI shapes for loop B, to calibrate the real delay.

All of these files currently live in /users/ericq/2015-06-CDSdiag, but I'll make somewhere outside of the user directory to collect all of these tests soon. 

Attachment 1: newTST.png
newTST.png
  11358   Mon Jun 15 11:54:44 2015   ericq   Update   CDS   Parts not in SVN

I ran the following command to find which models/parts are not under version control, or have modifications not committed:

grep "mdl" $(cat models.txt) | awk '{print $NF}' | sort | uniq | xargs svn status

models.txt includes lines like "/opt/rtcds/caltech/c1/rtbuild/c1ass.log" for each running model. These are the build logs that detail every file being sourced. 
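(If models.txt ever needs regenerating, one crude way, assuming every model of interest has a build log under rtbuild, is

ls /opt/rtcds/caltech/c1/rtbuild/*.log > models.txt

with any non-running models pruned from the list by hand.)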

The resultant list is:

?       /opt/rtcds/userapps/release/cds/common/models/BLRMS_HIGHFREQ.mdl
C       /opt/rtcds/userapps/release/cds/common/models/rtdemod.mdl
M       /opt/rtcds/userapps/release/cds/common/models/SCHMITTTRIGGER.mdl
?       /opt/rtcds/userapps/release/isc/c1/models/blrms.mdl
?       /opt/rtcds/userapps/release/isc/c1/models/IQLOCK.mdl
?       /opt/rtcds/userapps/release/isc/c1/models/PHASEROT.mdl
?       /opt/rtcds/userapps/release/sus/c1/models/QPD_WHITE_CTRL_MUX.mdl

I will commit the uncommitted c1 parts, and think about what to do about the rtdemod and SCHMITTTRIGGER parts.

Once I check out the latest revision of the userapps repo (in a separate location), I will do something similar to look for models that have changed since the svn revision that is checked out in our running system.
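One way to do that comparison once the fresh checkout exists is a recursive diff restricted to the model files, e.g. (the new checkout location below is a placeholder, since it had not been chosen when this was written):

# Sketch: list .mdl files that differ between the running tree and a fresh checkout
diff -rq /opt/rtcds/userapps/release /opt/rtcds/userapps_new | grep '\.mdl'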

  11365   Fri Jun 19 03:00:56 2015   ericq   Update   CDS   Library Model Parts examined

All simulink diagrams being used at the 40m are now under version control. I have compiled, installed, and restarted all current models to make sure that the files are all in a working state, which they seem to be. I have checked out the latest version of the userapps svn repository to /opt/rtcds/userapps2.9, to compare the files therein with our current state. 

Surprisingly, only two files in the userapps svn have been changed since they were checked out here, and only one of these is a real change of any kind. 

LSC_TRIGGER.mdl was edited at some point simply to align some drawn lines; no functionality was changed. 

SCHMITTTRIGGER.mdl was edited to change the "INVERT" epics channel from an arbitrary EPICS input, to binary (true/false) input. This does not change the connectivity diagram, and in fact, I don't think we use this option in any of our scripts, nor is it exposed on our medm screens. 

Thus, I think the only place where block changes can bite us is in the fundamental blocks in CDS_PARTS that are used in our custom 40m library parts. 

For posterity, these are the files used in compiling all of our running models. (Path base: /opt/rtcds/userapps/release)

/isc/common/models/CALIBRATION.mdl
/isc/common/models/FILTBANK_TRIGGER.mdl
/isc/common/models/LSC_TRIGGER.mdl
/isc/common/models/QPD.mdl
/cds/common/models/FILTBANK_MASK.mdl
/cds/common/models/lockin.mdl
/cds/common/models/rtbitget.mdl
/cds/common/models/rtdemod.mdl
/cds/common/models/SCHMITTTRIGGER.mdl
/cds/common/models/SQRT_SWITCH.mdl
/cds/c1/models/c1rfm.mdl
/cds/c1/models/c1tst.mdl
/cds/c1/models/c1x01.mdl
/cds/c1/models/c1x02.mdl
/cds/c1/models/c1x03.mdl
/cds/c1/models/c1x04.mdl
/cds/c1/models/c1x05.mdl
/isc/c1/models/ALS_END.mdl
/isc/c1/models/BLRMS_40M.mdl
/isc/c1/models/BLRMS_HIGHFREQ.mdl
/isc/c1/models/blrms.mdl
/isc/c1/models/c1als.mdl
/isc/c1/models/c1ass.mdl
/isc/c1/models/c1asx.mdl
/isc/c1/models/c1cal.mdl
/isc/c1/models/c1ioo.mdl
/isc/c1/models/c1lsc.mdl
/isc/c1/models/c1oaf.mdl
/isc/c1/models/c1pem.mdl
/isc/c1/models/IQLOCK.mdl
/isc/c1/models/IQ_TO_MAGPHASE.mdl
/isc/c1/models/PHASEROT.mdl
/isc/c1/models/RF_PD_WITH_WHITENING_TRIGGERING.mdl
/isc/c1/models/SENSMAT_LOCKINS.mdl
/isc/c1/models/TT_CONTROL.mdl
/isc/c1/models/UGF_SERVO_40m.mdl
/sus/c1/models/c1mcs.mdl
/sus/c1/models/c1scx.mdl
/sus/c1/models/c1scy.mdl
/sus/c1/models/c1sus.mdl
/sus/c1/models/lib/sus_single_control.mdl
/sus/c1/models/QPD_WHITE_CTRL_MUX.mdl
  11380   Fri Jun 26 23:18:52 2015   Eve Chase   Update   CDS   Summary Page Updates

Motivation:

My SURF project largely focuses on updating and improving the 40m summary pages. I began to explore and experiment with the existing summary page code to better understand it and to eventually make tangible improvements to the summary pages.

 

What I did:

I practiced moving from one tab of the summary pages to another. Specifically, I copied the ETM Optical Levers plot from SUS: OpLev to Sandbox, without removing it from its original location. Additionally, I created my own tab to further test the summary pages, titled “Eve”.

KA ed: The configuration files are located at /cvs/cds/caltech/chans/GWsummaries. It is under svn control.

Result:

All changes appeared on the summary pages without much hassle and without any errors.

  11384   Tue Jun 30 11:33:00 2015   Jamie   Summary   CDS   prepping for CDS upgrade

This is going to be a big one.  We're at version 2.5 and we're going to go to 2.9.3.

RCG components that need to be updated:

  • mbuf kernel module
  • mx_stream driver
  • iniChk.pl script
  • daqd
  • nds

Supporting software:

  • EPICS 3.14.12.2_long
  • ldas-tools (framecpp) 1.19.32-p1
  • libframe 8.17.2
  • gds 2.16.3.2
  • fftw 3.3.2

Things to watch out for:

  • RTS 2.6:
    • raw minute trend frame location has changed (CRC-based subdirectory)
    • new kernel patch
  • RTS 2.7:
    • supports "commissioning frames", which we will probably not utilize.  need to make sure that we're not writing extra frames somewhere
  • RTS 2.8:
    • "slow" (EPICS) data from the front-end processes is acquired via DAQ network, and not through EPICS.  This will increase traffic on the DAQ lan.  Hopefully this will not be an issue, and the existing network infrastructure can handle it, but it should be monitored.
  11390   Wed Jul 1 19:16:21 2015   Jamie   Summary   CDS   CDS upgrade in progress

The CDS upgrade is now underway

Here's what's happened so far:

  • Installed and linked in all the RTS supporting software packages in /opt/rtapps (only on front end machines and fb):
    controls@c1lsc ~ 2$ find /opt/rtapps/ -mindepth 1 -maxdepth 1 -type l -ls
    12582916    0 lrwxrwxrwx   1 controls 1001           12 Jul  1 13:16 /opt/rtapps/gds -> gds-2.16.3.2
    12603452    0 lrwxrwxrwx   1 controls 1001           10 Jul  1 13:17 /opt/rtapps/fftw -> fftw-3.3.2
    12603451    0 lrwxrwxrwx   1 controls 1001           15 Jul  1 13:16 /opt/rtapps/libframe -> libframe-8.17.2
    12603450    0 lrwxrwxrwx   1 controls 1001           13 Jul  1 13:16 /opt/rtapps/libmetaio -> libmetaio-8.2
    12582915    0 lrwxrwxrwx   1 controls 1001           34 Jul  1 15:24 /opt/rtapps/framecpp -> ldas-tools-1.19.32-p1/linux-x86_64
    12582914    0 lrwxrwxrwx   1 controls 1001           20 Jul  1 13:15 /opt/rtapps/epics -> epics-3.14.12.2_long
  • Checked out the RTS source for the version we'll be using: 2.9.4

/opt/rtcds/rtscore/tags/advLigoRTS-2.9.4

  • built and installed all of the RTS components:
    • mbuf
    • mx_stream
    • daqd
    • nds
    • awgtpman
       
  • mx_stream is not working. Unknown why. It won't start on the front end machines (only tested on c1lsc so far) with the following error:
    controls@c1lsc ~ 1$ /opt/rtcds/caltech/c1/target/fb/mx_stream -s c1x04 c1lsc c1ass c1oaf c1cal -d fb:0
    mmapped address is 0x7ff7b71a0000
    send len = 263596
    mx_connect failed Remote Endpoint is Closed
    controls@c1lsc ~ 1$
    
    Have contacted Keith T. and Rolf B. for backup.  This is a blocker, since this is what ferries the data from the front ends.  (A quick way to sanity-check the MX fabric from both ends is sketched after this list.)
     
  • Rebuilt almost all models.  This was good.  Initially nothing would compile because of IPC creation errors, so I moved the old chans/ipc/C1.ipc file out of the way and generated a new one and then everything compiled (of course senders have to be compiled before receivers).
    I only had to fix a couple of things in the models themselves:
    • c1ioo - unterminated FiltCtrl inputs
    • C1_SUS_SINGLE_CONTROL - unterminated FiltCtrl inputs
    • c1oaf - bad part named "STATIC". There is some hacky namespace stuff going on in the RCG. I was able to just explode that part and it now works.
    • c1lsc - unterminated FiltCtrl inputs
    Haven't installed or tried to run anything yet, but the fact they compile is good.
    Some models are not compiling because they have C code in src blocks that are throwing errors:
    • c1lsc
    • c1cal
    It shouldn't be too hard to fix whatever is causing those compile errors.
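A quick way to sanity-check the MX fabric from both ends (the same tools used in the later debugging entries) is:

# on fb: the route table at the bottom should list every front end,
# and the Status line should read "Link Up", not "Wrong Network"
/opt/mx/bin/mx_info

# on a front end (e.g. c1lsc): fb should appear in the open-mx peer table
/opt/open-mx/bin/omx_info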

That's it for today.  Will pick up again first thing tomorrow

  11393   Tue Jul 7 18:27:54 2015   Jamie   Summary   CDS   CDS upgrade: progress!

After a couple of days of struggle, I made some progress on the CDS upgrade today:

Front end status:

  • RTS upgraded to 2.9.4, and linked in as "release":

/opt/rtcds/rtscore/release -> tags/advLigoRTS-2.9.4

  • mbuf kernel module built installed
  • All front ends have been rebooted with the latest patched kernel (from 2.6 upgrade)
  • All models have been rebuilt, installed, restarted.  Only minor model issues had to be corrected (unterminated unused inputs mostly).
  • awgtpman rebuilt, and installed/running on all front-ends
  • open-mx upgraded to 1.5.2:

/opt/open-mx -> open-mx-1.5.2

  • All front ends running latest version of mx_stream, built against 2.9.4 and open-mx-1.5.2.

We have new GDS overview screens for the front end models:

It's possible that our current lack of IRIG-B GPS distribution means that the 'TIM' status bit will always be red on the IOP models.  Will consult with Rolf.

There are other new features in the front ends that I can get into later.

DAQ (fb) status:

  • daqd and nds rebuilt against 2.9.4, both now running on fb

40m daqd compile flags:

cd src/daqd
./configure --enable-debug --disable-broadcast --without-myrinet --with-mx --enable-local-timing --with-epics=/opt/rtapps/epics/base --with-framecpp=/opt/rtapps/framecpp
make
make clean
install daqd /opt/rtcds/caltech/c1/target/fb/

However, daqd has unfortunately been very unstable, and I've been trying to figure out why.  I originally thought it was some sort of timing issue, but now I'm not so sure.

I had to make the following changes to the daqdrc:

set gps_leaps = 820108813 914803214 1119744016;

That enumerates some list of leap seconds since some time.  Not sure if that actually does anything, but I added the latest leap seconds anyway:

set symm_gps_offset=315964803;

This updates the silly, arbitrary GPS offset, that is required to be correct when not using external GPS reference.

Finally, the last thing I did that finally got it running stably was to turn off all trend frame writing:

# start trender;
# start trend-frame-saver;
# sync trend-frame-saver;
# start minute-trend-frame-saver;
# sync minute-trend-frame-saver;
# start raw_minute_trend_saver;

For whatever reason, it's the trend frame writing that was causing daqd to fall over after a short amount of time.  I'll continue investigating tomorrow.

 

We still have a lot of cleanup burt restores, testing, etc. to do, but we're getting there.

  11394   Tue Jul 7 23:26:19 2015   Koji   Update   CDS   Attempt to list CDS issues

As Jamie has succeeded in getting the 40m CDS into a somewhat workable condition, I tried to list the obvious CDS issues so that we can attack them one by one.

  1. c1cal is constantly timing out now (t>60usec). c1sus is close to it (t=56~57us)
  2. We should check the trends of the CPU meters ("C1:FEC-**_CPU_METER"). In fact this should be listed in the summary pages in a new CDS tab.
  3. Probably this is related to 1): c1lsc is constantly showing an IPC error (bit0 = shmem). C1LSC_IPC_STATUS.adl indicates that this is coming from the IPC error between c1lsc and c1cal ("C1:CAL-LSC_SENSMAT_OSC_****"). This information is found by opening the C1LSC_GDS_TP.adl screen and clicking the RT NET STAT button next to the IPC error status.
  4. We wonder how the RFM access is accelerated or decelerated by this upgrade.
  5. We need tests to see if the time delays of the models/IPCs are still reasonable.
  6. LSC Overview screen has a small digest of the CDS status. Now there are many white boxes that correspond to the channels "C1:FEC-**_DIAG1".
  7. All realtime systems have default (0 or 1) epics channel values (i.e. gains, FM switches, matrices, etc). Need burtrestores.
  8. I tried to burtrestore the models but burtgooey indicated there are some errors.
  9. Detailed check of the snapshot files comparing snapshot files in /opt/rtcds/caltech/c1/burt/autoburt/snapshots/2015/Jul/7/19:07 and /opt/rtcds/caltech/c1/burt/autoburt/snapshots/2015/Jun/1/19:07 :
    • c1alsepics shows a bunch of volatile channels to be snapshot. It seems that all of the static epics channels are missing from the snapshot file. Is this related to the current omission of the slow data acquisition? => No, actually this must be due to the modification of the ALS model to accommodate the ALS in the LSC model for the new ALS setup.
    • c1lscepics was checked; indeed, the slow channels were properly snapshot. So what was the problem in burting???
    • I made a simple csh script to restore the snapshots one by one while collecting the error messages.
      This script is located as /users/koji/150707/burtrevert.sh
    • #!/bin/csh
      echo 'This script restores all of the snapshot files found in' $argv[1] '.'
      echo 'Are you sure? y/n'

      set ans = $<

      set ANS = `echo $ans | tr "[:upper:]" "[:lower:]" `
      if ($ANS == y) then
          foreach fname ($argv[1]/*epics.snap)
          echo ''
          echo '#################################'
          echo $fname
          echo '#################################'

              burtwb -f $fname
          end
      else
          echo "exiting..."
      endif
       
    •  Now I ran the command
      ./burtrevert.sh /opt/rtcds/caltech/c1/burt/autoburt/snapshots/2015/Jun/1/19:07 &>burt.log
      This lists the missing channels. The zipped log is attached to this entry.
    • Burting an old snapshot always crashes the RT process "c1sus" (not the c1sus host). If I use the snapshot newly generated today,
      the process does not crash outright; it halts at a cycle time of 74us (>60us). I left the process in that state so that we can take a new snapshot with the matrix numbers filled. Once we have the correct snapshot, we don't need to worry about this crash. Let's see.
    • c1sus still crashes with the new burt file. There must be a trigger that makes the model freeze. We need to split the burt file into pieces
      to figure out which line causes the halt.
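A minimal sketch of that bisection, assuming the usual burt .snap layout (a header block ending with "End BURT header", followed by one "CHANNEL COUNT VALUE" record per line, scalars only), might be:

#!/bin/bash
# Sketch: restore a suspect snapshot in small batches so the batch that
# freezes c1sus can be identified.  The chunk size and the use of caput
# instead of burtwb are illustrative choices, not the 40m procedure.
snap=$1
sed '1,/End BURT header/d' "$snap" > /tmp/burt_body.txt   # strip the header
split -l 50 /tmp/burt_body.txt /tmp/burt_chunk_
for chunk in /tmp/burt_chunk_*; do
    while read chan count value; do
        caput "$chan" "$value" > /dev/null
    done < "$chunk"
    read -p "$chunk restored -- is the c1sus cycle time still OK? [y/n] " ok
    [ "$ok" = y ] || { echo "culprit is somewhere in $chunk"; break; }
done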
Attachment 1: burt.log.zip
  11396   Wed Jul 8 20:37:02 2015   Jamie   Summary   CDS   CDS upgrade: one step forward, two steps back

After determining yesterday that all the daqd issues were coming from the frame writing, I started to dig into it more today.  I also spoke to Keith Thorne, and got some good suggestions from Gerrit Kuhn at GEO.

I realized that it probably wasn't the trend writing per se, but that turning on more writing to disk was causing increased load on daqd, and consequently on the system itself.  With more frame writing turned on, the memory consumption increased to the point of maxing out the physical RAM.  The system then probably started swapping, which certainly would have choked daqd.

I noticed that fb only had 4G of RAM, which Keith suggested was just not enough.  Even if the memory consumption of daqd has increased significantly, it still seems like 4G would not be enough.  I opened up fb only to find that fb actually had 8G of RAM installed!  Not sure what happened to the other 4G, but somehow they were not visible to the system.  Koji and I eventually determined, via some frankenstein operations with megatron, that the RAM was just dead.  We then pulled 4G of RAM from megatron and replaced the bad RAM in fb, so that fb now has a full 8G of RAM.

Unfortunately, when we got fb fully back up and running we found that fb is not able to see any of the other hosts on the data concentrator network.  mx_info, which displays the card and network status for the Myricom Myrinet fiber card, shows:

MX Version: 1.2.16
MX Build: controls@fb:/opt/src/mx-1.2.16 Tue May 21 10:58:40 PDT 2013
1 Myrinet board installed.
The MX driver is configured to support a maximum of:
    8 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
===================================================================
Instance #0:  299.8 MHz LANai, PCI-E x8, 2 MB SRAM, on NUMA node 0
    Status:        Running, P0: Wrong Network
    Network:    Myrinet 10G

    MAC Address:    00:60:dd:46:ea:ec
    Product code:    10G-PCIE-8AL-S
    Part number:    09-03916
    Serial number:    352143
    Mapper:        00:60:dd:46:ea:ec, version = 0x63e745ee, configured
    Mapped hosts:    1

                                                        ROUTE COUNT
INDEX    MAC ADDRESS     HOST NAME                        P0
-----    -----------     ---------                        ---
   0) 00:60:dd:46:ea:ec fb:0                            D 0,0

Note that all front end machines should be listed in the table at the bottom, and they're not.   Also note the "Wrong Network" note in the Status line above.  It appears that the card has maybe been initialized in a bad state?  Or Koji and I somehow disturbed the network when we were cleaning up things in the rack.  "sudo /etc/init.d/mx restart" on fb doesn't solve the problem.  We even rebooted fb and it didn't seem to help.

In any event, we're back to no data flow.  I'll pick up again tomorrow.

  11397   Wed Jul 8 21:02:02 2015   Jamie   Summary   CDS   CDS upgrade: another step forward, so we're back to where we started (plus a bit?)

Koji did a bit of googling to determine that the 'Wrong Network' status message could be explained by the fb Myrinet card operating in the wrong mode:
(This was the useful link to track down the issue (KA))
 

    Network:    Myrinet 10G

I didn't notice it before, but we should in fact be operating in "Ethernet" mode, since that's the fabric we're using for the DC network.  Digging a bit deeper we found that the new version of mx (1.2.16) had indeed been configured with a different compile option than the 1.2.15 version had:

controls@fb ~ 0$ grep '$ ./configure' /opt/src/mx-1.2.15/config.log          
  $ ./configure --enable-ether-mode --prefix=/opt/mx
controls@fb ~ 0$ grep '$ ./configure' /opt/src/mx-1.2.16/config.log
  $ ./configure --enable-mx-wire --prefix=/opt/mx-1.2.16
controls@fb ~ 0$

So that would entirely explain the problem.  I re-linked mx to the older version (1.2.15), reloaded the mx drivers, and everything showed up correctly:

controls@fb ~ 0$ /opt/mx/bin/mx_info
MX Version: 1.2.12
MX Build: root@fb:/root/mx-1.2.12 Mon Nov  1 13:34:38 PDT 2010
1 Myrinet board installed.
The MX driver is configured to support a maximum of:
    8 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
===================================================================
Instance #0:  299.8 MHz LANai, PCI-E x8, 2 MB SRAM, on NUMA node 0
    Status:        Running, P0: Link Up
    Network:    Ethernet 10G

    MAC Address:    00:60:dd:46:ea:ec
    Product code:    10G-PCIE-8AL-S
    Part number:    09-03916
    Serial number:    352143
    Mapper:        00:60:dd:46:ea:ec, version = 0x00000000, configured
    Mapped hosts:    6

                                                        ROUTE COUNT
INDEX    MAC ADDRESS     HOST NAME                        P0
-----    -----------     ---------                        ---
   0) 00:60:dd:46:ea:ec fb:0                              1,0
   1) 00:25:90:0d:75:bb c1sus:0                           1,0
   2) 00:30:48:be:11:5d c1iscex:0                         1,0
   3) 00:30:48:d6:11:17 c1iscey:0                         1,0
   4) 00:30:48:bf:69:4f c1lsc:0                           1,0
   5) 00:14:4f:40:64:25 c1ioo:0                           1,0
controls@fb ~ 0$

The front end hosts are also showing good omx info (as they had been previously):

controls@c1lsc ~ 0$ /opt/open-mx/bin/omx_info
Open-MX version 1.5.2
 build: controls@fb:/opt/src/open-mx-1.5.2 Tue May 21 11:03:54 PDT 2013

Found 1 boards (32 max) supporting 32 endpoints each:
 c1lsc:0 (board #0 name eth1 addr 00:30:48:bf:69:4f)
   managed by driver 'igb'

Peer table is ready, mapper is 00:30:48:d6:11:17
================================================
  0) 00:30:48:bf:69:4f c1lsc:0
  1) 00:60:dd:46:ea:ec fb:0
  2) 00:25:90:0d:75:bb c1sus:0
  3) 00:30:48:be:11:5d c1iscex:0
  4) 00:30:48:d6:11:17 c1iscey:0
  5) 00:14:4f:40:64:25 c1ioo:0
controls@c1lsc ~ 0$

This got all the mx_stream connections back up and running.

Unfortunately, daqd is back to being a bit flaky.  With all frame writing enabled we saw daqd crash again.  I then shut off all trend frame writing and we're back to a marginally stable state: we have data flowing from all front ends, and full frames are being written, but not trends.

I'll pick up on this again tomorrow, and maybe try to rebuild the new version of mx with the proper flags.

  11398   Thu Jul 9 13:26:47 2015   Jamie   Summary   CDS   CDS upgrade: new mx 1.2.16 installed

I rebuilt/installed mx 1.2.16 to use "ether-mode", instead of the default MX-10G:

controls@fb /opt/src/mx-1.2.16 0$ ./configure --enable-ether-mode --prefix=/opt/mx-1.2.16
...
controls@fb /opt/src/mx-1.2.16 0$ make
..
controls@fb /opt/src/mx-1.2.16 0$ make install
...

I then rebuilt/installed daqd so that it properly linked against the updated mx install:

controls@fb /opt/rtcds/rtscore/release/src/daqd 0$ ./configure --enable-debug --disable-broadcast --without-myrinet --with-mx --with-epics=/opt/rtapps/epics/base --with-framecpp=/opt/rtapps/framecpp --enable-local-timing
...
controls@fb /opt/rtcds/rtscore/release/src/daqd 0$ make
...
controls@fb /opt/rtcds/rtscore/release/src/daqd 0$ install daqd /opt/rtcds/caltech/c1/target/fb/

It's now back to running and receiving data from the front ends (still not stable yet, though).

  11400   Thu Jul 9 16:50:13 2015   Jamie   Summary   CDS   CDS upgrade: if all else fails try throwing metal at the problem

I roped Rolf into coming over and adding his eyes to the problem.  After much discussion we couldn't come up with any reasonable explanation for the problems we've been seeing other than daqd just needing a lot more resources than it did before.  He said he had some old Sun SunFire X4600s from which we could pilfer memory.  I went over to Downs and ripped all the CPU/memory cards out of one of his machines and stuffed them into fb:

fb now has 8 CPU and 16G of RAM

Unfortunately, this is still not enough.  Or at least it didn't solve the problem; daqd is showing the same instabilities, falling over a couple of minutes after I turn on trend frame writing.  As always, before daqd fails it starts spitting out the following to the logs:

[Thu Jul  9 16:37:09 2015] main profiler warning: 0 empty blocks in the buffer

followed by lines like:

[Thu Jul  9 16:37:27 2015] GPS MISS dcu 44 (ASX); dcu_gps=1120520264 gps=1120519812

right before it dies.

I'm no longer convinced that this is a resource issue, though, judging by the resource usage right before the crash:

top - 16:47:32 up 48 min,  5 users,  load average: 0.91, 0.62, 0.61
Tasks:   2 total,   0 running,   2 sleeping,   0 stopped,   0 zombie
Cpu(s):  8.9%us,  0.9%sy,  0.0%ni, 89.1%id,  0.9%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:  15952104k total, 13063468k used,  2888636k free,   138648k buffers
Swap:  1023996k total,        0k used,  1023996k free,  7672292k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12016 controls  20   0 8098m 4.4g 104m S  106 29.1   6:45.79 daqd
 4953 controls  20   0 53580 6092 5096 S    0  0.0   0:00.04 nds

Load average less than 1 per CPU, plenty of free memory (~3G free, 0 swap), no waiting for IO (0.9%wa), etc.  daqd is utilizing lots of  threads, which should be spread across many cpus, so even the >100%CPU should be ok.   I'm at a loss...

  11402   Mon Jul 13 01:11:14 2015   Jamie   Summary   CDS   CDS upgrade: current assessment

daqd is still behaving unstably.  It's still unclear what the issue is.

The current failures look like disk IO contention.  However, it's hard to see any evidence that daqd is suffering from large IO wait while it's failing.

The frame size itself is currently smaller than it was before the upgrade:

controls@fb /frames/full 0$ ls -alth 11190 | head
total 369G
drwxr-xr-x 321 controls controls  36K Jul 12 22:20 ..
drwxr-xr-x   2 controls controls 268K Jun 23 06:06 .
-rw-r--r--   1 controls controls  67M Jun 23 06:06 C-R-1119099984-16.gwf
-rw-r--r--   1 controls controls  68M Jun 23 06:06 C-R-1119099968-16.gwf
-rw-r--r--   1 controls controls  69M Jun 23 06:05 C-R-1119099952-16.gwf
-rw-r--r--   1 controls controls  69M Jun 23 06:05 C-R-1119099936-16.gwf
-rw-r--r--   1 controls controls  67M Jun 23 06:05 C-R-1119099920-16.gwf
-rw-r--r--   1 controls controls  68M Jun 23 06:05 C-R-1119099904-16.gwf
-rw-r--r--   1 controls controls  68M Jun 23 06:04 C-R-1119099888-16.gwf
controls@fb /frames/full 0$ ls -alth 11208 | head
total 17G
drwxr-xr-x   2 controls controls  20K Jul 13 01:00 .
-rw-r--r--   1 controls controls  45M Jul 13 01:00 C-R-1120809632-16.gwf
-rw-r--r--   1 controls controls  50M Jul 13 01:00 C-R-1120809408-16.gwf
-rw-r--r--   1 controls controls  50M Jul 13 00:56 C-R-1120809392-16.gwf
-rw-r--r--   1 controls controls  50M Jul 13 00:56 C-R-1120809376-16.gwf
-rw-r--r--   1 controls controls  50M Jul 13 00:56 C-R-1120809360-16.gwf
-rw-r--r--   1 controls controls  50M Jul 13 00:55 C-R-1120809344-16.gwf
-rw-r--r--   1 controls controls  50M Jul 13 00:55 C-R-1120809328-16.gwf
controls@fb /frames/full 0$

This would seem to indicate that it's not an increase in frame size that's to blame.

Because slow data is now transported to daqd over the MX data concentrator network rather than via EPICS (RTS 2.8), there is more traffic on the MX network.  I note also that the channel lists have increased in size:

controls@fb /opt/rtcds/caltech/c1/chans/daq 0$ ls -alt archive/C1LSC* | head -20
-rw-r--r-- 1 4294967294 4294967294 262554 Jul  6 18:21 archive/C1LSC_150706_182146.ini
-rw-r--r-- 1 4294967294 4294967294 262554 Jul  6 18:16 archive/C1LSC_150706_181603.ini
-rw-r--r-- 1 4294967294 4294967294 262554 Jul  6 16:09 archive/C1LSC_150706_160946.ini
-rw-r--r-- 1 4294967294 4294967294  43366 Jul  1 16:05 archive/C1LSC_150701_160519.ini
-rw-r--r-- 1 4294967294 4294967294  43366 Jun 25 15:47 archive/C1LSC_150625_154739.ini
...

I would have thought, though, that data transmission errors would show up in the daqd status bits.

  11404   Mon Jul 13 18:12:50 2015   Jamie   Summary   CDS   CDS upgrade: left running in semi-stable configuration

I have been watching daqd all day and I don't feel particularly closer to understanding what the issues are.  However, things are at least running in a semi-stable configuration.

Interestingly, though, the stability appears highly variable at the moment.  This morning, daqd was very unstable and was crashing within a couple of minutes of starting.  However this afternoon, things seemed much more stable.  As of this moment, daqd has been running for 25 minutes now, writing full frames as well as minute and second trends (no minute_raw), without any issues.  What has changed?

To reiterate, I have been closely watching disk IO to /frames.  I see no indication that there is any disk contention while daqd is failing.  It's still possible, though, that there are disk IO issues affecting daqd at a level that is not readily visible.  From dstat, the frame writes are visible, but nothing else.

I have made one change that could be positively affecting things right now: I un-exported /frames from NFS.  This prevents anything external from reading /frames over the network.  In particular, it also shuts off the transfer of frames to LDAS.  Since I've done this, daqd has appeared to be more stable.  It's NOT totally stable, though, as the instance that I described above did eventually just die after 43 minutes, as I was writing this.
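(For the record, un-exporting /frames amounts to something like the following; the exact /etc/exports line on fb is an assumption:)

# comment out the /frames entry in /etc/exports, then:
sudo exportfs -ra     # re-read /etc/exports
sudo exportfs -v      # confirm /frames is no longer exported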

In any event, as things are currently as stable as I've seen them, I'm leaving it running in this configuration for the moment, with the following relevant daqdrc parameters:

start main 16;
start frame-saver;
sync frame-saver;
start trender 60 60;
start trend-frame-saver;
sync trend-frame-saver;
start minute-trend-frame-saver;
sync minute-trend-frame-saver;
start profiler;
start trend profiler;
  11406   Tue Jul 14 09:08:37 2015   Jamie   Summary   CDS   CDS upgrade: left running in semi-stable configuration

Overnight daqd restarted itself only about twice an hour, which is an improvement:

controls@fb /opt/rtcds/caltech/c1/target/fb 0$ tail logs/restart.log
daqd: Tue Jul 14 03:13:50 PDT 2015
daqd: Tue Jul 14 04:01:39 PDT 2015
daqd: Tue Jul 14 04:09:57 PDT 2015
daqd: Tue Jul 14 05:02:46 PDT 2015
daqd: Tue Jul 14 06:01:57 PDT 2015
daqd: Tue Jul 14 06:43:18 PDT 2015
daqd: Tue Jul 14 07:02:19 PDT 2015
daqd: Tue Jul 14 07:58:16 PDT 2015
daqd: Tue Jul 14 08:02:44 PDT 2015
daqd: Tue Jul 14 09:02:24 PDT 2015

Un-exporting /frames might have helped a bit.  However, the problem is obviously still not fixed.

  11408   Tue Jul 14 10:28:02 2015   ericq   Summary   CDS   CDS upgrade: left running in semi-stable configuration

There remains a pattern to some of the restarts, the following times are all reported as restart times. (There are others in between, however.)

daqd: Tue Jul 14 00:02:48 PDT 2015
daqd: Tue Jul 14 01:02:32 PDT 2015
daqd: Tue Jul 14 03:02:33 PDT 2015
daqd: Tue Jul 14 05:02:46 PDT 2015
daqd: Tue Jul 14 06:01:57 PDT 2015
daqd: Tue Jul 14 07:02:19 PDT 2015
daqd: Tue Jul 14 08:02:44 PDT 2015
daqd: Tue Jul 14 09:02:24 PDT 2015
daqd: Tue Jul 14 10:02:03 PDT 2015

Before the upgrade, we suffered from hourly crashes too:

daqd_start Sun Jun 21 00:01:06 PDT 2015
daqd_start Sun Jun 21 01:03:47 PDT 2015
daqd_start Sun Jun 21 02:04:04 PDT 2015
daqd_start Sun Jun 21 03:04:35 PDT 2015
daqd_start Sun Jun 21 04:04:04 PDT 2015
daqd_start Sun Jun 21 05:03:45 PDT 2015
daqd_start Sun Jun 21 06:02:43 PDT 2015
daqd_start Sun Jun 21 07:04:42 PDT 2015
daqd_start Sun Jun 21 08:04:34 PDT 2015
daqd_start Sun Jun 21 09:03:30 PDT 2015
daqd_start Sun Jun 21 10:04:11 PDT 2015

So, this isn't necessarily new behavior, just something that remains unfixed. 
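One quick way to see the on-the-hour clustering is to tally restarts by minute-of-hour straight from the restart log quoted in the previous entry:

# count daqd restarts per minute-of-hour; a pile-up at minutes 00-04 points
# at the top-of-the-hour (minute-trend) crashes
awk '{split($5, t, ":"); print t[2]}' /opt/rtcds/caltech/c1/target/fb/logs/restart.log | sort | uniq -c | sort -rn | head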

  11409   Tue Jul 14 11:57:27 2015   jamie   Summary   CDS   CDS upgrade: left running in semi-stable configuration
Quote:

There remains a pattern to some of the restarts, the following times are all reported as restart times. (There are others in between, however.)

daqd: Tue Jul 14 00:02:48 PDT 2015
daqd: Tue Jul 14 01:02:32 PDT 2015
daqd: Tue Jul 14 03:02:33 PDT 2015
daqd: Tue Jul 14 05:02:46 PDT 2015
daqd: Tue Jul 14 06:01:57 PDT 2015
daqd: Tue Jul 14 07:02:19 PDT 2015
daqd: Tue Jul 14 08:02:44 PDT 2015
daqd: Tue Jul 14 09:02:24 PDT 2015
daqd: Tue Jul 14 10:02:03 PDT 2015

Before the upgrade, we suffered from hourly crashes too:

daqd_start Sun Jun 21 00:01:06 PDT 2015
daqd_start Sun Jun 21 01:03:47 PDT 2015
daqd_start Sun Jun 21 02:04:04 PDT 2015
daqd_start Sun Jun 21 03:04:35 PDT 2015
daqd_start Sun Jun 21 04:04:04 PDT 2015
daqd_start Sun Jun 21 05:03:45 PDT 2015
daqd_start Sun Jun 21 06:02:43 PDT 2015
daqd_start Sun Jun 21 07:04:42 PDT 2015
daqd_start Sun Jun 21 08:04:34 PDT 2015
daqd_start Sun Jun 21 09:03:30 PDT 2015
daqd_start Sun Jun 21 10:04:11 PDT 2015

So, this isn't neccesarily new behavior, just something that remains unfixed. 

That's interesting, that we're still seeing those hourly crashes.

We're not writing out the full set of channels, though, and we're getting more failures than just those at the hour, so we're still suffering.

  11410   Tue Jul 14 13:55:28 2015   jamie   Update   CDS   running test on daqd, please leave undisturbed

I'm running a test with daqd right now, so please do not disturb for the moment.

I'm temporarily writing frames into a tmpfs, which is a filesystem that exists purely in memory.  There should be ZERO IO contention for this filesystem, so if the daqd failures are due to IO then all problems should disappear.  If they don't, then we're dealing with some other problem.

There will be no data saved during this period.

  11412   Tue Jul 14 16:51:01 2015   Jamie   Summary   CDS   CDS upgrade: problem is not disk access

I think I have now determined once and for all that the daqd problems are NOT due to disk IO contention.

I have mounted a tmpfs at /frames/tmp and have told daqd to write frames there.  The tmpfs exists entirely in RAM.  There is essentially zero IO wait for such a filesystem, so daqd should never have trouble writing out the frames.
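(Mounting the tmpfs is one line; the size cap below is an assumption, chosen to hold a handful of 16-second full frames:)

sudo mkdir -p /frames/tmp
sudo mount -t tmpfs -o size=4G tmpfs /frames/tmp
# then point the frame-writer paths in daqdrc at /frames/tmp and restart daqd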

Yet daqd continues to fail with the "0 empty blocks in the buffer" warnings.  I've been down a rabbit hole.

  11413   Tue Jul 14 17:06:00 2015   jamie   Update   CDS   running test on daqd, please leave undisturbed

I have reverted daqd to the previous configuration, so that it's writing frames to disk.  It's still showing instability.

  11415   Wed Jul 15 13:19:14 2015   Jamie   Summary   CDS   CDS upgrade: reducing mx end-points as last ditch effort

I tried one last thing, suggested by Keith and Gerrit.  I tried reducing the number of mx end-points on fb to zero, which should reduce the total number of fb threads, in the hope that the extra threads were causing the chokes.

On Tue, Jul 14 2015, Keith Thorne <kthorne@ligo-la.caltech.edu> wrote:
> Assumptions
>  1) Before the upgrade (from RCG 2.6?), the DAQ had been working, reading out front-ends, writing frames trends
>  2) In upgrading to RCG 2.9, the mx start-up on the frame builder was modified to use multiple end-points
> (i.e. /etc/init.d/mx has a line like
> # 1 10G card - X2
> MX_MODULE_PARAMS="mx_max_instance=1 mx_max_endpoints=16 $MX_MODULE_PARAMS"
>  (This can be confirmed by the daqd log file with lines at the top like
> 263596
> MX has 16 maximum end-points configured
> 2 MX NICs available
> [Fri Jul 10 16:12:50 2015] ->4: set thread_stack_size=10240
> [Fri Jul 10 16:12:50 2015] new threads will be created with the stack of size 10
> 240K
>
> If this is the case, the problem may be that the additional thread on the frame-builder (one per end-point) take up so many slots on the 8-core
> frame-builder that they interrupt the frame-writing thread, thus preventing the main buffer from being emptied.  
>
> One could go back to a single end-point. This only helps keep restart of front-end A from hiccuping DAQ for front-end B.
>
> You would have to remove code on front-ends (/etc/init.d/mx_stream) that chooses endpoints. i.e.
> # find line number in rtsystab. Use that to mx_stream slot on card (0-15)
> line_num=`grep -v ^# /etc/rtsystab | grep --perl-regexp -n "^${hostname}\s" | se
> d 's/^\([0-9]*\):.*/\1/g'`
> line_off=$(expr $line_num - 1)
> epnum=$(expr $line_off % 2)
> cnum=$(expr $line_off / 2)
>
>     start-stop-daemon --start --quiet -b -m --pidfile /var/log/mx_stream0.pid --exec /opt/rtcds/tst/x2/target/x2daqdc0/mx_stream -- -e 0 -r "$epnum" -W 0 -w 0 -s "$sys" -d x2daqdc0:$cnum -l /opt/rtcds/tst/x2/target/x2daqdc0/mx_stream_logs/$hostname.log

As per Keith's suggestion, I modified the mx startup script to only initialize a single endpoint, and I modified the mx_stream startup to point them all to endpoint 0.  I verified that daqd was indeed using a single MX end-point:

MX has 1 maximum end-points configured

It didn't help.  After 5-10 minutes daqd crashes with the same "0 empty blocks" messages.

I should also mention that I'm pretty sure the start of these messages does not seem coincident with any frame writing to disk; further evidence that it's not a disk IO issue.

Keith is looking at the system now, so we'll see if he can spot anything obvious.  If not, I will start reverting to 2.5.

  11417   Wed Jul 15 18:19:12 2015   Jamie   Summary   CDS   CDS upgrade: tentative stability?

Keith Thorne provided his eyes on the situation today and had some suggestions that might have helped things

Reorder ini file list in master file.  Apparently the EDCU.ini file (C0EDCU.ini in our case), which describes EPICS subscriptions to be recorded by the daq, now has to be specified *after* all other front end ini files.  It's unclear why, but it has something to do with RTS 2.8 which changed all slow channels to be transported over the mx network.  This alone did not fix the problem, though.

Increase second trend frame size.  Interestingly, this might have been the key.  The second trend frame size was increased to 600 seconds:

start trender 600 60;

The two numbers are the lengths in seconds for the second and minute trends respectively.  They had been set to "60 60", but Keith suggested that longer second trend frames are better, for whatever reason.  It seems he may be right, given that daqd has been running and writing full and trend frames for 1.5 hours now without issue. 


As I'm writing this, though, daqd just crashed again.  I note, though, that it's right after the hour, and immediately following writing out a one-hour minute trend file.  We've been seeing these hourly, on-the-hour crashes of daqd for quite a while now.  So maybe this is nothing new.  I've actually been wondering if the hourly daqd crashes were associated with writing out the minute trend frames, and I think we might have more evidence to point to that.

If increasing the size of the second trend frames from 60 seconds (35M) to 600 seconds (70M) made a difference in stability, could there be an issue with writing out files that are smaller than some size?  The full frames are 60M, and the minute trends are 35M.

  11427   Sat Jul 18 15:37:19 2015   Jamie   Summary   CDS   CDS upgrade: current status

So it appears we have found a semi-stable configuration for the DAQ system post upgrade:

Here are the issues:

daqd

daqd is running mostly stably for the moment, although it still crashes at the top of every hour (see below).  Here are some relevant points about the current configuration:

  • recording data from only a subset of front-ends, to reduce the overall load:
    • c1x01
    • c1scx
    • c1x02
    • c1sus
    • c1mcs
    • c1pem
    • c1x04
    • c1lsc
    • c1ass
    • c1x05
    • c1scy
  • 16 second main buffer:
    start main 16;
  • trend lengths: second: 600, minute: 60
    start trender 600 60;
  • writing to frames:
    • full
    • second
    • minute
    • (NOT raw minute trends)
  • frame compression ON

This eliminates most of the random daqd crashing.  However, daqd still crashes at the top of every hour after writing out the minute trend frame. Still unclear what the issue is, but Keith is investigating.  In some sense this is no worse than where we were before the upgrade, since daqd was also crashing hourly then.  It's still crappy, though, so hopefully we'll figure something out.

The inittab on fb automatically restarts daqd after it crashes, and monit on all of the front ends automatically restarts the mx_stream processes.
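(Roughly, the two watchdog pieces look like the sketches below; the exact entries on fb and the front ends may differ:)

# /etc/inittab on fb -- respawn daqd whenever it exits (id and runlevels assumed)
daqd:345:respawn:/opt/rtcds/caltech/c1/target/fb/daqd -c /opt/rtcds/caltech/c1/target/fb/daqdrc

# monit stanza on a front end -- restart mx_stream if its pid disappears (paths assumed)
check process mx_stream with pidfile /var/log/mx_stream0.pid
    start program = "/etc/init.d/mx_stream start"
    stop program  = "/etc/init.d/mx_stream stop"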

front ends

The front end models are mostly running fine.

One issue is that the execution times seem to have increased a bit, which is problematic for models that were already on the hairy edge.  For instance, the rough average for c1sus has gone from ~48us to 50us.  This is most problematic for c1cal, which is now running at ~66us out of 60, which is obviously untenable.  We'll need to reduce the load in c1cal somehow.

All other front end models seem to be working fine, but a full test is still needed.

There was an issue with the DACs on c1sus, but I rebooted and everything came up fine; the optics are now damped.

  11428   Sat Jul 18 16:03:00 2015   jamie   Update   CDS   EPICS freezes persist

I notice that the periodic EPICS freezes persist.  They last for 5-10 seconds.  MEDM completely freezes up, but then it comes back.

The sites have been noticing similar issues on a less dramatic scale.  Maybe we can learn from whatever they figure out.

  11429   Sat Jul 18 16:59:01 2015   jamie   Update   CDS   unloaded, turned off loading of, symmetricom kernel module on fb

fb has been loading a 'symmetricom' kernel module, presumably because it was once being used to help with timing.  It's no longer needed, so I unloaded it and commented out the lines that loaded it in /etc/conf.d/local.start.
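(i.e., roughly the following; the exact load line in local.start is paraphrased:)

sudo rmmod symmetricom                     # unload the now-unused timing driver
# then comment out the symmetricom load line in /etc/conf.d/local.start
lsmod | grep symmetricom || echo "symmetricom no longer loaded"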

  11435   Wed Jul 22 14:10:04 2015   ericq   Update   CDS   RCG Diagnostics

Now that we seem to have landed on a working configuration, I've re-run the tests first described in ELOG 11347. I've also compared the real filtering of filter modules to their designs. 

TL;DR: No adverse, or even observable, differences have been witnessed.

As a reminder: In c1tst, there are three loops, called LOOPA, LOOPB, and LOOPC.

  • LOOPA is a filter module feeding back onto its own input, with a unit time delay block
  • LOOPB is an FM whose output goes to the DAC. In meatspace, the AI output is hooked up directly to an AA chassis input, and back to the FM in CDS
  • LOOPC includes RFM connections to c1rfm and back again. 

Here are the loop delay results, which measure the slope of the phase response of the OLTF. For the purely digital loops (A, C), we know the number of cycles we expect to compare the delay to.

At this time, I haven't done the adding up of cycles, zero-order-holds, etc. to get the delay we expect from the analog loop (B), so I've just looked at whether it changed at all. 
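For reference, the standard conversion from the measured phase slope to a delay (and to 16 kHz cycles) is

\tau = -\frac{1}{360^\circ}\,\frac{d\phi}{df}\,, \qquad N_\mathrm{cycles} = \tau\, f_s \quad (f_s = 16384\ \mathrm{Hz},\ \text{so one cycle} \approx 61.04\ \mu\mathrm{s})

with \phi in degrees.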

Anyways, I've attached the code that analyzes data from DTT-exported text files containing the continuous phase data from the loop measurements. 

Before:

Single Model loop cycles: 1.0000000+/-0.0000006, disparity: -0.00+/-0.25 nsec
2 Model RFM loop cycles: 1.9999999+/-0.0000013, disparity: 0.0+/-0.5 nsec
ADC->DAC loop time: 338.2+/-0.4 usec

After:

Single Model loop cycles: 0.9999999+/-0.0000008, disparity: 0.02+/-0.29 nsec
2 Model RFM loop cycles: 2.0000001+/-0.0000011, disparity: -0.0+/-0.4 nsec
ADC->DAC loop time: 338.18+/-0.35 usec

So, the digital loops take the number of cycles we expect, and there are no real differences after the upgrade. 


Additionally, for all three loops, I created a simple 100:10 filter in foton, and injected broadband noise with awggui, to measure the real TF applied by the FM code. I want to turn this whole process into a single script that will switch the filter on and off, read the foton file, and compare the measured TF to the ideal shape. 

In our system, before and after the upgrade, all three loops showed no appreciable difference from the designed filter shape, other than some tiny uptick in phase when approaching the Nyquist frequency. This may be due to the fact that I'm comparing to the ideal analog filter, rather than what a 16kHz digital filter looks like. 

What I've plotted below is the deviation from the ideal zpk(100Hz, 10Hz, 0.1) frequency response, i.e. Hmeasured / Hideal. The code to do this analysis is also attached; it estimates the TF by dividing the CSD of the filter input and output by the PSD of the input. The single worst coherence in any bin of all the measurements is 0.997, so I didn't really bother to estimate the error of the TF estimate. 
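(The estimator referred to is the standard one,

\hat{H}(f) = \frac{P_{xy}(f)}{P_{xx}(f)}\,,

where x is the filter input, y the filter output, P_{xy} their cross-spectral density, and P_{xx} the input power spectral density.)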

Attachment 1: filterShapes.png
filterShapes.png
Attachment 2: CDS_diag_codes.zip
  11461   Wed Jul 29 21:40:39 2015   Koji   Summary   CDS   Status of the frame data syncing

The trend data hasn't been synced with LDAS since Jul 27 5AM local.

40m:

controls@nodus|minute > pwd
/frames/trend/minute
controls@nodus|minute > ls -l 11222 | tail
total 590432
-rw-r--r-- 1 controls controls 35758781 Jul 29 11:59 C-M-1122228000-3600.gwf
-rw-r--r-- 1 controls controls 35501472 Jul 29 12:59 C-M-1122231600-3600.gwf
-rw-r--r-- 1 controls controls 35296271 Jul 29 13:59 C-M-1122235200-3600.gwf
-rw-r--r-- 1 controls controls 35459901 Jul 29 14:59 C-M-1122238800-3600.gwf
-rw-r--r-- 1 controls controls 35550346 Jul 29 15:59 C-M-1122242400-3600.gwf
-rw-r--r-- 1 controls controls 35699944 Jul 29 16:59 C-M-1122246000-3600.gwf
-rw-r--r-- 1 controls controls 35549480 Jul 29 17:59 C-M-1122249600-3600.gwf
-rw-r--r-- 1 controls controls 35481070 Jul 29 18:59 C-M-1122253200-3600.gwf
-rw-r--r-- 1 controls controls 35518238 Jul 29 19:59 C-M-1122256800-3600.gwf
-rw-r--r-- 1 controls controls 35514930 Jul 29 20:59 C-M-1122260400-3600.gwf

 

LDAS Minute trend:

[koji.arai@ldas-pcdev3 C-M-11]$ pwd
/archive/frames/trend/minute-trend/40m/C-M-11
[koji.arai@ldas-pcdev3 C-M-11]$ ls -l | tail
-rw-r--r-- 1 1001 1001 35488497 Jul 26 19:59 C-M-1121997600-3600.gwf
-rw-r--r-- 1 1001 1001 35477333 Jul 26 21:00 C-M-1122001200-3600.gwf
-rw-r--r-- 1 1001 1001 35498259 Jul 26 21:59 C-M-1122004800-3600.gwf
-rw-r--r-- 1 1001 1001 35509729 Jul 26 22:59 C-M-1122008400-3600.gwf
-rw-r--r-- 1 1001 1001 35472432 Jul 26 23:59 C-M-1122012000-3600.gwf
-rw-r--r-- 1 1001 1001 35472230 Jul 27 00:59 C-M-1122015600-3600.gwf
-rw-r--r-- 1 1001 1001 35468199 Jul 27 01:59 C-M-1122019200-3600.gwf
-rw-r--r-- 1 1001 1001 35461729 Jul 27 02:59 C-M-1122022800-3600.gwf
-rw-r--r-- 1 1001 1001 35486755 Jul 27 03:59 C-M-1122026400-3600.gwf
-rw-r--r-- 1 1001 1001 35467084 Jul 27 04:59 C-M-1122030000-3600.gwf
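
For reference, this check can be scripted; a minimal sketch (the path is hypothetical -- point it at the local minute-trend directory or at the LDAS copy and compare the newest GPS start times):

import glob, re, time

# Hypothetical path: point this at /frames/trend/minute/<dir> here, or at
# the LDAS archive copy, and compare the newest GPS start times.
frames = glob.glob('/frames/trend/minute/11222/C-M-*-3600.gwf')
gps = max(int(re.search(r'C-M-(\d+)-3600', f).group(1)) for f in frames)

# GPS epoch is 1980-01-06 00:00:00 UTC; the ~17 s leap-second offset is
# ignored here, which doesn't matter at the hours level.
age_hr = (time.time() - (gps + 315964800)) / 3600.0
print('newest trend frame starts at GPS %d, ~%.1f hours ago' % (gps, age_hr))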

 

  11465   Thu Jul 30 11:47:54 2015 KojiSummaryCDSStatus of the frame data syncing

Today it was synced at 5AM but that was all.

40m:

controls@nodus|minute > pwd
/frames/trend/minute
controls@nodus|minute > ls -l 11222|tail
-rw-r--r-- 1 controls controls 35521183 Jul 29 21:59 C-M-1122264000-3600.gwf
-rw-r--r-- 1 controls controls 35509281 Jul 29 22:59 C-M-1122267600-3600.gwf
-rw-r--r-- 1 controls controls 35511705 Jul 29 23:59 C-M-1122271200-3600.gwf
-rw-r--r-- 1 controls controls 35809690 Jul 30 00:59 C-M-1122274800-3600.gwf
-rw-r--r-- 1 controls controls 35752082 Jul 30 01:59 C-M-1122278400-3600.gwf
-rw-r--r-- 1 controls controls 35927246 Jul 30 02:59 C-M-1122282000-3600.gwf
-rw-r--r-- 1 controls controls 35775843 Jul 30 03:59 C-M-1122285600-3600.gwf
-rw-r--r-- 1 controls controls 35648583 Jul 30 04:59 C-M-1122289200-3600.gwf
-rw-r--r-- 1 controls controls 35643898 Jul 30 05:59 C-M-1122292800-3600.gwf
-rw-r--r-- 1 controls controls 35704049 Jul 30 06:59 C-M-1122296400-3600.gwf
controls@nodus|minute > ls -l 11223|tail
total 139616
-rw-r--r-- 1 controls controls 35696854 Jul 30 08:02 C-M-1122300000-3600.gwf
-rw-r--r-- 1 controls controls 35675136 Jul 30 08:59 C-M-1122303600-3600.gwf
-rw-r--r-- 1 controls controls 35701754 Jul 30 09:59 C-M-1122307200-3600.gwf
-rw-r--r-- 1 controls controls 35718038 Jul 30 10:59 C-M-1122310800-3600.gwf

LDAS Minute trend:

[koji.arai@ldas-pcdev3 C-M-11]$ pwd
/archive/frames/trend/minute-trend/40m/C-M-11
[koji.arai@ldas-pcdev3 C-M-11]$ ls -l |tail
-rw-r--r-- 1 1001 1001 35518238 Jul 29 19:59 C-M-1122256800-3600.gwf
-rw-r--r-- 1 1001 1001 35514930 Jul 29 20:59 C-M-1122260400-3600.gwf
-rw-r--r-- 1 1001 1001 35521183 Jul 29 21:59 C-M-1122264000-3600.gwf
-rw-r--r-- 1 1001 1001 35509281 Jul 29 22:59 C-M-1122267600-3600.gwf
-rw-r--r-- 1 1001 1001 35511705 Jul 29 23:59 C-M-1122271200-3600.gwf
-rw-r--r-- 1 1001 1001 35809690 Jul 30 00:59 C-M-1122274800-3600.gwf
-rw-r--r-- 1 1001 1001 35752082 Jul 30 01:59 C-M-1122278400-3600.gwf
-rw-r--r-- 1 1001 1001 35927246 Jul 30 02:59 C-M-1122282000-3600.gwf
-rw-r--r-- 1 1001 1001 35775843 Jul 30 03:59 C-M-1122285600-3600.gwf
-rw-r--r-- 1 1001 1001 35648583 Jul 30 04:59 C-M-1122289200-3600.gwf

  11479   Wed Aug 5 10:56:07 2015 ericqUpdateCDSMany models crashed

Last night around 1AM, many of the frontend models crashed due to an ADC timeout. (None of the IOPs crashed, and all the c1lsc models were fine.)

 
First, on c1sus (Wed Aug  5 00:56:46 PDT 2015)
[1502036.695639] c1rfm: ADC TIMEOUT 0 46281 9 46153
[1502036.945259] c1pem: ADC TIMEOUT 0 56631 55 56695
[1502036.965969] c1mcs: ADC TIMEOUT 1 56706 2 56770
[1502036.965971] c1sus: ADC TIMEOUT 1 56706 2 56770

Then, simultaneously on c1ioo, c1iscex, and c1iscey. (Wed Aug  5 01:10:53 PDT 2015)

[1509007.391124] c1ioo: ADC TIMEOUT 0 46329 57 46201
[1509007.702792] c1als: ADC TIMEOUT 1 63128 24 63192

[2448096.252002] c1scx: ADC TIMEOUT 0 46293 21 46165
[2448096.258001] c1asx: ADC TIMEOUT 0 46669 13 46541

[1674945.583003] c1scy: ADC TIMEOUT 0 46297 25 46169
[1674945.685002] c1tst: ADC TIMEOUT 0 52993 1 52865
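
The bracketed numbers are kernel seconds-since-boot; a quick sketch for turning them into the wall-clock times quoted above (run on the frontend in question):

import datetime

# dmesg timestamps are seconds since boot; /proc/uptime gives the current
# uptime, so boot time = now - uptime.
with open('/proc/uptime') as fh:
    uptime = float(fh.read().split()[0])
boot = datetime.datetime.now() - datetime.timedelta(seconds=uptime)

for stamp in (1502036.695639, 1509007.391124):    # values from the lines above
    print(stamp, boot + datetime.timedelta(seconds=stamp))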

I'm still working on getting things back up and running. Just restarting models wasn't working, so I'm trying some soft reboots...


UPDATE: A soft reboot of all frontends seems to have worked.

Attachment 1: crashes.png
crashes.png
  11551   Tue Sep 1 02:44:44 2015 KojiSummaryCDSc1oaf, c1mcs modified for the IMC angular FF

[Koji, Ignacio]

In order to allow us to work on the IMC angular FF, we made the signal paths from PEM to MC SUSs.
There were already paths from c1pem to c1oaf, so the new paths were made from c1oaf to c1mcs (Attachments 1-3).

After some debugging, both models started running; the additional processing time is insignificant.
FB was restarted to accommodate the change.

Once the model modifications were complete, the OAF screens were updated. It seems that the Kissel button
for the output matrix had not been updated for the PRM ASC implementation; that has now been fixed.
In addition, a button for the FM matrix was made and pasted in.

 

Attachment 1: c1oaf_screenshot1.png
c1oaf_screenshot1.png
Attachment 2: c1oaf_screenshot2.png
c1oaf_screenshot2.png
Attachment 3: c1mcs_screenshot.png
c1mcs_screenshot.png
Attachment 4: OAF_MEDM1.png
OAF_MEDM1.png
Attachment 5: OAF_MEDM2.png
OAF_MEDM2.png
  11564   Thu Sep 3 02:12:08 2015 ranaUpdateCDSSimulink Webview updated

Back in 2011, JoeB wrote some entries on how to automatically update the Simulink webview stuff.

Somehow, the cron job broke down over the years. I reran the MATLAB file by hand today and it worked fine, so now you can see the up-to-date models over the internet.

https://nodus.ligo.caltech.edu:30889/FE/

  11565   Thu Sep 3 02:30:46 2015 ranaUpdateCDSc1cal time reduced by deleting LSC sensing matrix

I experimented with removing some things here and there to reduce the c1cal runtime. Eventually I deleted the LSC Sensing Matrix from it.

  • Ever since the upgrade, c1cal has gone from 60 to 68 usec run time, so it's constantly over.
  • Back when Jenne set it up back in Oct 2013, it was running at 39 usec.
  • The purely CAL stuff had some wacko, impossible filters in it: please don't try to invert the AA filters, making a filter with multiple zeros in it, Masayuki.
  • I removed the weird / impossible / unstable filters.
  • I'm guessing that the sensing matrix code had some hand-rolled C-code blocks which are just not very speedy, so we need to rethink how to do the lockin / oscillator stuff so that it doesn't overload the CPU. I bet it's somewhere in the weird way the I/Q signals were untangled. My suggestion is to change this stuff to use the standard CDS lockin modules and just record the I/Q outputs; we don't need to make magnitude and phase in the front end (a sketch of the offline version is below this list).

After removing the sensing matrix, the run time is now down to 6 usec.
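
As a rough sketch of what the offline calculation could look like (the numbers here are made up, standing in for recorded lockin I/Q channels):

import numpy as np

# Stand-in numbers for the recorded lockin outputs (I/Q) of one sensing
# matrix element; in practice these would be slow channels fetched from
# frames/NDS after the fact.
I = 3e-3 * np.cos(np.deg2rad(25)) + 1e-5 * np.random.randn(960)
Q = 3e-3 * np.sin(np.deg2rad(25)) + 1e-5 * np.random.randn(960)

# Average first, convert after: the trig stays out of the front end.
mag = np.hypot(I.mean(), Q.mean())
phase = np.degrees(np.arctan2(Q.mean(), I.mean()))
print('element: %.3e counts at %.1f deg' % (mag, phase))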

  11567   Thu Sep 3 13:25:40 2015 ranaUpdateCDSSimulink Webview updated

added the cron script for this to megatron to run at 8:44 AM each morning. Here's the new MegaCron attached :-()-

** it takes ~13 minutes to complete on megatron

Attachment 1: crontab_150903.rtf
MAILTO=ericq@caltech.edu

# m h  dom mon dow   command
#0 */1 * * * bash /home/controls/public_html/summary/bin/c1_summary_page.sh > /dev/null 2>&1
#15 5 * * * /ligo/apps/nds2/nds2-megatron/test-restart

# MEDM Screen caps for the webpage
2,13,25,37,49 * * * * /cvs/cds/project/statScreen/bin/cronjob.sh

# op340m transplants -ericq
... 18 more lines ...
  11570   Fri Sep 4 00:58:29 2015 ranaUpdateCDSsoldering the Generic Pentek interface board

Q and Ignacio were taking a second look at the Pentek interface board, which we're using to acquire the POP QPD, ALS trans, and MCF/MCL channels. It has a differential input, two jumper-able whitening stages inside, and some low-pass filtering.

I noticed that each channel has a 1.5 kHz pole associated with each 150:15 whitening stage. It also has two 2nd-order Butterworth low passes at 800 Hz, and there's an RF filter on the front end. We don't need all that low passing, so I started modifying the filters: tonight I moved the 800 Hz poles to 8000 Hz; tomorrow we'll move the others if Steve can find us enough (>16) 1 nF SMD caps (1206, NPO).

After this those signals ought to have less phase lag and more signal above 1 kHz. Since the ADC is running at 64 kHz, we don't need any analog filtering below 8 kHz.
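
To get a feel for what the pole move buys us, here's a rough sketch of one channel's response before and after (pole/zero values as listed above; the exact topology, gain normalization, and 2nd-order Butterworth prototype are assumptions, and the front-end RF filter is ignored):

import numpy as np
from scipy import signal

def channel(lp_pole_hz):
    # Two whitening stages (zero 15 Hz, pole 150 Hz, extra pole 1.5 kHz each,
    # unity DC gain assumed) plus two 2nd-order Butterworth low passes at
    # lp_pole_hz (topology assumed).
    z = [-2*np.pi*15.0] * 2
    p = [-2*np.pi*150.0] * 2 + [-2*np.pi*1.5e3] * 2
    k = (150.0/15.0)**2 * (2*np.pi*1.5e3)**2
    zb, pb, kb = signal.butter(2, 2*np.pi*lp_pole_hz, 'low', analog=True, output='zpk')
    return z + list(zb), p + list(pb) * 2, k * kb**2

f = np.logspace(0, 4, 2000)
for fc in (800.0, 8000.0):
    w, h = signal.freqs_zpk(*channel(fc), worN=2*np.pi*f)
    phase = np.degrees(np.unwrap(np.angle(h)))
    i = np.argmin(abs(f - 1e3))
    print('LP at %5.0f Hz: at 1 kHz |H| = %5.1f, phase = %7.1f deg'
          % (fc, abs(h[i]), phase[i]))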

  11573   Fri Sep 4 08:00:49 2015 IgnacioUpdateCDSRC low pass circuit (1s stage) of Pentek board

Here are the transfer function and cutoff frequency (pole) of the first-stage low-pass circuit of the Pentek whitening board.

Circuit:

R1 = R2 = 49.9 Ohm, R3 = 50 kOhm, C = 0.01uF. Given a differential voltage of 30 volts, the voltage across the 50k resistor should be 29.93 volts.

Transfer Function: 

Given by, 

H(s) = \frac{1.002 \times 10^{6}}{s + 1.002 \times 10^{6}}

So it's a low-pass RC filter with a single pole at 1.002e6 rad/s (about 160 kHz).
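
As a quick cross-check of the numbers (the 50 kOhm load is neglected, which reproduces the 1/(RC) term in the transfer function above):

import math

R1 = R2 = 49.9      # ohm
C = 0.01e-6         # F
# The cap sees R1 + R2 in series; the 50 kOhm load is neglected here.
w_pole = 1.0 / ((R1 + R2) * C)                        # rad/s
print('pole: %.4g rad/s  =  %.1f kHz' % (w_pole, w_pole / (2 * math.pi * 1e3)))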

I have updated the schematic with the changes mentioned by Rana, plus some notes; see the DCC link here: [PLACEHOLDER]

I should have done this by hand...crying

Attachment 1: circuit.pdf
circuit.pdf
  11574   Fri Sep 4 09:23:32 2015 IgnacioUpdateCDSModified Pentek schematic

Attached is the modified Pentek whitening board schematic. It includes the yet-to-be-installed 1 nF capacitors and comments. 

Attachment 1: schematic.pdf
schematic.pdf
  11579   Fri Sep 4 20:42:14 2015 gautam, ranaUpdateCDSCheckout of the Wenzel dividers

Some years ago I bought some dividers from Wenzel. For each arm, we have a x256 and a x64 divider. Wired in series, that means we can divide each IR beat by 2^14.

The highest frequency we can read in our digital system is ~8100 Hz. This corresponds to an RF frequency of ~132 MHz (8100 Hz x 2^14 ≈ 133 MHz), which is as much as the BBPD can go, but less than the fiber PDs.

Today we checked them out:

  1. They run on +15V power.
  2. For low RF frequencies (< 40 MHz) the signal level can be as low as -25 dBm.
  3. For frequencies up to 130 MHz, the signal should be > 0 dBm.
  4. In all cases, we get a square wave going from 0 ~ 2.5 V, so the limiter inside keeps the output amplitude roughly fixed at a high level.
  5. When the RF amplitude goes below the minimum, the output gets shaky and eventually drops to 0 V.

Since this seems promising, we're going to make a box on Monday to package both of these. There will be one SMA input and output per channel.

Each channel will have an amplifier, since this need not be a low-noise channel. The ZKL-1R5 seems like a good choice to me: G = 40 dB and +15 dBm output.

Then Gautam will make a frequency counter module in the RCG which can count with square waves and not care about the wiggles in the waveform.

I think this ought to do the trick for our Coarse frequency discriminator. Then our Delay Box ought to be able to have a few MHz range and do all of the Fast ALS Carm that we need.
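
As a rough offline illustration of the counting idea (not the RCG code; the ADC rate and beat frequency here are made up):

import numpy as np

fs = 65536.0                         # ADC rate [Hz], assumed
f_beat = 100e6                       # hypothetical IR beat frequency
f_div = f_beat / 2**14               # what the divider chain hands over (~6.1 kHz)

# Stand-in for one second of the acquired 0-2.5 V square wave:
t = np.arange(int(fs)) / fs
x = 1.25 * (1 + np.sign(np.sin(2 * np.pi * f_div * t)))

# Count rising edges through a single threshold; for a clean square wave
# that's all it takes (a real counter would add some hysteresis).
thresh = 1.25
edges = np.sum((x[1:] > thresh) & (x[:-1] <= thresh))
print('counted %d Hz at the divider output  ->  %.1f MHz at RF'
      % (edges, edges * 2**14 / 1e6))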

Attachment 1: TEK00000.PNG
TEK00000.PNG
Attachment 2: TEK00001.PNG
TEK00001.PNG
Attachment 3: TEK00002.PNG
TEK00002.PNG
  11647   Tue Sep 29 03:14:04 2015 gautamUpdateCDSFrequency divider box

Earlier today, the front panels for the 1U chassis I obtained to house the Wenzel dividers + RF amplifiers arrived, which meant I finally had everything needed to complete the assembly. Pictures of the finished arrangement are attached. 

Summary of the arrangement:

  • Two identical channels (RF amplifier + /64 divider + /256 divider), one for each arm
  • The front panels are anodized, and isolated SMA feedthroughs are used 
  • Given the large number of units to be supplied with DC power (2 amplifiers + 4 dividers), I chose to use two D1000217 power regulators. The default configuration takes +-18V as input and outputs regulated +-15V, which was fine for the dividers, but the ZKL-1R5 requires +12V, so I changed resistor R2 in the schematic from 10.7K to 8.451K to accommodate this (see the sketch after this list).
  • The amplifiers and dividers are mounted on a steel plate, which is itself mounted on the chassis via insulating posts. 
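
Sanity check on the R2 change (the regulator topology is an assumption -- LM317/LT1085-style adjustable, Vout = Vref*(1 + R2/R1) with Vref = 1.25 V and a guessed standard R1 of 976 Ohm; check against the actual D1000217 schematic):

# Assumed adjustable-regulator relation: Vout = Vref * (1 + R2/R1).
Vref, R1 = 1.25, 976.0               # R1 is a guess at a standard value
for R2 in (10.7e3, 8.451e3):
    print('R2 = %6.0f ohm  ->  Vout = %.2f V' % (R2, Vref * (1 + R2 / R1)))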

Testing:

  • I first verified the power regulator circuitry without hooking up the amplifiers/dividers - with a multimeter, I verified I was getting +15V and +12V as expected.
  • I then connected the amplifiers and dividers, and decided to first check the behaviour of each channel using the Fluke 6061 RF function generator and an oscilloscope. One of the channels (X-arm in the current configuration) worked fine - I got a 0-2.5V square wave as the output for input signals as low as -38dBm at 130MHz (consistent with our earlier observations).
  • The Y-arm channel, however, did not give me any output. To debug the problem, I decided to check the output after the amplifier first. The amplifier does not seem to be working for this channel - I get the same amplitude at the output as at the input. I verified with a multimeter that the correct DC supply voltage of +12V was being delivered, but I am not sure how to debug this further. The amplifier is basically straight out of the box, and as far as I can tell I have not done anything to damage it; this was the first time I connected it to anything, and I repeated the same steps on the Y-arm as on the X-arm, which works fine.
  • The rest of the Y-arm signal chain was verified to be working by bypassing the amplifier stage (the attached photographs show the box in this configuration). There seem to be no issues with the divider part of the signal chain. 

Once I figure out the problem with this amplifier or replace it, the box will be ready to be installed. 

 

Attachment 1: IMG_0014.JPG
IMG_0014.JPG
Attachment 2: IMG_0015.JPG
IMG_0015.JPG
ELOG V3.1.3-