40m Log
ID   Date   Author   Type   Category   Subject
  6171   Wed Jan 4 16:40:52 2012   Jamie   Update   Computers   front-end fb communication restored

Communication between the front end models and the framebuilder has been restored.  I'm not sure exactly what the issue was, but rebuilding the framebuilder daqd executable and restarting seems to have fixed it.

I suspect that the problem might have had to do with how I left things after the last attempt to upgrade to RCG 2.4.  Maybe the daqd that was running was linked against some library that I accidentally moved after starting the daqd process.  It would have kept running fine as it was, but if the process died and had to be started again, its broken linking might have kept it from running correctly.  I don't have any other explanation.

It turns out this was not (as best I can tell) related to the new year time sync issues that were seen at the sites.

  6173   Thu Jan 5 09:59:27 2012   Jamie   Update   CDS   RTS/RCG/DAQ UPGRADE TO COMMENCE

RTS/RCG/DAQ UPGRADE TO COMMENCE

I will be attempting (again) to upgrade the RTS, including the RCG and the daqd, to version 2.4 today.  The RTS will be offline until further notice.

  6174   Thu Jan 5 20:40:21 2012   Jamie   Update   CDS   RTS upgrade aborted; restored to previous settings; fb symmetricom card failing?

After running into more problems with the upgrade, I eventually decided to abort today's upgrade attempt and revert to where we were this morning (RTS 2.1).  I'll try to follow this with a fuller report explaining what problems I encountered when attempting the upgrade.

However, while Alex and I were trying to figure out what was going wrong in the upgrade, the fb symmetricom card appears to have lost the ability to sync with the GPS receiver.  When the symmetricom module is loaded, dmesg shows the following:

[  285.591880] Symmetricom GPS card on bus 6; device 0
[  285.591887] PIC BASE 2 address = fc1ff800
[  285.591924] Remapped 0x17e2800
[  285.591932] Current time 947125171s 94264us 800ns 
[  285.591940] Current time 947125171s 94272us 600ns 
[  285.591947] Current time 947125171s 94280us 200ns 
[  285.591955] Current time 947125171s 94287us 700ns 
[  285.591963] Current time 947125171s 94295us 800ns 
[  285.591970] Current time 947125171s 94303us 300ns 
[  285.591978] Current time 947125171s 94310us 800ns 
[  285.591985] Current time 947125171s 94318us 300ns 
[  285.591993] Current time 947125171s 94325us 800ns 
[  285.592001] Current time 947125171s 94333us 900ns 
[  285.592005] Flywheeling, unlocked...

Because of this, the daqd doesn't get the proper timing signal, and consequently is out of sync with the timing from the models.
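
For checking a longer dmesg capture for the same failure, something like the following could help.  This is only a sketch: the regular expression is inferred from the log lines above, and the function name parse_symmetricom is made up for illustration.

```python
import re

# Pattern inferred from the dmesg lines above, e.g.
# "[  285.591932] Current time 947125171s 94264us 800ns"
TIME_RE = re.compile(r"Current time (\d+)s (\d+)us (\d+)ns")


def parse_symmetricom(dmesg_text):
    """Return (card_times_ns, unlocked) from symmetricom driver dmesg output.

    card_times_ns holds each reported card time as integer nanoseconds.
    unlocked is True if the driver printed its "Flywheeling, unlocked" line.
    """
    times_ns = []
    unlocked = False
    for line in dmesg_text.splitlines():
        match = TIME_RE.search(line)
        if match:
            sec, usec, nsec = (int(group) for group in match.groups())
            times_ns.append(sec * 1_000_000_000 + usec * 1_000 + nsec)
        elif "Flywheeling, unlocked" in line:
            unlocked = True
    return times_ns, unlocked


sample = """\
[  285.591932] Current time 947125171s 94264us 800ns
[  285.591940] Current time 947125171s 94272us 600ns
[  285.592005] Flywheeling, unlocked...
"""
times, unlocked = parse_symmetricom(sample)
print(times[1] - times[0], unlocked)
```

The card clock is still ticking between reads (a few microseconds per line), so the useful flag here is the "unlocked" message rather than the time values themselves.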

It's completely unclear what caused this to happen.  The card seemed to be working all day today, then Alex and I were trying to debug some other (maybe?) timing issues and the symmetricom card all of a sudden stopped syncing to the GPS.  We tried rebooting the frame builder and even tried pulling all the power to the machine, but it never came back up.  We checked the GPS signal itself and, to the extent that we know what that signal is supposed to look like, it looked ok.

I speculate that this is also the cause of the problems we were seeing earlier in the week.  Maybe the symmetricom card has just been acting flaky, and something we did pushed it over the edge.

Anyway, we will try to replace it tomorrow, but Alex is skeptical that we have a replacement of this same card.  There may be a newer Spectracom card we can use, but there may be problems using it on the old Sun hardware that the fb is currently running on.  We'll see.

In the meantime, the daqd is running rogue, off of its own timing.  Surprisingly all of the models are currently showing 0x0 status, which means no problems.  It doesn't seem to be recording any data, though.  Hopefully we'll get it all sorted out tomorrow.

  6176   Fri Jan 6 11:49:13 2012   Jamie   Update   CDS   framebuilder taken offline to diagnose problem with symmetricom timing card

Alex and I have taken the framebuilder offline to try to see what's wrong with the symmetricom card.  We have removed the card from the chassis and Alex has taken it back to Downs to do some more debugging.

We have been formulating some alternate methods to get timing to the fb in case we can't end up getting the card working.

  6177   Fri Jan 6 14:31:54 2012   Jamie   Update   CDS   framebuilder back online, using NTP time synchronization

The framebuilder is back online now, minus its symmetricom GPS card.  The card seems to have failed entirely, and could not be made to work at Downs either.  It has been entirely removed from fb.

As a fall back, the system has been made to work off of the system NTP-based time synchronization.  The latest symmetricom driver, which is part of the RCG 2.4 branch, will fall back to using local time if the GPS synchronization fails.  The new driver was compiled from our local checkout of the 2.4 source in the new to-be-used-in-the-future rtscore directory:

controls@fb ~ 0$ diff {/opt/rtcds/rtscore/branches/branch-2.4/src/drv/symmetricom,/lib/modules/2.6.34.1/kernel/drivers/symmetricom}/symmetricom.ko
controls@fb ~ 0$ 
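
The empty output from diff above means the freshly built module and the installed one are byte-identical.  The same check can be scripted; the following is a minimal sketch using throwaway files in place of the two symmetricom.ko paths (the function name modules_identical is made up here).

```python
import filecmp
import os
import tempfile


def modules_identical(built_path, installed_path):
    """Return True if the two files have byte-identical contents."""
    # shallow=False forces a content comparison instead of trusting os.stat()
    return filecmp.cmp(built_path, installed_path, shallow=False)


# Self-contained demonstration with placeholder files standing in for the
# built and installed kernel modules.
with tempfile.TemporaryDirectory() as tmp:
    built = os.path.join(tmp, "built.ko")
    installed = os.path.join(tmp, "installed.ko")
    for path in (built, installed):
        with open(path, "wb") as handle:
            handle.write(b"placeholder module bytes")
    same = modules_identical(built, installed)
    # Modify one copy and compare again.
    with open(installed, "ab") as handle:
        handle.write(b"!")
    different = modules_identical(built, installed)

print(same, different)
```

On fb the two arguments would be the paths from the diff command above.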

The driver was reloaded.  daqd was also symlinked to the last running stable version and restarted:

controls@fb ~ 0$ ls -al $(readlink -f /opt/rtcds/caltech/c1/target/fb/daqd)
-rwxr-xr-x 1 controls controls 6592694 Dec 15 21:09 /opt/rtcds/caltech/c1/target/fb/daqd.20120104
controls@fb ~ 0$ 

We'll have to keep an eye on the system, to see that it continues to record data properly, and that the fb and the front-ends remain in sync.

The question now is what we do moving forward.  CDS is not supporting the symmetricom cards anymore, and has moved to using Spectracom GPS/IRIG-B cards.  However, Downs has neither at the moment.  Even if we get a new Spectracom card, it might not work in this older Sun hardware, in which case we might need to consider upgrading the framebuilder to a new machine (one supported by CDS).

  6305   Wed Feb 22 16:55:16 2012   Jamie   Update   SUS   wacky state of SUS input matrices

While Kiwamu and I were trying to investigate the vertex glitches, we noticed excess noise in ITMX, which Kiwamu blamed on some sort of bad diagonalization.  Sure enough, the ITMX input matrix is in the default state [0], not a properly diagonalized state.  Looking through the rest of the suspensions, I found PRM also in the default state, not diagonalized.

We should do another round of suspension diagonalization.

Kiwamu (or whoever is here last tonight): please run the free-swing/kick script (/opt/rtcds/caltech/c1/scripts/SUS/freeswing) before you leave, and I'll check the matrices and update the suspensions tomorrow morning.

[0]

0.25 0.25 0.25 0.25 0
1.66 1.66 -1.66 1.66 0
1.66 -1.66 -1.66 1.66 0
0 0 0 0 1

  6467   Thu Mar 29 19:13:56 2012   Jamie   Omnistructure   Computers   Wireless router for GC

I retrieved the newly "secured" router from Junaid.  It had apparently been hooked up to the GC network via its LAN port, but its LAN services had not been shut off.  It was therefore offering a competing DHCP server, which will totally hose a network.  A definite NONO.

The new SSID is "40mWiFi", it's WPA2, and the password is pasted to the bottom of the unit (the unit is back in its original spot on the office computer rack).

  6495   Fri Apr 6 14:39:21 2012   Jamie   Update   Computers   RAID array is rebuilding....

The RAID (JetStor SATA 416S) is indeed resyncing itself after a disk failure.  There is a hot spare, so it's stable for the moment.  But we need a replacement disk:

    RAID disks:  1000.2GB Hitachi HDT721010SLA360

Do we have spares?  If not we should probably buy some, if we can.  We want to try to keep a stock of the same model number.

Other notes:

The RAID has a web interface, but it was for some reason not connected.  I connected it to the martian network at 192.168.113.119.

Viewing the RAID event log on the web interface silences the alarm.

I retrieved the manual from Alex, and placed it in the COMPUTER MANUALS drawer in the filing cabinet.

  6501   Fri Apr 6 20:05:12 2012   Jamie   Summary   General   Laser Emergency Shutoff

We reset the interlock and restarted the PSL.  The end AUX lasers seem to have come back online fine.  PMC and mode cleaner locked back up quickly.

  6540   Tue Apr 17 11:05:04 2012   Jamie   Update   CDS   CDS upgrade in progress

I am continuing to attempt to upgrade the CDS system to RTS 2.5.  Systems will continue to be up and down for the rest of the day.

  6541   Tue Apr 17 19:03:09 2012   Jamie   Update   CDS   CDS upgrade in progress

Upgrade progresses, but not complete.  There are some relatively minor issues, and one potentially big issue.

All new software has been installed, including the new epics that supports long channel names.

I've been doing a LOT of cleanup.  It was REALLY messy in there.

The new framebuilder/daqd code is running on fb.

Models are compiling with the new RCG and I am able to get them running.  Some of them are not compiling for relatively minor reasons (the simulink models need updating).  I'm also running into compile problems with IOPs that are using the dolphin drivers.

The major issue is that the framebuilder and the models are not syncing their timing, so there's no data collection.  I've spoken to Alex and he and Rolf are going to come over tomorrow to sort it out.  It's possible that we're missing timing hardware that the new code is expecting.

There are still some stability issues I haven't sorted out yet, and I have a lot more cleanup to do.

At this rate I'm going to shoot for being done Thursday.

  6542   Wed Apr 18 08:53:50 2012   Jamie   Update   General   Power outage last night

Apparently there was a catastrophic power failure last night.  Bob says it took out power in most of Pasadena.

Bob checked the vacuum system when he got in first thing this morning and everything's back up and checks out.  The laser is still off and most of the front-end computers did not recover.

I'm going to start a boot fest now.  I'll be able to report more once everything is back on.

  6543   Wed Apr 18 10:05:40 2012   Jamie   Update   General   Power outage last night

All of the front-ends are back up and I've been able to recover local control of all of the optics (snapshots from Saturday). Issues:

  • I can't lock the PMC.  Still unclear why.
  • there are no oplev signals in MC1, MC2, and MC3
  • Something is wrong with PRM.  He is very noisy.  Turning on his oplev servo makes him go crazy.
  • There are all sorts of problems with the oplevs in general.  Many of the optics have no oplev settings.  This is probably not related to the power outage.

On a brighter note, ETMX is damped with its new RCG 2.5 controller!  yay!

  6546   Wed Apr 18 19:59:48 2012   Jamie   Update   CDS   CDS upgrade success

The upgrade is nearly complete:

  • new daqd code is running on fb
  • the fe/daqd timing issue was resolved by adjusting the GPS offset in the daqdrc.  I will document this more later.
  • the power outage conveniently rebooted all the front-end machines, so they're all now running new caRepeater
  • all models have been successfully recompiled with RCG 2.5 (with only a couple small glitches)
  • all new models are running on all front-end machines (with a couple exceptions)
  • all suspension models seem to be damping under local control (PRM is having troubles that are likely unrelated to the upgrade).
  • a lot of cleanup has been done

Remaining tasks/issues:

  • more testing OF EVERYTHING needs to be done
  • I did not yet update the DIS dolphin code, so we're running with the old code.  I don't think this is a problem, but it would be nice to get us running what they're running at the sites
  • I tried to clean up/simplify how front-end initialization is done.  However, there is a problem and models are not auto-starting after reboot.  This needs to be fixed.
  • the userapps directory is in a new place (/opt/rtcds/userapps).  Not everything in the old location was checked into the repository, so we need to check to make sure everything that needs to be is checked in, and that all the models are running the right code.
  • the c1oaf model seems to be having a dolphin issue that needs to be sorted
  • the c1gfd model causes c1ioo to crash immediately upon being loaded.  I have removed it from the rtsystab.  That model needs to be fixed.
  • general model cleanup is in order.
  • more front-end cleanup is needed, particularly in regards to boot-up procedure.
  • document the entire upgrade procedure.

I'll finish up these remaining tasks tomorrow.

  6548   Thu Apr 19 08:43:16 2012   Jamie   Update   CDS   oaf

Quote:

Edit: Old version (~september) of the code and oaf model is running now. In the 2.1 code there was a link from src/epics/simLink to oaf code for each DOF. It seems that 2.5 version finds models and c codes in standard directories. I need to move working code to the proper directory.

 Yes, things have changed.  Please wait until I have things cleaned up before working on models.  I'll explain what the new setup is.

  6552   Fri Apr 20 19:54:57 2012   Jamie   Update   CDS   CDS upgrade problems

I ran into a couple of snags today.

A big one is that the framebuilder daqd started going haywire when I told it to start writing frames.  After restart the logs started showing this:

[Fri Apr 20 17:23:40 2012] main profiler warning: 0 empty blocks in the buffer
[Fri Apr 20 17:23:41 2012] main profiler warning: 0 empty blocks in the buffer
[Fri Apr 20 17:23:42 2012] main profiler warning: 0 empty blocks in the buffer
[Fri Apr 20 17:23:43 2012] main profiler warning: 0 empty blocks in the buffer
[Fri Apr 20 17:23:44 2012] main profiler warning: 0 empty blocks in the buffer
[Fri Apr 20 17:23:45 2012] main profiler warning: 0 empty blocks in the buffer
GPS time jumped from 1019002442 to 1019003041
FATAL: exception not rethrown
FATAL: exception not rethrown
FATAL: exception not rethrown

and the network seemed like it started to get really slow.  I wasn't able to figure out what was going on, so I shut the frame writing off again.  I'll have to work with Rolf on that next week.
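
For reference, the size of that GPS jump and the corresponding wall-clock times can be worked out with a few lines of Python.  This is a rough sketch, assuming a GPS-UTC offset of 15 leap seconds for April 2012; the helper name gps_to_utc is just for illustration.

```python
from datetime import datetime, timedelta

GPS_EPOCH = datetime(1980, 1, 6)  # GPS time zero, expressed in UTC
# Assumption: GPS was ahead of UTC by 15 leap seconds in April 2012.
GPS_MINUS_UTC_S = 15


def gps_to_utc(gps_seconds, leap_offset=GPS_MINUS_UTC_S):
    """Convert integer GPS seconds to a naive UTC datetime."""
    return GPS_EPOCH + timedelta(seconds=gps_seconds - leap_offset)


before, after = 1019002442, 1019003041  # values from the daqd log above
jump_s = after - before
print(jump_s, gps_to_utc(before), gps_to_utc(after))
```

Under that assumption the jump is 599 seconds, and the later value lands within a second of the log timestamps above once converted from UTC to Pacific time.  That would suggest daqd's GPS time had been running about ten minutes behind and then snapped forward, though that reading is a guess.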

Another big problem is the workstation application upgrades.  The NDS protocol version has been incremented, which means that all the NDS client applications have to be upgraded.  The new dataviewer is working fine (on pianosa), but dtt is not:

controls@pianosa:~ 0$ diaggui
diaggui: symbol lookup error: /ligo/apps/linux-x86_64/gds-2.15.1/lib/libligogui.so.0: undefined symbol: _ZN18TGScrollBarElement11ShowMembersER16TMemberInspector
controls@pianosa:~ 127$ 

I don't know what's going on here.  All the library paths are ok.  Hopefully I'll be able to figure this out soon.  The old version of dtt definitely does not work with the new setup.
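
The undefined symbol is a C++ mangled name, and decoding it at least shows which method the loader could not find.  The sketch below handles only the simple "_ZN...E" nested-name form and ignores the argument types; it is not a general demangler (c++filt does the full job).

```python
def demangle_nested_name(symbol):
    """Decode the qualified name from a simple Itanium-ABI mangled symbol.

    Handles only the "_ZN<len><name>...E" form; anything after the closing
    "E" (the argument types) is ignored.  Returns e.g. "Outer::method".
    """
    prefix = "_ZN"
    if not symbol.startswith(prefix):
        raise ValueError("not a simple nested mangled name")
    pos = len(prefix)
    parts = []
    while symbol[pos] != "E":
        digits_start = pos
        while symbol[pos].isdigit():
            pos += 1
        if pos == digits_start:
            raise ValueError("unsupported mangling construct")
        length = int(symbol[digits_start:pos])
        parts.append(symbol[pos:pos + length])
        pos += length
    return "::".join(parts)


missing = "_ZN18TGScrollBarElement11ShowMembersER16TMemberInspector"
print(demangle_nested_name(missing))
```

The class names look like they belong to the ROOT GUI toolkit, so a mismatch between the ROOT version gds was built against and the one installed is one possibility.  That is only a guess from the symbol name.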

I might go ahead and upgrade some more of the workstations to Ubuntu in the next couple of days as well, so everything is more on the same page.

I also tried to clean up the front-end boot process, which has its own problems (models won't auto-start).  I haven't figured that out yet either.  It really needs to just be completely overhauled.

  6554   Sat Apr 21 17:38:19 2012   Jamie   Update   CDS   dtt, dataviewer working; problem with trend frames

Quote:

Another big problem is the workstation application upgrades.  The NDS protocol version has been incremented, which means that all the NDS client applications have to be upgraded.  The new dataviewer is working fine (on pianosa), but dtt is not:

controls@pianosa:~ 0$ diaggui
diaggui: symbol lookup error: /ligo/apps/linux-x86_64/gds-2.15.1/lib/libligogui.so.0: undefined symbol: _ZN18TGScrollBarElement11ShowMembersER16TMemberInspector
controls@pianosa:~ 127$

 dtt (diaggui) and dataviewer are now working on pianosa to retrieve realtime data and past data from DQ channels.

Unfortunately it looks like there may be a problem with trend data, though.  If I try to retrieve 1 minute of "full data" with dataviewer for channel C1:SUS-ITMX_SUSPOS_IN1_DQ around GPS 1019089138 everything works fine:

Connecting to NDS Server fb (TCP port 8088)
Connecting.... done
T0=12-04-01-00-17-45; Length=60 (s)
60 seconds of data displayed

but if I specify any trend data (second, minute, etc.) I get the following:

Connecting to NDS Server fb (TCP port 8088)
Connecting.... done
Server error 18: trend data is not available
datasrv: DataWriteTrend failed in daq_send().
T0=12-04-01-00-17-45; Length=60 (s)
No data output.

Alex warned me that this might have happened when I was trying to test the new daqd without first turning off frame writing.

I'm not sure how to check the integrity of the frames, though.  Hopefully they can help sort this out on Monday.

  6555   Sat Apr 21 20:40:28 2012   Jamie   Update   CDS   framebuilder frame writing working again

Quote:

A big one is that the framebuilder daqd started going haywire when I told it to start writing frames.  After restart the logs started showing this:

[Fri Apr 20 17:23:40 2012] main profiler warning: 0 empty blocks in the buffer
[Fri Apr 20 17:23:41 2012] main profiler warning: 0 empty blocks in the buffer
[Fri Apr 20 17:23:42 2012] main profiler warning: 0 empty blocks in the buffer
[Fri Apr 20 17:23:43 2012] main profiler warning: 0 empty blocks in the buffer
[Fri Apr 20 17:23:44 2012] main profiler warning: 0 empty blocks in the buffer
[Fri Apr 20 17:23:45 2012] main profiler warning: 0 empty blocks in the buffer
GPS time jumped from 1019002442 to 1019003041
FATAL: exception not rethrown
FATAL: exception not rethrown
FATAL: exception not rethrown

and the network seemed like it started to get really slow.  I wasn't able to figure out what was going on, so I shut the frame writing off again.  I'll have to work with Rolf on that next week.

So the framebuilder/daqd frame writing issue seems to have been a transient one.  Alex said he tried enabling frame writing manually and it worked fine, so I tried re-enabling the frame writing config lines and sure enough it worked fine.  So it's a mystery.  Alex said the "main profiler warning" lines tend to show up when data is backed up in the buffer, although he didn't explain why exactly that would have been the issue here.

daqdrc frame writing directives:

start frame-saver;
sync frame-saver;
start trender;
start trend-frame-saver;
sync trend-frame-saver;
start minute-trend-frame-saver;
sync minute-trend-frame-saver;
start raw_minute_trend_saver;

  6556   Sat Apr 21 21:10:34 2012   Jamie   Update   CDS   trend frame issue partially resolved

Quote:

Unfortunately it looks like there may be a problem with trend data, though.  If I try to retrieve 1 minute of "full data" with dataviewer for channel C1:SUS-ITMX_SUSPOS_IN1_DQ around GPS 1019089138 everything works fine:

Connecting to NDS Server fb (TCP port 8088)
Connecting.... done
T0=12-04-01-00-17-45; Length=60 (s)
60 seconds of data displayed

but if I specify any trend data (second, minute, etc.) I get the following:

Connecting to NDS Server fb (TCP port 8088)
Connecting.... done
Server error 18: trend data is not available
datasrv: DataWriteTrend failed in daq_send().
T0=12-04-01-00-17-45; Length=60 (s)
No data output.

Alex warned me that this might have happened when I was trying to test the new daqd without first turning off frame writing.

Alex told me that the "trend data is not available" message comes from the "trender" functionality not being enabled in daqd.  After re-enabling it (see #6555) minute trend data was available again.  However, there still seems to be an issue with second trends.  When I try to retrieve second trend data from dataviewer for which minute trend data *is* available I get the following error message:

Connecting to NDS Server fb (TCP port 8088)
Connecting.... done
No data found

read(); errno=9
read(); errno=9
T0=12-04-04-02-14-29; Length=3600 (s)
No data output.

Awaiting more help from Alex...

  6560   Tue Apr 24 14:30:08 2012   Jamie   Update   CDS   Introducing: rtcds, a utility for interacting with the CDS RTS/RCG

The new rtcds utility

I have written a new utility for interacting with the CDS RTS/RCG system: "rtcds".  It should be available on all workstations and front-end machines, but certain commands are restricted to run on certain front end machines (build, start, stop, etc.).  Here's the help:

controls@c1lsc ~ 0$ rtcds help
Usage: rtcds <command> [arg]

Available commands:

  build|make SYS      build model
  install SYS         install model
  start SYS|all       start model
  restart SYS|all     restart running model
  stop|kill SYS|all   stop running model
  list                list all models for current host

controls@c1lsc ~ 0$ 

Please use this new utility from now on when interacting with RTS.

I'll be improving and expanding it as we go.   Please let me know if you encounter any problems.
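
For anyone scripting around rtcds, here is a minimal sketch of a helper that builds the command line and rejects anything not in the help text above.  The helper name rtcds_argv is hypothetical and not part of the utility; it only assembles the argument list and does not run anything.

```python
# Commands and whether they accept "all", as listed in the help text above.
RTCDS_COMMANDS = {
    "build": False, "make": False, "install": False,
    "start": True, "restart": True, "stop": True, "kill": True,
}


def rtcds_argv(command, system=None):
    """Build an argv list for rtcds, validating against the help text."""
    if command == "list":
        if system is not None:
            raise ValueError("'list' takes no argument")
        return ["rtcds", "list"]
    if command not in RTCDS_COMMANDS:
        raise ValueError(f"unknown rtcds command: {command}")
    if not system:
        raise ValueError(f"'{command}' needs a model name")
    if system == "all" and not RTCDS_COMMANDS[command]:
        raise ValueError(f"'{command}' does not accept 'all'")
    return ["rtcds", command, system]


print(" ".join(rtcds_argv("restart", "c1lsc")))
```

The returned list can be handed to subprocess.run() on a machine where rtcds is installed and the command is allowed to run.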

 

  6561   Tue Apr 24 14:35:37 2012   Jamie   Update   CDS   limited second trend lookback

Quote:

Alex told me that the "trend data is not available" message comes from the "trender" functionality not being enabled in daqd.  After re-enabling it (see #6555) minute trend data was available again.  However, there still seems to be an issue with second trends.  When I try to retrieve second trend data from dataviewer for which minute trend data *is* available I get the following error message:

Connecting to NDS Server fb (TCP port 8088)
Connecting.... done
No data found

read(); errno=9
read(); errno=9
T0=12-04-04-02-14-29; Length=3600 (s)
No data output.

Awaiting more help from Alex...

It looks like this is actually just a limit of how long we're saving the second trends, which is just not that long.  I'll look into extending the second trend look-back.

  6562   Tue Apr 24 14:55:00 2012   Jamie   Update   CDS   cds code paths (rtscore, userapps) have moved

NOTE: Unless you really care about what's going on under the hood, please ignore this entire post and go here: USE THE NEW RTCDS UTILITY

Quote:

  • the userapps directory is in a new place (/opt/rtcds/userapps).  Not everything in the old location was checked into the repository, so we need to check to make sure everything that needs to be is checked in, and that all the models are running the right code.

 An important new aspect of this upgrade is that we have some new directories and some of the code paths have moved to correspond with the "standard" LIGO CDS filesystem hierarchy:

  • rtscore:      /opt/rtcds/rtscore/release       -->  RTS RCG source code (svn: adcLigoRTS)
  • userapps: /opt/rtcds/userapps/release  -->  CDS userapps source for models, RCG C code, medm screens, scripts, etc (svn: cds_user_apps)
  • rtbuild:       /opt/rtcds/caltech/c1/rtbuild    -->  RCG build directory

All work should be done in the "userapps" directory, and all builds should be done in the build directory.  Some important points:

WARNING: DO NOT MODIFY ANYTHING IN RTSCORE

This is important.  The rtscore directory is now just where the RCG source code is stored.  WE NO LONGER BUILD MODELS IN THE RTS SOURCE.  Please use the rtbuild directory instead.

NO MORE MODEL/CODE SYMLINKS

You don't need to link your model or code anywhere to get it to compile.  The RCG now uses proper search paths, set in RCG_LIB_PATH, to find the needed source.  This has been configured by the admin, so as long as you put your code in the right place it should be found.
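
To illustrate what a search path buys us over symlinks, here is a toy lookup in the same spirit.  It assumes RCG_LIB_PATH is a PATH-style list of directories, which I have not verified against the RCG source; the directory layout in the demo is a throwaway stand-in for userapps, and find_on_path is a made-up name.

```python
import os
import tempfile


def find_on_path(filename, search_path):
    """Return the first match for filename in an os.pathsep-separated path."""
    for directory in search_path.split(os.pathsep):
        candidate = os.path.join(directory, filename)
        if directory and os.path.isfile(candidate):
            return candidate
    return None


# Demonstration with temporary directories standing in for userapps subtrees.
with tempfile.TemporaryDirectory() as tmp:
    isc_models = os.path.join(tmp, "isc", "c1", "models")
    sus_models = os.path.join(tmp, "sus", "c1", "models")  # hypothetical
    os.makedirs(isc_models)
    os.makedirs(sus_models)
    open(os.path.join(isc_models, "c1lsc.mdl"), "w").close()
    path = os.pathsep.join([sus_models, isc_models])
    found = find_on_path("c1lsc.mdl", path)
    missing = find_on_path("c1xyz.mdl", path)
    found_rel = os.path.relpath(found, tmp)

print(found_rel, missing)
```

The first directory on the path that contains the file wins, so a model only needs to live in one of the configured directories to be picked up.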

ALL CODE/MODELS/ETC GO IN USERAPPS

All RTS code is now stored ENTIRELY in the userapps directory (e.g. /opt/rtcds/userapps/release/isc/c1/models/c1lsc.mdl).  This is more-or-less the same as before, except that symlinking is no longer necessary.  I have placed a symlink at the old location for convenience.

BUILD ALL MODELS IN RTBUILD

You can run "make c1lsc" in the rtbuild directory, just as you used to in the old core directory.  However, don't do that.  USE THE NEW RTCDS UTILITY instead.

  6573   Thu Apr 26 16:35:34 2012   Jamie   Summary   CDS   rosalba now running Ubuntu 10.04

This morning I installed Ubuntu 10.04 on rosalba.  This is the same version that is running on pianosa.  The two machines should be identically configured, although rosalba may be missing some apt-getable packages.

  6574   Thu Apr 26 18:15:59 2012   Jamie   Update   CDS   possible issue with mx_stream on front ends

I'm noticing what appears to be occasional failures of mx_stream on the front end machines.  It doesn't happen that frequently, but I've noticed it a couple of times already since the upgrade.

The symptom is that the DC Status goes to "0xbad" (red) and the "FE NET" goes red for all models on a given front end.

The solution seems to be restarting mx_stream on the given front end:    "sudo /etc/init.d/mx_stream restart"

There is nothing in the mx_stream log:

 controls@c1sus ~ 0$ cat /opt/rtcds/caltech/c1/target/fb/mx_stream_logs/c1sus.log 
 c1x02
 c1sus
 c1mcs
 c1rfm
 c1pem
 mmapped address is 0x7f43740ec000
 mapped at 0x7f43740ec000
 mmapped address is 0x7f43700ec000
 mapped at 0x7f43700ec000
 mmapped address is 0x7f436c0ec000
 mapped at 0x7f436c0ec000
 mmapped address is 0x7f43680ec000
 mapped at 0x7f43680ec000
 mmapped address is 0x7f43640ec000
 mapped at 0x7f43640ec000
 send len = 263596
 Connection Made

but I do see some funny messages in the front end dmesg:

 [200341.317912] DXH Adapter 0 : Heartbeat alive-check for node=12 failed (cnt=8387 state=0x1 deb=0 val=0).
 [200341.318670] DXH Adapter 0 : Session for node 12 is disabled - Status = 0x5
 [200341.319062] Session callback reason=1 status=5 target_node=12
 [200341.319069] Session callback reason=3 status=0 target_node=12
 [200341.359534] (map_table_check_access:752):my id 1 ->  remote id 2 : entry was valid - is now tentatively valid
 [200341.859584] DXH Adapter 0 : Probe failure for node=12 - disabling session probeStatus=0x40000f02
 [200341.860335] DXH Adapter 0 : Session for node 12 is disabled - Status = 0x3
 [200341.860728] Session callback reason=1 status=3 target_node=12
 [200374.006111] DXH Adapter 0 : Set reachable remote node list.
 [200409.020670] DXH Adapter 0 : Set reachable remote node list.
 [200409.021076] DXH Adapter 0 : Session for node 12 is deleted - Status = 0x0
 [200409.021468] Session callback reason=5 status=0 target_node=12
 [200412.362824] (map_table_insert:648):** successfully inserted **(valid unicast) inst 0 node 1->0 fwd 0 fwd_tp 4 egress 0
 [200418.025994] (map_table_check_access:752):my id 1 ->  remote id 0 : entry was valid - is now invalid
 [200418.025998] (map_table_insert:648):** successfully inserted **(valid unicast) inst 0 node 1->2 fwd 0 fwd_tp 4 egress 0
 [200421.743916] Session callback reason=0 status=0 target_node=12
 [200422.073776] DXH Adapter 0 : Set reachable remote node list.
 [200422.342446] Session callback reason=7 status=0 target_node=12
 [200422.342454] DXH Adapter 0 : Session for node 12 is ok.

I'm awaiting feedback from experts.

 

  6576   Thu Apr 26 20:44:23 2012   Jamie   Update   CDS   daq network failure, c1ioo failing to start models

Den tried adding a SINGLE acquire channel to the c1ioo, which for some reason hung c1ioo and took down the entire DAQ network (at least all communication between the front ends and the fb).  We recovered by restarting c1ioo and restarting mx_stream on all the rest of the front-ends.

After "recovering", though, c1ioo is failing to load models, or at least its IOP. Here is the tail of dmesg when trying to start the IOP:

[ 1751.140283] c1x03: Initializing space for daqLib buffers
[ 1751.140284] c1x03: Initializing Network
[ 1751.140285] c1x03: Found 1 frameBuilders on network
[ 1751.250658] CPU 2 is now offline
[ 1751.250657] c1x03: Sync source = 4
[ 1751.250657] c1x03: Waiting for EPICS BURT Restore = 1
[ 1751.310008] c1x03: Waiting for EPICS BURT 0
[ 1751.310008] c1x03: BURT Restore Complete
[ 1751.310008] c1x03: Initialized servo control parameters.
[ 1751.311699] c1x03: DAQ Ex Min/Max = 1 3
[ 1751.311699] c1x03: DAQ XEx Min/Max = 3 53
[ 1751.311733] c1x03: DAQ Tp Min/Max = 10001 10007
[ 1751.311733] c1x03: DAQ XTp Min/Max = 10007 10507
[ 1751.311737] c1x03: DIRECT MEMORY MODE of size 64
[ 1751.311737] c1x03: daqLib DCU_ID = 33
[ 1751.311737] c1x03: Invalid num daq chans = 0
[ 1751.311737] c1x03: DAQ init failed -- exiting

The chan file for this model (/opt/rtcds/caltech/c1/chans/daq/C1X03.ini) looks totally fine, has two un-acquired channels uncommented, and has otherwise not been touched. The C1:FEC-33_MSGDAQ is also reading: "ERROR reading DAQ file!"

I'm at a loss for what is going on. I've tried restarting every CDS process on the machine, restarting the model multiple times, restarting fb, and even restarting the entire c1ioo machine, all to no effect.
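
One sanity check on a chan file is to count the uncommented channel sections independently of the front end.  The sketch below does that with configparser.  The sample text is a hypothetical stand-in: its section and key names are assumptions for illustration, not copied from C1X03.ini.

```python
import configparser

# Hypothetical stand-in for a DAQ channel file; the section and key names
# are assumptions for illustration, not copied from C1X03.ini.
SAMPLE_INI = """\
[default]
acquire=0
datarate=2048

[C1:X03-EXAMPLE_A]
acquire=0

#[C1:X03-EXAMPLE_B]
#acquire=0

[C1:X03-EXAMPLE_C]
acquire=0
"""


def count_channels(ini_text):
    """Return (uncommented, acquired) channel counts for an INI-style file."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case as written
    parser.read_string(ini_text)
    # Lines starting with '#' are treated as comments, so commented-out
    # channels never show up as sections.
    channels = [name for name in parser.sections() if name != "default"]
    acquired = [
        name for name in channels
        if parser.get(name, "acquire", fallback="0") != "0"
    ]
    return len(channels), len(acquired)


print(count_channels(SAMPLE_INI))
```

If a count like this disagrees with what the model reports ("Invalid num daq chans = 0"), that would point at how the file is being read rather than at its contents.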

  6578   Fri Apr 27 09:00:15 2012   Jamie   Update   CDS   sus watchdogs?

Why are all the suspension watchdogs tripped?  None of the suspension models are running on c1ioo, so they should be completely unaffected.  Steve, did you find them tripped, or did you shut them off?

In either event they should be safely turned back on.

  6583   Mon Apr 30 13:58:25 2012   Jamie   Update   CDS   Frame Builder is down

Quote:

Frame builder is down.  PRM has tripped its watch dogs.  I have reset the watch dog on PRM and turned on the OPLEV. It has damped down.  Unable to check what happened since FB is not responding.

There was a minor earthquake yesterday morning which people could feel a few blocks away.  It could have caused the PRM to unlock.

Jamie, Rolf, is it okay for us to restart the FB?

 If it's down it's always ok to restart it.  If it doesn't respond or immediately crashes again after restart then it might require some investigation, but it should always be ok to restart it.

  6591   Tue May 1 08:18:50 2012   Jamie   Update   CDS   Frame Builder is down

Quote:

 

I tried restarting the fb in two different ways.  Neither of them re-established the connection to dtt or epics.

 Please be conscious of what components are doing what.  The problem you were experiencing was not "frame builder down".  It was "dtt not able to connect to frame builder".  Those are potentially completely different things.  If the front end status screens show that the frame builder is fine, then it's probably not the frame builder.

Also "epics" has nothing whatsoever to do with any of this.  That's a completely different set of stuff, unrelated to DTT or the frame builder.

  6622   Tue May 8 09:47:53 2012   Jamie   Update   CDS   biquad filter form

Quote:

I wanted to switch the implementation of IIR_FILTER from DIRECT FORM II to BIQUAD form in C1IOO and C1SUS models. I modified RCG file /opt/rtcds/rtscore/release/src/fe/controller.c by adding #define CORE_BIQUAD line:

#ifdef OVERSAMPLE
#define CORE_BIQUAD      
#if defined(CORE_BIQUAD)

 I am really not ok with anyone modifying controller.c.  If we're going to be messing around with that we need to change procedure significantly.  This is the code that runs all the models, and we don't currently have any way to track changes in the code.

Did you change it back?  If not, do so immediately and stop messing with it.  Please consult with us first before embarking on these kinds of severe changes to our code.  This is the kind of shit that other people have done that has bit us in the ass in the past.

Furthermore, there is already a way to enable biquad filters in the new version without modifying the RCG source.  All you need to do is set biquad=1 in the cdsParameters block for your model.

DO NOT MESS WITH CONTROLLER.C!

  6630   Wed May 9 08:21:42 2012   Jamie   Update   CDS   No signals for DTT from SUS

Quote:

 c1iscey is much less happy - neither the IOP nor the scy model are willing to talk to fb.  I might give up on them after another few minutes, and wait for some daytime support, since I wanted to do DRMI stuff tonight.

Yeah, giving up now on c1iscey (Jamie....ideas are welcome).  I can lock just fine, including the Yarm, I just can't save data or see data about ETMY specifically.  But I can see LSC data, so I can lock, and I can now take spectra of corner optics.

This is the mx_stream issue reported previously.  The symptom is that all models on a single front end lose contact with the frame builder, as opposed to *all* models on all front ends losing contact with the frame builder.  That indicates that the problem is a common fb communication issue on the single front end, and that's all handled with mx_stream.

ssh'ing into c1iscey and running "sudo /etc/init.d/mx_stream restart" fixed the problem.

  6640   Fri May 11 08:07:30 2012 JamieUpdateCDSFB

Quote:

For the second time today, all computers lost their connection to the framebuilder. When I ssh'd into the framebuilder, the DAQD process was not running. I started it

controls@fb ~ 130$ sudo /sbin/init q

Just to be clear, "init q" does not start the framebuilder.  It just tells the init process to reparse /etc/inittab.  And since init is supposed to be configured to restart daqd when it dies, it restarted it after reloading /etc/inittab.  You and Alex must have forgotten to do that after you modified the inittab when you were trying to fix daqd last week.

daqd is known to crash without reason.  It usually just goes unnoticed because init always restarts it automatically.  But we've known about this problem for a while.
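
For reference, a respawn entry in /etc/inittab has the form id:runlevels:action:process.  The line below is only an illustration: the id, the runlevels, and the daqd path and arguments are placeholders, not the actual entry on fb.

```
# /etc/inittab entry format:  id:runlevels:action:process
# "respawn" tells init to restart the process whenever it exits.
# Placeholder values below; the real entry on fb may differ.
daqd:345:respawn:/path/to/daqd <arguments>
```

After editing the file, "init q" makes init re-read it, which is when a new or changed respawn entry takes effect.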

Quote:

But I do not know what causes this problem. Maybe this is a memory issue. For FB

Mem:   7678472k total,  7598368k used,    80104k free

Practically all memory is used. If more is needed and swap is off, DAQD process may die.

This doesn't really mean anything, since Linux always ends up using nearly all available memory for buffers and disk cache.  It doesn't indicate a lack of memory.  If the machine is really running out of memory you would see lots of ugly messages in dmesg.
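
A rough way to see how much of the "used" memory is really just cache is to add up MemFree, Buffers and Cached from /proc/meminfo.  The helper below is a minimal sketch; the function name reclaimable_kb is made up for this example.

```shell
#!/bin/bash
# Sum MemFree + Buffers + Cached (in kB) from a /proc/meminfo-style file.
# This approximates the memory the kernel can hand out without swapping.
# The function name is hypothetical; the default input is /proc/meminfo.
reclaimable_kb() {
    awk '$1 == "MemFree:" || $1 == "Buffers:" || $1 == "Cached:" { sum += $2 }
         END { print sum + 0 }' "${1:-/proc/meminfo}"
}
```

Running "reclaimable_kb" with no argument reads the live /proc/meminfo.  To check for a genuine out-of-memory condition, something like "dmesg | grep -i 'out of memory'" should turn up the kernel's OOM messages if there were any.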

  6657   Tue May 22 11:32:02 2012 JamieUpdateCDSMEDM suspension screens using macro

Very nice, Yuta!  Don't forget to commit your changes to the SVN.  I took the liberty of doing that for you.  I also tweaked the file a bit, so we don't have to specify IFO and SYS, since those aren't going to ever change.  So the arguments are now only: OPTIC=MC1,DCU_ID=36.  I updated the sitemap accordingly.

Yuta, if you could go ahead and modify the calls to these screens in other places that would be great.  The WATCHDOG, LSC_OVERVIEW, MC_ALIGN screens are ones that immediately come to mind.

And also feel free to make cool new ones.  We could try to make simplified versions of the suspension screens now being used at the sites, which are quite nice.

  6658   Tue May 22 11:45:12 2012 JamieConfigurationCDSPlease remember to commit SVN changes

Hey, folks.  Please remember to commit all changes to the SVN in a timely manner.  If you don't, multiple commits will get lumped together and we won't have a good log of the changes we're making.  You might also end up just losing all of your work.  SVN COMMIT when you're done!  But please don't commit broken or untested code.

pianosa:release 0> svn status | grep -v '^?'
M       cds/c1/models/c1rfm.mdl
M       sus/c1/models/c1mcs.mdl
M       sus/c1/models/c1scx.mdl
M       sus/c1/models/c1scy.mdl
M       isc/c1/models/c1lsc.mdl
M       isc/c1/models/c1pem.mdl
M       isc/c1/models/c1ioo.mdl
M       isc/c1/models/ADAPT_XFCODE_MCL.c
M       isc/c1/models/c1oaf.mdl
M       isc/c1/models/c1gcv.mdl
M       isc/common/medm/OAF_OVERVIEW.adl
M       isc/common/medm/OAF_DOF_BLRMS.adl
M       isc/common/medm/OAF_OVERVIEW_BAK.adl
M       isc/common/medm/OAF_ADAPTATION_MICH.adl
pianosa:release 0>
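
To review and commit one file at a time, it can help to pull just the modified paths out of the status listing.  This is a small sketch; modified_paths is a made-up helper name, and it assumes paths without spaces.

```shell
#!/bin/bash
# Read "svn status" output on stdin and print only the paths whose
# first status column is M (locally modified).
# Hypothetical helper; assumes paths contain no whitespace.
modified_paths() {
    awk '$1 == "M" { print $2 }'
}
```

In a working copy this would be used as "svn status | modified_paths", followed by "svn diff <path>" to check each change and "svn commit -m '<log message>' <path>" to commit it.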

  6659   Tue May 22 11:47:43 2012 JamieUpdateCDSMEDM suspension screens using macro

Actually, it looks like we're not quite done here.  All the paths in the SUS_SINGLE screen need to be updated to reflect the move.  We should probably make a macro that points to /opt/rtcds/caltech/c1/screens, and update all the paths accordingly.

  6662   Tue May 22 20:24:06 2012 JamieUpdateComputersrossa is now running Ubuntu 10.04

Now same as pianosa and rosalba.  I'll upgrade allegra on Friday.

  6683   Fri May 25 16:58:54 2012 JamieConfigurationComputers.bashrc for workstations

I have set up a shared .bashrc for all the workstations that is symlinked to the normal location on all machines:

controls@rossa:~ 0$ ls -al /home/controls/.bashrc 
lrwxrwxrwx 1 controls controls 23 2012-05-25 15:37 /home/controls/.bashrc -> /users/controls/.bashrc
controls@rossa:~ 0$ 

This should help simplify maintenance considerably.  Editing that file on one machine will edit it for all.  Just edit this one file!  Don't try to get fancy and add extra files!

I also added a bunch of aliases that had previously been missing.  This should help with some of the problems that people had been having.

NOTE: PLEASE DO NOT CHANGE THE DEFAULT SHELL!  We are using bash, because that's what the sites are now using and we want to be as compatible as possible.

You can of course still write scripts in csh/tcsh or use tcsh in a shell if you wish.   Just don't change the default shell for the controls user.

  6684   Fri May 25 17:50:38 2012 JamieUpdateComputersASS scripts on new ubuntu machines

Quote:

[Den, Yuta]

Background:
 ASS and many other scripts don't work on new ubuntu machines.

What we did:
1. Installed C-shell on rossa and rosalba(Ubuntu machine).
  sudo apt-get install csh

2. Found out that
  /opt/rtcds/caltech/c1/scripts/AutoDither/alignY

runs, but
  /opt/rtcds/caltech/c1/scripts/medmrun /opt/rtcds/caltech/c1/scripts/AutoDither/alignY

doesn't run. It gives us the following error messages.

ezcawrite: error while loading shared libraries: libca.so: cannot open shared object file: No such file or directory
ezcaswitch: error while loading shared libraries: libca.so: cannot open shared object file: No such file or directory

Result:
 ASS scripts run on rossa and rosalba, but not with medmrun.
 At least ASS scripts run on pianosa(ubuntu machine) with medmrun. So we decided to wait for JAMIE to fix it.

Apparently the environment was not being properly inherited by the scripts launched from medmrun.  We modified the medmrun script so that it executes things with an interactive shell ("bash -i -c ...") and this fixed the problem (by ensuring that it sources all the interactive environment configs (i.e. ~/.bashrc)).  I'm still not sure why we were seeing different behavior on pianosa, but at least the solution we have now should be robust.

As a reminder, all scripts launched from MEDM should use medmrun:

/opt/rtcds/caltech/c1/scripts/medmrun
  6685   Fri May 25 17:52:08 2012 JamieUpdateComputersallegra now running Ubuntu 10.04

The last of the control room machines is now upgraded.

  6703   Tue May 29 15:29:16 2012 JamieUpdateComputerslatest pynds installed on all new control room machines

The DASWG lscsoft package repositories have a lot of useful analysis software.  It is all maintained for Debian "squeeze", but it's mostly installable without modification on Ubuntu 10.04 "lucid" (which is based on Debian squeeze).  Basically the only thing needed to access the lscsoft repositories is to add the following repository file:

controls@rossa:~ 0$ cat /etc/apt/sources.list.d/lscsoft.list 
deb http://www.lsc-group.phys.uwm.edu/daswg/download/software/debian/ squeeze contrib
deb-src http://www.lsc-group.phys.uwm.edu/daswg/download/software/debian/ squeeze contrib

deb http://www.lsc-group.phys.uwm.edu/daswg/download/software/debian/ squeeze-proposed contrib
deb-src http://www.lsc-group.phys.uwm.edu/daswg/download/software/debian/ squeeze-proposed contrib
controls@rossa:~ 0$ 

A simple "apt-get update" then makes all the lscsoft packages available.

lscsoft includes the nds2 client packages (nds2-client-lib) and pynds (python-pynds).  Unfortunately the python-pynds debian squeeze package currently depends on libboost-python1.42, which is not available in Ubuntu lucid.  Fortunately, pynds itself does not require the latest version and can use what's in lucid.  I therefore rebuilt the pynds package on one of the control room machines:

$ apt-get install dpkg-dev devscripts debhelper            # these are packages needed to build a debian/ubuntu package
$ apt-get source python-pynds                              # this downloads the source of the package, and prepares it for a package build
$ cd python-pynds-0.7
$ debuild -uc -us                                          # this actually builds the package
$ ls -al ../python-pynds_0.7-lscsoft1+squeeze1_amd64.deb
-rw-r--r-- 1 controls controls 69210 2012-05-29 11:57 python-pynds_0.7-lscsoft1+squeeze1_amd64.deb

I then copied the package into a common place:

/ligo/apps/debs/python-pynds_0.7-lscsoft1+squeeze1_amd64.deb

I then installed it on all the control room machines as such:

$ sudo apt-get install libboost-python1.40.0 nds2-client-lib python-numpy   # these are the dependencies of python-pynds
$ sudo dpkg -i /ligo/apps/debs/python-pynds_0.7-lscsoft1+squeeze1_amd64.deb

I did this on all the control room machines.

It looks like the next version of pynds won't require us to jump through these extra hoops and should "just work".

  6716   Wed May 30 18:08:40 2012 JamieUpdateLSCc1lsc: add error point pick-offs, moved ctrl pick-offs after feedforward

I made some modifications to the c1lsc model in order to extract both the error and control signals.

I added pick-offs for the error signals right before IFO DOF filter modules.  These are then sent with GOTOs to outputs.

I also modified things on the control side.  The OAF stuff was picking off control signals before feedforward in/outs.  After discussing with Jenne we decided that it would make sense for the OAF to be looking at the control signals after feedforward.  It also makes sense to define the control signal after the feedforward.  These control signals are then sent with GOTOs to another set of outputs.

Finally, I moved the triggers to after the control signal pickoffs, and right before the output matrix.  The final chain looks like (see attachment):

input matrix --> power norm --> ERR pickoff --> DOF filters --> FF out --> FF in --> CTRL pickoff --> trigger --> output matrix

The error pickoff outputs in the top level of the model are left terminated for the moment.  Eventually I will be hooking these into the new c1cal calibration model.

The model was recompiled, installed, and restarted.  Everything came up fine.

Attachment 1: LSCchain.png
  6717   Wed May 30 18:16:44 2012 JamieUpdateLSCskeleton of new c1cal calibration model created

[Jamie, Xavi Siemens, Chris Pankow]

We built the skeleton of a new calibration model for the LSC degrees of freedom.  I named it "c1cal".  It will run on the c1lsc FE machine, in CPU slot 4, and has been given DCUID 50.

Right now there's not much in the model, just inputs for DARM_ERR and DARM_CTRL, filters for each input, and the sum of the two channels that is h(t).

Tomorrow we'll extract all the needed signals from c1lsc, and see if we can generate something resembling a calibrated signal for one of the IFO DOFs.

 

  6722   Thu May 31 00:56:13 2012 JamieMetaphysicsComputersPlease remember to check in code changes

I know it's really hard to remember, but our future selves will thank us dearly if we remember to commit all of our code changes to the svn with nice log messages.  At the moment there's a LOT of modified stuff in the userapps working directory that needs to be committed:

controls@pianosa:/opt/rtcds/userapps/release 0$ svn status | grep '^M'
M       cds/c1/models/c1rfm.mdl
M       sus/c1/medm/templates/SUS_SINGLE.adl
M       sus/c1/models/c1mcs.mdl
M       sus/c1/models/c1sus.mdl
M       sus/c1/models/c1scx.mdl
M       sus/c1/models/c1scy.mdl
M       isc/c1/models/c1pem.mdl
M       isc/c1/models/c1ioo.mdl
M       isc/c1/models/ADAPT_XFCODE_MCL.c
M       isc/c1/models/c1oaf.mdl
M       isc/c1/models/c1gcv.mdl
M       isc/common/medm/OAF_OVERVIEW.adl
M       isc/common/medm/OAF_DOF_BLRMS.adl
M       isc/common/medm/OAF_OVERVIEW_BAK.adl
M       isc/common/medm/OAF_ADAPTATION_MICH.adl
controls@pianosa:/opt/rtcds/userapps/release 0$ 

This doesn't even include things that haven't even been added yet.  It doesn't take much time.  Just copy and paste what you elog about the changes.

  6728   Thu May 31 10:31:19 2012 JamieUpdateIOOMC beam spot oscillation

Quote:

This is a common occurrence when diagnostic scripts are written without the ability to handle exceptions (e.g. ctrl-c, terminal gets closed, etc.).

The first thing to do is make sure that the "new" script you are writing doesn't already exist (hint: look in the old scripts directory).

If you are writing a script that touches things in the interferometer, it must always return the settings to the initial state on abnormal termination:

http://linuxdevcenter.com/pub/a/linux/lpt/44_12.html

This is very good advice.  However, "trap" is specific to Bourne-type shells such as bash.  tcsh has a different mechanism that uses a built-in command called "onintr".  Here's a description of the difference.

A couple notes about bash traps:

  • You can give a name instead of a number for the signal.  So instead of trap 'do stuff' 1 you can say trap 'do stuff' SIGHUP
  • The easiest signal to use is EXIT, which covers all your bases (i.e. anything that would cause the script to exit prematurely).
  • You can define a function that gets executed in the trap

So the easiest way to use it is something like the following:

#!/bin/bash

# define cleanup function
function cleanup {
    # do cleanup stuff, like reset EPICS records to defaults
    ....
}

# set the trap on EXIT
trap cleanup EXIT

# the rest of your script below here
...
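
Here is a runnable variant of the skeleton above that saves a setting and restores it on exit.  It is only a sketch: a plain file stands in for an EPICS record, and the names with_restore and SETTING_FILE are made up for the example.

```shell
#!/bin/bash
# Hypothetical stand-in for an EPICS record: a file holding a single value.
SETTING_FILE="${SETTING_FILE:-/tmp/demo_setting}"

# The function body is wrapped in ( ... ), so it runs in a subshell.
# The EXIT trap then fires when that subshell ends, whether the wrapped
# command succeeds or fails.
with_restore() (
    original=$(cat "$SETTING_FILE")                 # remember the initial state
    cleanup() { echo "$original" > "$SETTING_FILE"; }
    trap cleanup EXIT                               # restore on any exit
    echo 99 > "$SETTING_FILE"                       # temporary setting for the test
    "$@"                                            # the actual measurement command
)
```

Calling "with_restore some_command" changes the setting, runs the command, and puts the original value back afterwards, even if the command fails.  In a real script the cat/echo lines would be replaced with whatever reads and writes the channel in question.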
  6734   Thu May 31 22:13:08 2012 JamieUpdateCDSc1lsc: added remaining SHMEM senders for ERR and CTRL, c1oaf model updated appropriately

All the ERR and CTRL outputs in c1lsc now go to SHMEM senders.  I renamed the CTRL output SHMEM senders to be more generic, since they aren't specifically for OAF anymore.  See attached image from c1lsc.

c1oaf was updated so that SHMEM receivers pointed to the newly renamed senders.

c1lsc and c1oaf were rebuilt, installed, and restarted and are now running.

Attachment 1: lsc-shmem-out.png
  6740   Fri Jun 1 09:50:50 2012 JamieUpdateComputersc1sus and c1iscex - bad fb connections

Quote:

Something bad happened to c1sus and c1iscex ~20 min ago.  They both have "0x2bad" 's.  I restarted the daqd on the framebuilder, and then rebooted c1sus, and nothing changed.  The SUS screens are all zeros (the gains seem to be set correctly, but all of the signals are 0's).

If it's not fixed when I get in tomorrow, I'll keep poking at it to make it better.

 This is at least partially related to the mx_stream issue I reported previously.  I restarted mx_stream on c1iscex and that cleared up the models on that machine.

Something else is happening with c1sus.  Restarting mx_stream on c1sus didn't help.  I'll try to fix it when I get over there later.

  6742   Fri Jun 1 14:40:24 2012 JamieUpdateComputersc1sus and c1iscex - bad fb connections

Quote:

This is at least partially related to the mx_stream issue I reported previously.  I restarted mx_stream on c1iscex and that cleared up the models on that machine.

Something else is happening with c1sus.  Restarting mx_stream on c1sus didn't help.  I'll try to fix it when I get over there later.

I managed to recover c1sus.  It required stopping all the models, and then restarting them one-by-one:

$ rtcds stop all     # <-- this does the right thing: it stops all the models with the IOP stopped last, so they will all unload properly.

$ rtcds start iop

$ rtcds start c1sus c1mcs c1rfm

I have no idea why the c1sus models got wedged, or why restarting them in this way fixed the issue.

  6743   Fri Jun 1 14:56:08 2012 JamieUpdateSUSOplevs all different, messed up

For some reason the state of the oplevs is completely different for almost every suspension.  They have different sets of filters in the bank, and different filters engaged.  wtf?  How did this happen?  Is this correct?  Do we expect that the state of the oplevs should be different on all the different suspensions?  I wouldn't have thought so.

I discovered this because the PRM is unstable with the oplevs engaged.  I don't think it was yesterday.  Is something hidden changing the oplev settings?

Attachment 1: oplevs.png
  6755   Tue Jun 5 14:47:28 2012 JamieUpdateCDSnew c1tst model for testing RCG code

I made a new model, c1tst, that we can use for debugging the FREQUENT RCG bugs that we keep encountering.  It's a bare model that runs on c1iscey.  Don't do anything important in here, and don't leave it in some crappy state.  Clean it up when you're done.

  6768   Wed Jun 6 18:04:22 2012 JamieUpdateComputer Scripts / Programshacked ezca tools

Quote:

Currently, ezca tools are flaky and fail too often.
So, I hacked ezca tools just like Yoichi did in 2009 (see elog #1368).

For now,

/ligo/apps/linux-x86_64/gds-2.15.1/bin/ezcaread
/ligo/apps/linux-x86_64/gds-2.15.1/bin/ezcastep
/ligo/apps/linux-x86_64/gds-2.15.1/bin/ezcaswitch
/ligo/apps/linux-x86_64/gds-2.15.1/bin/ezcawrite

are wrapper scripts that repeat the ezca command until it succeeds (or fails more than 5 times).

Of course, this is just a temporary solution to do tonight's work.
To stop this hack, run /users/yuta/scripts/ezhack/stophacking.cmd. To hack, run /users/yuta/scripts/ezhack/starthacking.cmd.

Original binary files are located in /ligo/apps/linux-x86_64/gds-2.15.1/bin/ezcabackup/ directory.
Wrapper scripts live in /users/yuta/scripts/ezhack directory.

I wish I could alias ezca tools to my wrapper scripts so that I don't have to touch the original files. However, alias settings don't work in our scripts.
Do you have any idea?

I didn't like this solution, so I hacked up something else.  I made a new single wrapper script to handle all of the utils.  It then executes the correct command based on the zeroth argument (see below).

I then moved all the binaries to give them .bin suffixes, and then made links to the new wrapper script.  Now everything should work as expected, with this new retry feature.

controls@rosalba:/ligo/apps/linux-x86_64/gds-2.15.1/bin 0$ for pgm in ezcaread ezcawrite ezcaservo ezcastep ezcaswitch; do mv $pgm{,.bin}; ln ezcawrapper $pgm; done
controls@rosalba:/ligo/apps/linux-x86_64/gds-2.15.1/bin 0$ cat ezcawrapper
#!/bin/bash

retries=5

pgm="$0"
run="${pgm}.bin"

if ! [ -e "$run" ] ; then
    cat <<EOF >&2
This is the ezca wrapper script.  It should be hardlinked in place of
the ezca commands (ezcaread, ezcawrite, etc.), and executing the
original binaries (that have been moved to *.bin) with $retries
failure retries.
EOF
    exit -1
fi

if [ $# -eq 0 ] || [[ "$1" == '-h' ]] ; then
    "$run"
    exit
fi

for try in $(seq 1 "$retries") ; do
    if "$run" "$@"; then
        exit
    else
        echo "retrying ($try/$retries)..." >&2
    fi
done
echo "$(basename "$pgm") failed after $retries retries." >&2
exit 1

  6769   Wed Jun 6 18:22:52 2012 JamieUpdateComputer Scripts / Programshacked ezca tools

Quote:

I didn't like this solution, so I hacked up something else.  I made a new single wrapper script to handle all of the utils.  It then executes the correct command based on the zeroth argument (see below).

I then moved all the binaries to give them .bin suffixes, and then made links to the new wrapper script.  Now everything should work as expected, with this new retry feature.

Yuta and I added a feature such that it will not retry if the environment variable EZCA_NORETRY is set, e.g.

$ EZCA_NORETRY=true ezcaread FOOBAR
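
The change to ezcawrapper itself isn't shown here, but the check could look something like the sketch below.  The function name run_with_retry is made up for illustration, and this version treats any non-empty value of EZCA_NORETRY as "set".

```shell
#!/bin/bash
# Sketch only: run a command, retrying on failure unless EZCA_NORETRY
# is set to a non-empty value.  run_with_retry is a hypothetical name.
retries=5

run_with_retry() {
    # No-retry mode: run once and pass the exit status straight through.
    if [ -n "$EZCA_NORETRY" ]; then
        "$@"
        return
    fi
    local try
    for try in $(seq 1 "$retries"); do
        if "$@"; then
            return 0
        fi
        echo "retrying ($try/$retries)..." >&2
    done
    echo "$1 failed after $retries retries." >&2
    return 1
}
```

With this, "run_with_retry some_command" tries up to five times, while "EZCA_NORETRY=true run_with_retry some_command" runs it exactly once and returns its status unchanged.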
