  40m Log, Page 44 of 344
ID   Date   Author   Type   Category   Subject
  6634   Wed May 9 14:32:31 2012   Jenne   Update   CDS   Burt restored

Den and Alex left things not-burt restored, and Den mentioned to me that it might need doing.

I burt restored all of our epics.snaps to the 1am today snapshot.  We lost a few hours of striptool trends on the projector, but now they're back (things like the BLRMS don't work if the filters aren't engaged on the PEM model, so it makes sense).

  6635   Wed May 9 15:02:50 2012   Den   Update   CDS   RFM

Quote:

However, this did not help, C1RFM did not start. I decided to restart all models on C1SUS machine in hope that C1RFM uses some other models and can't connect to them but this suspended C1SUS machine.

 This happened because of the code bug -

// If PCIE comms show errors, may want to add this cache flushing
#if 1
if(ipcInfo[ii].netType == IPCIE)
          clflush_cache_range (&(ipcInfo[ii].pIpcData->dBlock[sendBlock][ipcIndex].data), 16); // & was missing - Alex fixed this
#endif
 

After this bug was fixed and the code was recompiled, C1:OAF_MCL_IN is OK, no errors occur during the transmission C1:OAF-MCL_ERR=0.

So the problem was that the PCIE card could not send this amount of data, and the last channel (MCL is the last) was corrupted. Now that Alex has added cache flushing, the problem is fixed.

We should pay more attention to such problems. This time 2046 out of 2048 points were lost per second. But if only 10-20 points were lost, we would not notice it in dataviewer, and it would still cause problems.
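A minimal sketch of how a per-second loss count of this kind could be computed, assuming each received sample carries its slot number within the second. The function name and the slot-number scheme are illustrative and are not taken from the RCG code.

```python
EXPECTED_PER_SECOND = 2048  # channel rate quoted in this entry


def lost_samples(received_indices, expected=EXPECTED_PER_SECOND):
    """Count sample slots in one second that never arrived.

    received_indices: iterable of integer slot numbers (0 .. expected-1)
    seen during that second. Duplicates and out-of-range values are ignored.
    """
    seen = {i for i in received_indices if 0 <= i < expected}
    return expected - len(seen)


# The failure described above: only 2 of 2048 samples get through.
gross_loss = lost_samples([0, 1024])

# A subtle failure: every 137th sample is dropped (hypothetical pattern).
subtle = [i for i in range(EXPECTED_PER_SECOND) if i % 137 != 0]
subtle_loss = lost_samples(subtle)

print(gross_loss, subtle_loss)
```

The second case loses well under one percent of the data, which is the kind of loss that is easy to miss in a time-series plot but obvious in a numeric counter.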

  6639   Thu May 10 22:05:21 2012   Den   Update   CDS   FB

For the second time today, all computers lost their connection to the framebuilder. When I ssh'd to the framebuilder, the DAQD process was not running. I started it with

controls@fb ~ 130$ sudo /sbin/init q

But I do not know what causes this problem. Maybe it is a memory issue. On FB:

Mem:   7678472k total,  7598368k used,    80104k free

Practically all memory is used. If more is needed and swap is off, the DAQD process may die.

  6640   Fri May 11 08:07:30 2012   Jamie   Update   CDS   FB

Quote:

For the second time today, all computers lost their connection to the framebuilder. When I ssh'd to the framebuilder, the DAQD process was not running. I started it with

controls@fb ~ 130$ sudo /sbin/init q

Just to be clear, "init q" does not start the framebuilder.  It just tells the init process to reparse /etc/inittab.  Since init is supposed to be configured to restart daqd when it dies, it restarted daqd after reloading /etc/inittab.  You and Alex must have forgotten to do that after you modified the inittab when you were trying to fix daqd last week.

daqd is known to crash without reason.  It usually just goes unnoticed because init always restarts it automatically.  But we've known about this problem for a while.

Quote:

But I do not know what causes this problem. Maybe it is a memory issue. On FB:

Mem:   7678472k total,  7598368k used,    80104k free

Practically all memory is used. If more is needed and swap is off, the DAQD process may die.

This doesn't really mean anything, since the computer always ends up using all available memory.  It doesn't indicate a lack of memory.  If the machine is really running out of memory you would see lots of ugly messages in dmesg.
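As a rough illustration of this point, the sketch below adds buffers and page cache back to the free figure, which is how /proc/meminfo-style numbers are commonly read on Linux. Only MemTotal and MemFree come from the output quoted above; the Buffers and Cached values are made up for the example, and the function name is hypothetical.

```python
def reclaimable_kb(meminfo_text):
    """Return MemFree + Buffers + Cached (in kB) from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        parts = value.split()
        if parts:
            fields[key.strip()] = int(parts[0])
    return fields["MemFree"] + fields.get("Buffers", 0) + fields.get("Cached", 0)


# MemTotal and MemFree are from the entry; Buffers and Cached are invented.
SAMPLE = """MemTotal: 7678472 kB
MemFree: 80104 kB
Buffers: 150000 kB
Cached: 6500000 kB
"""

available = reclaimable_kb(SAMPLE)
print(available, "kB could be handed back to processes on demand")
```

With numbers like these, a small "free" value says little about memory pressure, since most of the "used" memory is cache the kernel can drop.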

  6654   Mon May 21 21:27:39 2012   yuta   Update   CDS   MEDM suspension screens using macro

Background:
 We need more organized MEDM screens. Let's use macros.

What I did:
1. Edited /opt/rtcds/userapps/trunk/sus/c1/medm/templates/SUS_SINGLE.adl using the replacements below:

sed -i s/#IFO#SUS_#PART_NAME#/'$(IFO)$(SYS)_$(OPTIC)'/g SUS_SINGLE.adl
sed -i s/#IFO#SUS#_#PART_NAME#/'$(IFO)$(SYS)_$(OPTIC)'/g SUS_SINGLE.adl
sed -i s/#IFO#:FEC-#DCU_ID#/'$(IFO):FEC-$(DCU_ID)'/g SUS_SINGLE.adl
sed -i s/#CHANNEL#/'$(IFO):$(SYS)-$(OPTIC)'/g SUS_SINGLE.adl
sed -i s/#PART_NAME#/'$(OPTIC)'/g SUS_SINGLE.adl

2. Edited sitemap.adl so that they open SUS_SINGLE.adl with arguments like
    IFO=C1,SYS=SUS,OPTIC=MC1,DCU_ID=36
instead of opening ./c1mcs/C1SUS_MC1.adl.

3. I also fixed white blocks in the LOCKIN part.
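The two stages above (template placeholders rewritten to MEDM macros, then macros filled in from the arguments passed by the sitemap) can be sketched as follows. The replacement pairs and the argument values are the ones listed in this entry; the sample channel suffixes and the `expand` helper are hypothetical and only mimic what MEDM does when it opens a screen with macro arguments.

```python
import re

# Same substitutions, in the same order, as the sed commands in step 1.
# The bare #PART_NAME# rule must come last so the longer patterns match first.
SED_RULES = [
    ("#IFO#SUS_#PART_NAME#", "$(IFO)$(SYS)_$(OPTIC)"),
    ("#IFO#SUS#_#PART_NAME#", "$(IFO)$(SYS)_$(OPTIC)"),
    ("#IFO#:FEC-#DCU_ID#", "$(IFO):FEC-$(DCU_ID)"),
    ("#CHANNEL#", "$(IFO):$(SYS)-$(OPTIC)"),
    ("#PART_NAME#", "$(OPTIC)"),
]


def to_macros(text):
    """Rewrite template placeholders into $(NAME) macros."""
    for old, new in SED_RULES:
        text = text.replace(old, new)
    return text


def expand(text, macros):
    """Fill in $(NAME) macros from a dict, as a stand-in for MEDM's expansion."""
    return re.sub(r"\$\((\w+)\)", lambda m: macros[m.group(1)], text)


args = {"IFO": "C1", "SYS": "SUS", "OPTIC": "MC1", "DCU_ID": "36"}

for template in ("#CHANNEL#_SUSPOS_INMON",      # hypothetical channel suffix
                 "#IFO#:FEC-#DCU_ID#_CPU_METER",  # hypothetical channel suffix
                 "#IFO#SUS_#PART_NAME#.adl"):
    print(expand(to_macros(template), args))
```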

Result:
  Now you don't have to generate every suspension screen. Just edit SUS_SINGLE.adl.

Things to do:
 - fix all other MEDM screens that open suspension screens, so that they open SUS_SINGLE.adl
 - make SUS_SINGLE.adl more cool

  6655   Tue May 22 00:23:45 2012   Den   Update   CDS   transmission error monitor

I've started to create channels and an medm screen to monitor the errors that occur during the transmission through the RFM model. The screen will show the amount of lost data per second for each channel.

Not all channels are ready yet. For the channels created so far, the number of errors is 0, which is good.

 Screenshot.png

  6657   Tue May 22 11:32:02 2012   Jamie   Update   CDS   MEDM suspension screens using macro

Very nice, Yuta!  Don't forget to commit your changes to the SVN.  I took the liberty of doing that for you.  I also tweaked the file a bit, so we don't have to specify IFO and SYS, since those aren't going to ever change.  So the arguments are now only: OPTIC=MC1,DCU_ID=36.  I updated the sitemap accordingly.

Yuta, if you could go ahead and modify the calls to these screens in other places that would be great.  The WATCHDOG, LSC_OVERVIEW, MC_ALIGN screens are ones that immediately come to mind.

And also feel free to make cool new ones.  We could try to make a simplified version of the suspension screens now being used at the sites, which are quite nice.

  6658   Tue May 22 11:45:12 2012   Jamie   Configuration   CDS   Please remember to commit SVN changes

Hey, folks.  Please remember to commit all changes to the SVN in a timely manner.  If you don't, multiple commits will get lumped together and we won't have a good log of the changes we're making.  You might also end up just losing all of your work.  SVN COMMIT when you're done!  But please don't commit broken or untested code.

pianosa:release 0> svn status | grep -v '^?'
M       cds/c1/models/c1rfm.mdl
M       sus/c1/models/c1mcs.mdl
M       sus/c1/models/c1scx.mdl
M       sus/c1/models/c1scy.mdl
M       isc/c1/models/c1lsc.mdl
M       isc/c1/models/c1pem.mdl
M       isc/c1/models/c1ioo.mdl
M       isc/c1/models/ADAPT_XFCODE_MCL.c
M       isc/c1/models/c1oaf.mdl
M       isc/c1/models/c1gcv.mdl
M       isc/common/medm/OAF_OVERVIEW.adl
M       isc/common/medm/OAF_DOF_BLRMS.adl
M       isc/common/medm/OAF_OVERVIEW_BAK.adl
M       isc/common/medm/OAF_ADAPTATION_MICH.adl
pianosa:release 0>

  6659   Tue May 22 11:47:43 2012   Jamie   Update   CDS   MEDM suspension screens using macro

Actually, it looks like we're not quite done here.  All the paths in the SUS_SINGLE screen need to be updated to reflect the move.  We should probably make a macro that points to /opt/rtcds/caltech/c1/screens, and update all the paths accordingly.

  6661   Tue May 22 20:01:26 2012   Den   Update   CDS   error monitor

I've created transmission error monitors in rfm, oaf, sus, lsc, scx, scy and ioo models. I tried to get data from every channel transmitted through PCIE and RFM. I also included some shared memory channels.

The medm screen is in the EF STATUS -> TEM. It shows 16384 for the channels that come from simulation plant. Others are 0, that's fine.

  6663   Tue May 22 20:46:38 2012   yuta   Update   CDS   MEDM suspension screens using macro

I fixed the problem Jamie pointed out in elog #6657 and #6659.

What I did:
1. Created the following template files in the /opt/rtcds/userapps/trunk/sus/c1/medm/templates/ directory.

SUS_SINGLE_LOCKIN1.adl
SUS_SINGLE_LOCKIN2.adl
SUS_SINGLE_LOCKIN_INMTRX.adl
SUS_SINGLE_OPTLEV_SERVO.adl
SUS_SINGLE_PITCH.adl
SUS_SINGLE_POSITION.adl
SUS_SINGLE_SUSSIDE.adl
SUS_SINGLE_TO_COIL_MASTER.adl
SUS_SINGLE_COIL.adl
SUS_SINGLE_YAW.adl
SUS_SINGLE_INMATRIX_MASTER.adl
SUS_SINGLE_INPUT.adl
SUS_SINGLE_TO_COIL_X_X.adl
SUS_SINGLE_OPTLEV_IN.adl
SUS_SINGLE_OLMATRIX_MASTER.adl

To open these files, you have to define $(OPTIC) and $(DCU_ID).
For SUS_SINGLE_TO_COIL_X_X.adl, you also have to define $(FILTER_NUMBER). See SUS_SINGLE_TO_COIL_MASTER.adl.

2. Fixed the following screens so that they open SUS_SINGLE.adl.

C1SUS_WATCHDOGS.adl
C1IOO_MC_ALIGN.adl
C1IOO_WFS_MASTER.adl
C1IFO_ALIGN.adl

  6670   Thu May 24 01:17:13 2012   Den   Update   CDS   PMC autolocker

Quote:

 

  • SCRIPT
    • Auto-locker for PMC, PSL things - DEN

 I wrote an auto-locker for the PMC. It is called autolocker_pmc, is located in the scripts directory, and is committed to svn. I connected it to the channel C1:PSL-PMC_LOCK.  It is currently running on rosalba. The MC autolocker runs on op340m, but I could not execute the script on that machine:

op340m:scripts>./autolock_pmc
./autolock_pmc: Stale NFS file handle.

I did several tests. Usually the script locks the PMC in a few seconds. However, if the PMC DC output has drifted significantly, it might take longer, as discussed below.

The algorithm:

       if autolocker is enabled, monitor PSL-PMC_PMCTRANSPD channel
       if TRANS is less than 0.4, start locking:
               disengage PMC servo by enabling PMC TEST 1
               change PSL-PMC_RAMP until TRANS is higher than 0.4 (*)
               engage PMC servo by disabling PMC TEST 1
       else sleep for 1 sec
 

(*) is tricky. If RAMP (DC offset) is specified, then TRANS will oscillate in the range (TRANS_MIN, TRANS_MAX). We are interested only in TRANS_MAX. To make sure we estimate it correctly, the TRANS channel is read 10 times and the maximum value is chosen. This works well.
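The peak estimate described in (*) can be sketched like this. The `read_trans` callable is a placeholder for whatever reads the transmission channel; the sample values are invented, and this is not the code of the autolocker script itself.

```python
def peak_transmission(read_trans, n_reads=10):
    """Read the transmission n_reads times and return the largest value."""
    return max(read_trans() for _ in range(n_reads))


# Stand-in for a channel read; real code would query PSL-PMC_PMCTRANSPD.
samples = iter([0.05, 0.31, 0.12, 0.44, 0.08, 0.27, 0.39, 0.02, 0.18, 0.41])
peak = peak_transmission(lambda: next(samples))
print(peak)
```

Taking the maximum rather than the mean matters here: the question is whether the offset ever reaches resonance, not what the transmission looks like on average while it oscillates.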

The next problem is to find the proper range and step for varying the DC offset RAMP. Of course, we can choose the maximum range (-7, 0) and minimum step 0.0001, but it would take too long to find the proper DC offset. For that reason the autolocker tries to find a resonance close to the previous DC offset, in the range (RAMP_OLD - delta, RAMP_OLD + delta); the initial delta is 0.03 and the step is 0.003. If a resonance is not found in this region, delta is multiplied by a factor of 2, and so on. During this process the RAMP range is limited so that it is never wider than (-7, 0).
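A minimal sketch of this expanding-window search, using the numbers given above (initial delta 0.03, step 0.003, range (-7, 0), threshold 0.4). The function names and the fake cavity are hypothetical, and for simplicity each wider pass rescans the values already tried, which the real script may or may not do.

```python
def candidate_offsets(ramp_old, delta=0.03, step=0.003, lo=-7.0, hi=0.0):
    """Yield RAMP values to try, widening the window around ramp_old.

    Each pass scans (ramp_old - delta, ramp_old + delta) clipped to (lo, hi),
    then doubles delta. The generator stops once a pass has covered the
    whole allowed range.
    """
    while True:
        start = max(lo, ramp_old - delta)
        stop = min(hi, ramp_old + delta)
        n_steps = int(round((stop - start) / step))
        for k in range(n_steps + 1):
            # Clamp so rounding can never push a value past the window edge.
            yield min(stop, start + k * step)
        if start <= lo and stop >= hi:
            return
        delta *= 2


def find_resonance(ramp_old, peak_trans_at, threshold=0.4):
    """Return the first RAMP whose peak transmission exceeds threshold, or None."""
    for ramp in candidate_offsets(ramp_old):
        if peak_trans_at(ramp) > threshold:
            return ramp
    return None


# Hypothetical cavity: transmission is high only within 0.01 of RAMP = -3.2.
def fake_cavity(ramp):
    return 1.0 if abs(ramp + 3.2) < 0.01 else 0.0


found = find_resonance(-3.0, fake_cavity)
print(found)
```

Starting near the previous offset keeps the common case fast, while the doubling guarantees the full range is eventually covered after a bounded number of passes.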

There might be a better way to do this. For example, use the gradient descent algorithm and control the step adaptively. I'll do that if this implementation turns out to be too slow.

I've disabled autolocker_pmc for the night.

  6699   Tue May 29 00:53:57 2012   Den   Update   CDS   problems

I've noticed several CDS problems:

  1. The communication indicator on the C1SUS model turns red once in a while. I press diag reset and it clears, but after some time it comes back.
  2. On the C1LSC machine the red "U" lamp lights up with a period of ~5 sec.
  3. I was not able to read data from the SR785 using netgpibdata.py. Either connection is not established at all, or data starts to download and then stops in the middle. I've checked the cables, power supplies and everything, still the same thing.

  6734   Thu May 31 22:13:08 2012   Jamie   Update   CDS   c1lsc: added remaining SHMEM senders for ERR and CTRL, c1oaf model updated appropriately

All the ERR and CTRL outputs in c1lsc now go to SHMEM senders.  I renamed the CTRL output SHMEM senders to be more generic, since they aren't specifically for OAF anymore.  See attached image from c1lsc.

c1oaf was updated so that SHMEM receivers pointed to the newly renamed senders.

c1lsc and c1oaf were rebuilt, installed, and restarted and are now running.

Attachment 1: lsc-shmem-out.png

  6748   Sun Jun 3 23:50:00 2012   Den   Update   CDS   biquad=1

From now on all models calculate IIR filters using biquad form. I've added biquad=1 to cdsParameters in all models except c1cal, then built, installed and restarted them.

  6755   Tue Jun 5 14:47:28 2012   Jamie   Update   CDS   new c1tst model for testing RCG code

I made a new model, c1tst, that we can use for debugging the FREQUENT RCG bugs that we keep encountering.  It's a bare model that runs on c1iscey.  Don't do anything important in here, and don't leave it in some crappy state.  Clean it up when you're done.

  6760   Wed Jun 6 00:32:22 2012   Jenne   Update   CDS   RFM model is way overloading the cpu

We have too much crap in the rfm model.  CPU time for the rfm model is regularly above 60us, and sometimes in the mid-70's (but sometimes jumps down briefly to ~47us, which is where I think it "used" to sit, but I don't remember when I last thought about that number)

This is potentially causing lots of asynchronous grief.

  6778   Thu Jun 7 03:37:26 2012   yuta   Update   CDS   mx_stream restarted on c1lsc, c1ioo

c1lsc and c1ioo computers had FB net statuses all red. So, I restarted mx_stream on each computer.

ssh controls@c1lsc
sudo /etc/init.d/mx_stream restart

  6787   Thu Jun 7 17:49:09 2012   Jamie   Update   CDS   c1sus in weird state, running models but unresponsive otherwise

Somehow c1sus was in a very strange state.  It was running models, but EPICS was slow to respond.  We could not log into it via ssh, and we could not bring up test points.  Since we didn't know what else to do we just gave it a hard reset.

Once it came back up, none of the models were running.  I think this is a separate problem with the model startup scripts that I need to debug.  I logged on to c1sus and ran:

rtcds restart all

(which handles proper order of restarts) and everything came up fine.

Have no idea what happened there to make c1sus freeze like that.  Will keep an eye out.

  6806   Tue Jun 12 17:29:28 2012   Den   Update   CDS   dq channels

All PEM and IOO DQ channels disappeared. These channels were commented out in the C1???.ini files, though I uncommented them a few weeks ago. It happened after these models were rebuilt; the C1???.ini files also changed. Why?

I added the channels back. mx_stream died on c1sus after I pressed DAQ Reload on the medm screen. For the IOO model it is even worse: after pressing DAQ Reload for the C1IOO model, the DAQD process dies on the FB and the IOO machine suspends.

I rebooted IOO, restarted models and fb. Models work now, but there might be an easier way to add channels without rebooting machines and daemons.

  6911   Wed Jul 4 17:33:04 2012   Jamie   Update   CDS   timing, possibly leap second, brought down CDS

I got a call from Koji and Yuta that something was wrong with the CDS system.  I somehow had an immediate suspicion that it had something to do with the recent leap second.

It took a while for nodus to respond, and once he finally let me in I found a bunch of the following in his dmesg, repeated and filling the buffer:

Jul  3 22:41:34 nodus xntpd[306]: [ID 774427 daemon.notice] time reset (step) 0.998366 s
Jul  3 22:46:20 nodus xntpd[306]: [ID 774427 daemon.notice] time reset (step) -1.000847 s

Looking at date on all the front end systems, including fb, I could tell that they all looked a second fast, which is what you would expect if they had missed the leap second.  Everything syncs against nodus, so given nodus's problems above, that might explain everything.
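The check described here (compare every machine's clock against a known-good one and look for an offset of about one second) can be sketched as below. The host readings are invented for illustration, "goodhost" is a made-up name, and the function names are hypothetical.

```python
def clock_offsets(host_times, reference_time):
    """Return each host's clock offset in seconds (positive means fast)."""
    return {host: t - reference_time for host, t in host_times.items()}


def missed_leap_second(host_times, reference_time, tolerance=0.2):
    """List hosts that read about one second ahead of the reference."""
    offsets = clock_offsets(host_times, reference_time)
    return sorted(h for h, off in offsets.items() if abs(off - 1.0) <= tolerance)


# Hypothetical epoch-second readings taken at the same instant on each host.
reference = 1341446400.00
readings = {"fb": 1341446401.00, "c1sus": 1341446400.99, "goodhost": 1341446400.01}

print(missed_leap_second(readings, reference))
```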

I stopped daqd and nds on fb, and unloaded the mx drivers, which seemed to be showing problems.  I also stopped nodus's xntp:

  sudo /etc/init.d/xntpd stop

His ntp config file is in /etc/inet/ntp.conf, which is definitely the WRONG PLACE, given that the ntp server is not, as far as I can tell, being controlled by inetd.  (nodus is WAY out of date and desperately needs an overhaul.  it's nearly impossible to figure out what the hell is going on in there).  I found an old elog of Rana's that mentioned updating his config to point him to the caltech NTP server, which is now listed in the config, so I tried manually resyncing against that:

  sudo ntpdate -s -b -u 131.215.239.14

Unfortunately that didn't seem to have any effect.  This was making me wonder if the caltech server is off?  Anyway, I tried resyncing against the global NTP pool:

  sudo ntpdate -s -b -u pool.ntp.org

This seemed to work: the clock came back in sync with others that are known good.  Once nodus time was good I reloaded the mx drivers on fb and restarted daqd and nds.  They seemed to come up fine.  At this point front ends started coming back on their own.  I went and restarted all the models on the machines that didn't (c1iscey and c1ioo).  Currently everything is looking ok.

I'm worried that there is still a problem with one of the NTP servers that nodus is sync'ing against, and that the problem might come back.  I'll check in again later tonight.

  6915   Thu Jul 5 01:20:58 2012   yuta   Summary   CDS   slow computers, 0x4000 for all DAQ status

ALS looks OK. I tried to lock FPMI using ALS, but I feel like I need 6 hands to do it with current ALS stability. Now I have all computers being so slow.

It was fine for 7 hours after Jamie the Great fixed this, but fb went down a couple of times and DAQ status for all models now shows 0x4000. I tried restarting mx_stream and restarting fb, but they didn't help.

  6917   Thu Jul 5 10:49:38 2012   Jamie   Update   CDS   front-end/fb communication lost, likely again due to timing offsets

All the front-ends are showing 0x4000 status and have lost communication with the frame builder.  It looks like the timing skew is back again.  The fb is ahead of real time by one second, and strangely nodus is ahead of real time by something like 5 seconds!  I'm looking into it now.

  6918   Thu Jul 5 11:12:53 2012   Jenne   Update   CDS   front-end/fb communication lost, likely again due to timing offsets

Quote:

All the front-ends are showing 0x4000 status and have lost communication with the frame builder.  It looks like the timing skew is back again.  The fb is ahead of real time by one second, and strangely nodus is ahead of real time by something like 5 seconds!  I'm looking into it now.

 I was bad and didn't read the elog before touching things, so I did a daqd restart, and mxstream restart on all the front ends, but neither of those things helped.  Then I saw the elog that Jamie's working on figuring it out.

  6920   Thu Jul 5 12:27:05 2012   Jamie   Update   CDS   front-end/fb communication lost, likely again due to timing offsets

Quote:

All the front-ends are showing 0x4000 status and have lost communication with the frame builder.  It looks like the timing skew is back again.  The fb is ahead of real time by one second, and strangely nodus is ahead of real time by something like 5 seconds!  I'm looking into it now.

This was indeed another leap second timing issue.  I'm guessing nodus resync'd from whatever server is posting the wrong time, and it brought everything out of sync again.  It really looks like the caltech server is off.  When I manually sync from there the time is off by a second, and then when I manually sync from the global pool it is correct.

I went ahead and updated nodus's config (/etc/inet/ntp.conf) to point to the global pool (pool.ntp.org).  I then restarted the ntp daemon:

  nodus$ sudo /etc/init.d/xntpd stop
  nodus$ sudo /etc/init.d/xntpd start

That brought nodus's time in sync.

At that point all I had to do was resync the time on fb:

  fb$ sudo /etc/init.d/ntp-client restart

When I did that daqd died, but it immediately restarted and everything was in sync.

  6997   Fri Jul 20 17:11:50 2012   Jamie   Update   CDS   All custom MEDM screens moved to cds_user_apps svn repo

Since there are various ongoing requests for this from the sites, I have moved all of our custom MEDM screens into the cds_user_apps SVN repository.  This is what I did:

For each system in /opt/rtcds/caltech/c1/medm, I copied their "master" directory into the repo, and then linked it back in to the usual place, e.g.:

a=/opt/rtcds/caltech/c1/medm/${model}/master
b=/opt/rtcds/userapps/trunk/${system}/c1/medm/${model}
mv $a $b
ln -s $b $a
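
The move-and-link-back pattern above can be tried safely in a scratch directory. This is a sketch only: the directory and file names under the temporary root are invented stand-ins for the real medm and userapps paths, and the helper name is hypothetical.

```python
import os
import shutil
import tempfile


def move_and_link_back(src, dst):
    """Move directory src to dst, then leave a symlink at src pointing to dst.

    dst must not already exist, otherwise shutil.move would place src inside it.
    """
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.move(src, dst)
    os.symlink(dst, src)


# Scratch layout standing in for .../medm/<model>/master and the svn checkout.
root = tempfile.mkdtemp()
a = os.path.join(root, "medm", "c1lsc", "master")
b = os.path.join(root, "userapps", "isc", "c1", "medm", "c1lsc")
os.makedirs(a)
open(os.path.join(a, "SCREEN.adl"), "w").close()

move_and_link_back(a, b)
print(os.path.islink(a), os.listdir(a))
```

After the move, anything that opens a screen through the old path still works, while the files themselves live inside the version-controlled tree.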

Before committing to the repo, I did a little bit of cleanup, to remove some binary files and other known superfluous stuff.  But I left most things there, since I don't know what is relevant or not.

Then committed everything to the repo.

 

  6999   Sat Jul 21 14:48:33 2012   Den   Update   CDS   RCG

As I've spent many hours trying to find the error in my C code for an online filter, I decided to write about it so that others don't have to repeat this.

I have a C function that was tested offline. I compiled and installed it on the front end machine without any errors. When I restarted the model, it did not run.

I modified the function the following way

void myFunction()
{
if(STATEMENT) return;
some code
}

I've adjusted input parameters such that STATEMENT was always true. However the model either started or not depending on the code after if statement. It turned out that the model could not start because of the following lines


cosine[1] = 1.0 - 0.5*a*a + a*a*a*a/24 - a*a*a*a*a*a/720 + a*a*a*a*a*a*a*a/40320;
sine[1] = a - a*a*a/6 + a*a*a*a*a/120 - a*a*a*a*a*a*a/5040;

When I split the sum into steps, the model began to run. I guess the conclusion is that we cannot put too many arithmetic operations into one "=". The most interesting thing is that these lines come after an if statement that is always true, so they should not even be executed. A possible explanation is that some compilers start to process the code after the if statement during its slow comparison. In our case it could start and then break down on these long expressions.
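For reference, one way of splitting the two sums into steps is to build each series term by term from the previous one. The sketch below is in Python rather than the C used on the front end, it is not the author's actual rewrite, and it says nothing about why the one-line form failed; it only shows that the stepwise form reproduces the same polynomials.

```python
import math


def cos_sin_series(a):
    """Truncated Taylor series for cos(a) and sin(a), built one term at a time.

    Matches the polynomials in the entry: cosine up to a**8 / 40320,
    sine up to a**7 / 5040.
    """
    a2 = a * a

    # cosine: 1 - a^2/2 + a^4/24 - a^6/720 + a^8/40320
    term = 1.0
    cosine = term
    for n in (2, 4, 6, 8):
        term = -term * a2 / ((n - 1) * n)
        cosine += term

    # sine: a - a^3/6 + a^5/120 - a^7/5040
    term = a
    sine = term
    for n in (3, 5, 7):
        term = -term * a2 / ((n - 1) * n)
        sine += term

    return cosine, sine


c, s = cos_sin_series(0.1)
print(c - math.cos(0.1), s - math.sin(0.1))
```

Each step is a single multiply, divide and add, so no statement carries more than a few operations.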

  7008   Mon Jul 23 18:57:52 2012   Jamie   Update   CDS   c1scx and c1scy models recompiled and restarted

After the changes listed in 7005 and 7007, I have rebuilt, installed, and restarted the c1scx and c1scy models.  Everything seems to have come back up ok.

Running into some daqd troubles because of a change to c1ioo, but will report on the new ALS channels when I can.

  7011   Mon Jul 23 19:50:43 2012   Jamie   Update   CDS   c1gcv model renamed to c1als

I decided to rename the c1gcv model to be c1als.  This is in an ongoing effort to rename all the ALS stuff as ALS, and get rid of the various GC{V,X,Y} named stuff.

Most of what was in the c1gcv model was already in a subsystem with an ALS top name, but there were a couple of channels outside of that which had funky names, namely the "GCV_GREEN" channels.  This fixes that, and makes things more consistent and simple.

Of course this required a bunch of other little changes:

  • rename model in userapps svn
  • target/fb/master had to be modified to point to the new chans/daq/C1ALS.ini channel file and gds/param/tpchn_c1als.par testpoint file
  • rename RFM channels appropriately, and fix in receiver models (c1scx, c1scy, c1mcs)
  • move custom medm screens in userapps svn (isc/c1/medm/c1als), and link to it at medm/c1als/master
  • moved old medm/c1gcv directory into a subdirectory of medm/c1als
  • update all medm screens that point to c1gcv stuff (mostly just ALS screens)

The above has been done.  Still todo:

  • FIX SCRIPTS!  There are almost certainly scripts that point to GC{V,X,Y} channels.  Those will have to be fixed as we come across them.
  • Fix the c1sc{x,y}/master/C1SC{X,Y}_GC{X,Y}_SLOW.adl screens.  I need to figure out a more consistent place for those screens.
  • Fix the C1ALS_COMPACT screen
  • ???

 

  7037   Thu Jul 26 12:10:28 2012   Den   Update   CDS   new c1tst model for testing RCG code

Quote:

I made a new model, c1tst, that we can use for debugging the FREQUENT RCG bugs that we keep encountering.  It's a bare model that runs on c1iscey.  Don't do anything important in here, and don't leave it in some crappy state.  Clean it up when you're done.

 I wanted to test biquad form in this model. I added biquad=1 flag to cdsParameters, compiled, installed and restarted it. After that c1iscey suspended.

The same thing as we had several months ago:

controls@c1iscey /opt/rtcds/caltech/c1/target/c1tst/c1tstepics 0$ cat iocC1.log

Starting iocInit
iocRun: All initialization complete
sh: iniChk.pl: command not found
Failed to load DAQ configuration file

  7043   Fri Jul 27 14:27:14 2012   Jamie   Update   CDS   new c1tst model for testing RCG code

Quote:

 

 I wanted to test biquad form in this model. I added biquad=1 flag to cdsParameters, compiled, installed and restarted it. After that c1iscey suspended.

The same thing as we had several months ago:

controls@c1iscey /opt/rtcds/caltech/c1/target/c1tst/c1tstepics 0$ cat iocC1.log

Starting iocInit
iocRun: All initialization complete
sh: iniChk.pl: command not found
Failed to load DAQ configuration file

I have fixed the iniChk.pl issue (which actually fixed a separate model startup-on-boot issue that we had been having).  However, that is completely unrelated to the system freeze.  I'll discuss that in a separate post.

  7046   Fri Jul 27 16:32:17 2012   Jamie   Update   CDS   RCG bug exposed by simple c1tst model

As Den mentioned in 7037, attempting to run the c1tst model was causing the entire c1iscey machine to crash.  Alex came over this morning and we spent a couple of hours trying to debug what was going on.

c1tst is the simplest possible model you can have: 1 ADC and 2 filter modules.  It compiles just fine, but when you tried to load it the machine would completely freeze.

We eventually tracked this down to a non-empty filter file for one of the filter modules.  It turns out that the model was crashing when it attempted to load the filter file.  Once we completely deleted all the filters in the module, the model would run.  But then if you added back a filter to the filter file and tried to "load coefficients", the model/machine would immediately crash again.

So it has something to do with the loading of the filter coefficients from the filter file.  We tried different filters and it didn't seem to make a difference.  Alex thought it might have something to do with zeros in some of the second-order sections, but that wasn't it either.

There's speculation that it might be related to a very similar bug that Joe reported at LLO a month ago: https://bugzilla.ligo-wa.caltech.edu/bugzilla/show_bug.cgi?id=398

Things we tried, none of which worked:

  • adding a DAC
  • turning on/off biquad
  • disabling the float denormalization fix

This is a real mystery.  Alex and I are continuing to investigate.

  7049   Mon Jul 30 12:38:45 2012   Jamie   Update   CDS   Move to RCG 2.5 tag release

I moved the RCG to the advLigoRTS-2.5 tag:

controls@c1iscey ~ 0$ ls -al /opt/rtcds/rtscore/release
lrwxrwxrwx 1 controls 1001 19 Jul 30 12:02 /opt/rtcds/rtscore/release -> tags/advLigoRTS-2.5
controls@c1iscey ~ 0$ 

There are only very minor differences between the 2.5 tag and what we were running on the 2.5 branch.  I have NOT rebuilt all the models yet.

I was hoping that there was something in the tagged release that might fix this hard-crash-on-filter-load issue that we're seeing in the c1tst model.  It didn't.  Still have no idea what's going on there.

  7057   Tue Jul 31 15:17:58 2012   Jamie   Update   CDS   c1ifo medm screens checked into CDS userapps svn

I moved the medm/c1ifo directory into the CDS userapps svn at cds/c1/medm/c1ifo, and then linked it back into the medm directory:

controls@rossa:~ 0$ ls -al /opt/rtcds/caltech/c1/medm/c1ifo
lrwxrwxrwx 1 controls controls 56 2012-07-31 11:53 /opt/rtcds/caltech/c1/medm/c1ifo -> /opt/rtcds/caltech/c1/userapps/release/cds/c1/medm/c1ifo
controls@rossa:~ 0$

I then committed whatever was useful in there.  We need to remember to commit when we make changes.

  7060   Tue Jul 31 18:59:59 2012   Jamie   Update   CDS   updated medm paths for MISC (IFO,CDS,VIDEO), IFO alignment scripts updated accordingly

In an attempt to clean up the medm situation, I did a bunch of further rearrangement and cleanup.  Instead of having c1ifo, I moved a bunch of stuff into a MISC directory, including all of the CDS, IFO, and VIDEO screens:

controls@rossa:~ 0$ ls -1 /opt/rtcds/caltech/c1/medm/MISC | grep -v _BAK
CDS_BIO_STATUS.adl
CDS_FE_STATUS.adl
CDS_IPC_ERR.adl
help
ifoalign
IFO_ALIGN.adl
ifoalign.orig
IFO_CONFIGURE.adl
IFO_CONFIGURE.txt
IFO_OVERVIEW.adl
IFO_OVERVIEW_AIDAN.adl
IFO_STATE.adl
snap
VIDEO.adl
controls@rossa:~ 0$

I updated the sitemap and the relevant screens accordingly.

I also updated the IFO_ALIGN script infrastructure a bit.  The new IFO ALIGN location is /opt/rtcds/caltech/c1/medm/MISC/ifoalign.  The scripts are now called simply:

/opt/rtcds/caltech/c1/medm/MISC/ifoalign/misalign_soft.csh
/opt/rtcds/caltech/c1/medm/MISC/ifoalign/restore_soft.csh
/opt/rtcds/caltech/c1/medm/MISC/ifoalign/save_soft.csh

These run Jenne's new soft restore stuff, and store burt snapshots for optic alignment in /opt/rtcds/caltech/c1/medm/MISC/ifoalign/burt.

This has all been checked into the svn.

  7062   Tue Jul 31 20:59:35 2012   Jamie   Update   CDS   fixing up MEDM snapshots

I've started to try to fix all the old non-working medm snapshots.  I've made a new snapshot directory, and some new snapshot scripts to handle taking the snapshots, and view old ones.

Snapshots are now in:  /opt/rtcds/caltech/c1/medm/snap

There is one main script: /opt/rtcds/caltech/c1/medm/snap/snapcommands.  This is linked by:

  • updatesnap: update a snapshot
  • viewsnap: view most recent snapshot
  • viewold: view old snapshots

These commands take either the full path to the screen in question, or its path relative to the medm directory (/opt/rtcds/caltech/c1/medm).  The snapshots for a specific screen are stored in a directory specific to the screen, in a place relative to the snap directory that mimics the screen's relative path to the overall medm directory.  So for instance, the snap directory for: 

/opt/rtcds/caltech/c1/medm/c1lsc/master/C1LSC_OVERVIEW.adl

is:

controls@rossa:~ 0$ find /opt/rtcds/caltech/c1/medm/snap/c1lsc/master/C1LSC_OVERVIEW/
/opt/rtcds/caltech/c1/medm/snap/c1lsc/master/C1LSC_OVERVIEW/
/opt/rtcds/caltech/c1/medm/snap/c1lsc/master/C1LSC_OVERVIEW/2012-08-01-02:21:34-UTC.png
/opt/rtcds/caltech/c1/medm/snap/c1lsc/master/C1LSC_OVERVIEW/2012-08-01-02:21:04-UTC.png
/opt/rtcds/caltech/c1/medm/snap/c1lsc/master/C1LSC_OVERVIEW/2012-08-01-02:21:23-UTC.png
/opt/rtcds/caltech/c1/medm/snap/c1lsc/master/C1LSC_OVERVIEW/2012-08-01-03:41:34-UTC.png
/opt/rtcds/caltech/c1/medm/snap/c1lsc/master/C1LSC_OVERVIEW/current.png
controls@rossa:~ 0$ 

The "current.png" is a link to the most recent snapshot, and is what you get when you "viewsnap".
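The path mapping described above can be sketched as follows. This is an illustration of the layout only, not the contents of the snapcommands script; the function names are hypothetical, and the timestamp format is inferred from the file names in the listing.

```python
import calendar
import posixpath
import time

MEDM_ROOT = "/opt/rtcds/caltech/c1/medm"
SNAP_ROOT = MEDM_ROOT + "/snap"


def snap_dir(screen):
    """Map a screen path (absolute, or relative to MEDM_ROOT) to its snap directory."""
    if posixpath.isabs(screen):
        rel = posixpath.relpath(screen, MEDM_ROOT)
    else:
        rel = screen
    base, _ext = posixpath.splitext(rel)  # drop the .adl suffix
    return posixpath.join(SNAP_ROOT, base)


def snapshot_name(epoch_seconds):
    """Build a file name like 2012-08-01-02:21:34-UTC.png from a UTC timestamp."""
    return time.strftime("%Y-%m-%d-%H:%M:%S-UTC.png", time.gmtime(epoch_seconds))


print(snap_dir("/opt/rtcds/caltech/c1/medm/c1lsc/master/C1LSC_OVERVIEW.adl"))
print(snap_dir("MISC/IFO_ALIGN.adl"))
print(snapshot_name(calendar.timegm((2012, 8, 1, 2, 21, 34))))
```

Because the file names sort chronologically, the most recent snapshot is simply the last name in sorted order, which is what "current.png" would point at.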

Below is an example medm snippet of what could be used in screens to enable this functionality.  I've fixed up a couple of the screens, but there are a lot more that need to be updated.

"shell command" {
    object {
        x=666
        y=597
        width=40
        height=40
    }
    command[0] {
        label="Settings and Procedures"
        name="emacs"
        args="./help/IFO_ALIGN.txt &"
    }
    command[1] {
        label="View Snapshot"
        name="/opt/rtcds/caltech/c1/medm/snap/viewsnap"
        args="MISC/IFO_ALIGN.adl &"
    }
    command[2] {
        label="View old snapshots"
        name="/opt/rtcds/caltech/c1/medm/snap/viewold"
        args="MISC/IFO_ALIGN.adl &"
    }
    command[3] {
        label="Update Snapshot"
        name="/opt/rtcds/caltech/c1/medm/snap/updatesnap"
        args="MISC/IFO_ALIGN.adl &"
    }
    clr=14
    bclr=30
}

  7067   Wed Aug 1 11:50:49 2012   Jamie   Update   CDS   added input monitors to LSC_TRIGGER library part

I added an EPICS monitor to the input of the LSC_TRIGGER part, to allow monitoring the signal used for the trigger.  I then added the monitors to the C1LSC_TRIG_MTRX screen (see below).  This should hopefully aid in setting the trigger levels.

Attachment 1: trigmtrx.png

  7086   Sun Aug 5 13:48:40 2012   Den   Update   CDS   Move to RCG 2.5 tag release

Quote:

I moved the RCG to the advLigoRTS-2.5 tag

 After that, RFM -> OAF communication through PCIE became bad again. Inside CommData2.c, cache flushing is disabled:

// If PCIE comms show errors, may want to add this cache flushing
            #if 0
            if(ipcInfo[ii].netType == IPCIE)
                clflush_cache_range (ipcInfo[ii].pIpcData->dBlock[sendBlock][ipcIndex].data, 16);
            #endif

As a result, a significant part of MC_F and other signals is lost during RFM -> OAF transmission (270 - 330 out of 2048 per second)

erros.png   overview.png    oaf.png

 


Last time, when I replaced 0 with 1, it suspended the SUS machine because of the code bug. Alex modified a couple of files in the old version and it started to work. Do you know if this bug is fixed in the new version?

  7093   Mon Aug 6 19:37:50 2012 JamieUpdateCDSdaqd and CDS network problems today

For some reason this afternoon we've been experiencing a lot of problems with the framebuilder, and with the CDS network in general.  The framebuilder has been very unresponsive, although the daqd logs seem to indicate that things are ok.  All models lose contact with fb for very long stretches.  Attempts to kill/restart daqd don't seem to fix the problem.

These problems seem to be associated with the general CDS network issues as well.  The network seems to become very slow, and the workstations all become very slow.  The latter, I assume, is because of the network, since so much of the work we do is on network-mounted filesystems (/opt/rtcds, /ligo, etc.).

My current speculation is that daqd on fb is doing something stupid, like trying to read or write a bunch of stuff from /frames, which is also network mounted, and that clogs up the entire network.  Some serious network debugging is going to be needed to figure out what's going on, though.

I'm afraid daqd is caught in some bad state now, though.  It's not responding to anything, and every attempt to kill it seems to bring it back into the bad state.  Hopefully I can get Alex to help me figure out what's going on tomorrow.   Maybe it will clear up on its own tonight...

  7094   Mon Aug 6 19:54:53 2012 JamieUpdateCDSdaqd and CDS network problems today

It looks like daqd is indeed caught in some bad state.  It seems to die at some point after making GPS corrections to minute trender:

...
[Mon Aug  6 19:45:13 2012] Minute trender made GPS time correction; gps=1028342727; gps%60=27
tail: `fb/logs/daqd.log' has been replaced;  following end of new file
263596
MX endpoint opened
startup file interpreter thread tid=140334118615312
calling yyparse(5, 6)
[Mon Aug  6 19:50:08 2012] ->5: #set avoid_reconnect
[Mon Aug  6 19:50:08 2012] ->5: set thread_stack_size=102400
[Mon Aug  6 19:50:08 2012] new threads will be created with the stack of size 102400K
[Mon Aug  6 19:50:08 2012] ->5: set allow_tpman_connect_fail
[Mon Aug  6 19:50:08 2012] ->5: #set dcu_status_check=5
[Mon Aug  6 19:50:08 2012] ->5: #set symm_gps_offset=-1
[Mon Aug  6 19:50:08 2012] ->5: #set symm_gps_offset=31535998
[Mon Aug  6 19:50:08 2012] ->5: ##set symm_gps_offset=347155213
[Mon Aug  6 19:50:08 2012] ->5: #set symm_gps_offset=378691215
[Mon Aug  6 19:50:08 2012] ->5: #set symm_gps_offset=378691212
[Mon Aug  6 19:50:08 2012] ->5: #set symm_gps_offset=315964799
[Mon Aug  6 19:50:08 2012] ->5: set symm_gps_offset=315964801
[Mon Aug  6 19:50:08 2012] ->5: set debug=0
[Mon Aug  6 19:50:08 2012] ->5: set log=2
[Mon Aug  6 19:50:08 2012] ->5: set zero_bad_data=0
[Mon Aug  6 19:50:08 2012] ->5: set dcu_status_check=9
[Mon Aug  6 19:50:08 2012] ->5: set controller_dcu=33
[Mon Aug  6 19:50:08 2012] ->5: set master_config="/opt/rtcds/caltech/c1/target/fb/master"
[Mon Aug  6 19:50:10 2012] finished configuring data channels
[Mon Aug  6 19:50:10 2012] ->5: configure channels begin end
Unable to find GDS node 90 system c1x00 in INI files
Unable to find GDS node 92 system c1tst2 in INI files
Unable to find GDS node 95 system c1x10 in INI files
[Mon Aug  6 19:50:10 2012] ->5: tpconfig "/opt/rtcds/caltech/c1/target/gds/param/testpoint.par"
[Mon Aug  6 19:50:10 2012] ->5: set gps_leaps = 820108813
[Mon Aug  6 19:50:10 2012] ->5: set detector_name="CIT"
[Mon Aug  6 19:50:10 2012] ->5: set detector_prefix="C1"
[Mon Aug  6 19:50:10 2012] ->5: set detector_longitude=-90.7742403889
[Mon Aug  6 19:50:10 2012] ->5: set detector_latitude=30.5628943337
[Mon Aug  6 19:50:10 2012] ->5: set detector_elevation=.0
[Mon Aug  6 19:50:10 2012] ->5: set detector_azimuths=1.1,4.7123889804
[Mon Aug  6 19:50:10 2012] ->5: set detector_altitudes=1.0,2.0
[Mon Aug  6 19:50:10 2012] ->5: set detector_midpoints=2000.0, 2000.0
[Mon Aug  6 19:50:10 2012] ->5: set num_dirs = 10
[Mon Aug  6 19:50:10 2012] ->5: set frames_per_dir=225
[Mon Aug  6 19:50:10 2012] ->5: set full_frames_per_file=1
[Mon Aug  6 19:50:10 2012] ->5: set full_frames_blocks_per_frame=16
[Mon Aug  6 19:50:10 2012] ->5: set frame_dir="/frames/full", "C-R-", ".gwf"
[Mon Aug  6 19:50:10 2012] ->5: set trend_num_dirs=10
[Mon Aug  6 19:50:10 2012] ->5: set trend_frames_per_dir=1440
[Mon Aug  6 19:50:10 2012] ->5: set trend_frame_dir= "/frames/trend/second", "C-T-", ".gwf"
[Mon Aug  6 19:50:10 2012] ->5: set raw-minute-trend-dir="/frames/trend/minute_raw"
[Mon Aug  6 19:50:10 2012] ->5: set nds-jobs-dir="/opt/rtcds/caltech/c1/target/fb"
[Mon Aug  6 19:50:10 2012] ->5: set minute-trend-num-dirs=10
[Mon Aug  6 19:50:10 2012] ->5: set minute-trend-frames-per-dir=24
[Mon Aug  6 19:50:10 2012] ->5: set minute-trend-frame-dir="/frames/trend/minute", "C-M-", ".gwf"
[Mon Aug  6 19:50:10 2012] ->5: start main 10
[Mon Aug  6 19:50:12 2012] main started
[Mon Aug  6 19:50:12 2012] ->5: start profiler
[Mon Aug  6 19:50:12 2012] ->5: # comment out this block to stop saving data
[Mon Aug  6 19:50:12 2012] frame saver started
[Mon Aug  6 19:50:12 2012] ->5: start frame-saver
[Mon Aug  6 19:50:13 2012] ->5: sync frame-saver
[Mon Aug  6 19:50:13 2012] ->5: start trender
[Mon Aug  6 19:50:13 2012] trender started
[Mon Aug  6 19:50:13 2012] trend frame saver started
[Mon Aug  6 19:50:13 2012] ->5: start trend-frame-saver
[Mon Aug  6 19:50:14 2012] ->5: sync trend-frame-saver
[Mon Aug  6 19:50:14 2012] minute trend frame saver started
[Mon Aug  6 19:50:14 2012] ->5: start minute-trend-frame-saver
[Mon Aug  6 19:50:14 2012] Done creating ADC structures
[Mon Aug  6 19:50:15 2012] ->5: sync minute-trend-frame-saver
[Mon Aug  6 19:50:15 2012] raw minute trend frame saver started
[Mon Aug  6 19:50:15 2012] ->5: start raw_minute_trend_saver
[Mon Aug  6 19:50:15 2012] ->5: #frame-writer "225.225.225.1" broadcast="131.215.113.0" all
[Mon Aug  6 19:50:15 2012] ->5: #sleep 5
[Mon Aug  6 19:50:15 2012] producer started
[Mon Aug  6 19:50:15 2012] ->5: start producer
[Mon Aug  6 19:50:15 2012] ->5: start epics dcu
[Mon Aug  6 19:50:15 2012] MX receiver thread started
[Mon Aug  6 19:50:15 2012] edcu started
[Mon Aug  6 19:50:15 2012] ->5: start epics server "C0:DAQ-DC0_" "C1:DAQ-DC0_"
[Mon Aug  6 19:50:15 2012] epics server started
[Mon Aug  6 19:50:15 2012] ->5: start listener 8087
[Mon Aug  6 19:50:15 2012] ->5: start listener 8088 1
[Mon Aug  6 19:50:15 2012] ->5: sleep 60
[Mon Aug  6 19:50:15 2012] Epics server started
[Mon Aug  6 19:50:15 2012] EDCU has 2553 channels configured; first=0

[Mon Aug  6 19:50:18 2012] Minute trender made GPS time correction; gps=1028343032; gps%60=32
...

The "tail:..." line indicates that the log was moved and replaced, which indicates a daqd restart.  As far as I know this was not manually triggered.

After the restart the same thing happens again.  About once every five minutes.
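The restart period can be read off the log directly. Below is a rough sketch that pulls the GPS times out of the "Minute trender made GPS time correction" lines and differences them. It assumes one such line is logged per daqd start, as in the excerpt above; the function names are made up.

```python
import re

GPS_LINE = re.compile(
    r"Minute trender made GPS time correction; gps=(\d+); gps%60=(\d+)"
)


def gps_corrections(log_text):
    """Return the GPS times of all minute-trender correction lines."""
    return [int(match.group(1)) for match in GPS_LINE.finditer(log_text)]


def restart_intervals(log_text):
    """Seconds between consecutive corrections, taken as one per daqd start."""
    times = gps_corrections(log_text)
    return [later - earlier for earlier, later in zip(times, times[1:])]


# The two correction lines from the log excerpt above.
sample = """\
[Mon Aug  6 19:45:13 2012] Minute trender made GPS time correction; gps=1028342727; gps%60=27
[Mon Aug  6 19:50:18 2012] Minute trender made GPS time correction; gps=1028343032; gps%60=32
"""
print(restart_intervals(sample))  # [305]
```

305 seconds between the two corrections is consistent with a restart roughly every five minutes.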

  7095   Mon Aug 6 20:08:45 2012 JamieUpdateCDSdaqd and CDS network problems today

When daqd is caught in this state it cannot be killed.  It's in "uninterruptible sleep" ('D' state in the top output below).  This usually indicates that it's waiting on the kernel, typically because of some missing or hung I/O.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                      
28038 controls  20   0 4430m 2.0g  19m D    0 27.1   0:15.00 daqd                                                                                         

The memory footprint also seems to be getting big.  It's clearly trying to do something stupid that it can't handle.
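A small sketch of checking for this state automatically, by splitting a line of top output into its columns. It assumes the default column layout shown above; parse_top_line and is_uninterruptible are made-up helper names.

```python
TOP_COLUMNS = ["PID", "USER", "PR", "NI", "VIRT", "RES", "SHR",
               "S", "%CPU", "%MEM", "TIME+", "COMMAND"]


def parse_top_line(line):
    """Split one process line of top output into a column -> value dict."""
    # Limit the split so a command containing spaces stays in one field.
    fields = line.strip().split(None, len(TOP_COLUMNS) - 1)
    return dict(zip(TOP_COLUMNS, fields))


def is_uninterruptible(line):
    """True if the process is in uninterruptible sleep ('D' in the S column)."""
    return parse_top_line(line)["S"] == "D"


row = "28038 controls  20   0 4430m 2.0g  19m D    0 27.1   0:15.00 daqd"
print(parse_top_line(row)["COMMAND"], is_uninterruptible(row))  # daqd True
```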

  7096   Mon Aug 6 20:22:50 2012 JamieUpdateCDSdaqd segfaulting after five minutes

I tried running daqd manually, and sure enough it segfaults after about five minutes (see log below).  I've commented it out of /etc/inittab on fb and I'm leaving it off for now until we can figure out what's going on.

controls@fb /opt/rtcds/caltech/c1/target/fb 0$ /opt/rtcds/caltech/c1/target/fb/daqd -c /opt/rtcds/caltech/c1/target/fb/daqdrc
263596
MX endpoint opened
startup file interpreter thread tid=139790943115536
calling yyparse(5, 6)
[Mon Aug  6 20:15:27 2012] ->5: #set avoid_reconnect
[Mon Aug  6 20:15:27 2012] ->5: set thread_stack_size=102400
[Mon Aug  6 20:15:27 2012] new threads will be created with the stack of size 102400K
[Mon Aug  6 20:15:27 2012] ->5: set allow_tpman_connect_fail
[Mon Aug  6 20:15:27 2012] ->5: #set dcu_status_check=5
[Mon Aug  6 20:15:27 2012] ->5: #set symm_gps_offset=-1
[Mon Aug  6 20:15:27 2012] ->5: #set symm_gps_offset=31535998
[Mon Aug  6 20:15:27 2012] ->5: ##set symm_gps_offset=347155213
[Mon Aug  6 20:15:27 2012] ->5: #set symm_gps_offset=378691215
[Mon Aug  6 20:15:27 2012] ->5: #set symm_gps_offset=378691212
[Mon Aug  6 20:15:27 2012] ->5: #set symm_gps_offset=315964799
[Mon Aug  6 20:15:27 2012] ->5: set symm_gps_offset=315964801
[Mon Aug  6 20:15:27 2012] ->5: set debug=0
[Mon Aug  6 20:15:27 2012] ->5: set log=2
[Mon Aug  6 20:15:27 2012] ->5: set zero_bad_data=0
[Mon Aug  6 20:15:27 2012] ->5: set dcu_status_check=9
[Mon Aug  6 20:15:27 2012] ->5: set controller_dcu=33
[Mon Aug  6 20:15:27 2012] ->5: set master_config="/opt/rtcds/caltech/c1/target/fb/master"
[Mon Aug  6 20:15:30 2012] finished configuring data channels
[Mon Aug  6 20:15:30 2012] ->5: configure channels begin end
GDS server NODE=19 HOST=c1iscex DCUID=19
GDS server NODE=20 HOST=c1sus DCUID=20
GDS server NODE=21 HOST=c1sus DCUID=21
GDS server NODE=22 HOST=c1lsc DCUID=22
GDS server NODE=25 HOST=c1iscex DCUID=61
GDS server NODE=28 HOST=c1ioo DCUID=28
GDS server NODE=33 HOST=c1ioo DCUID=33
GDS server NODE=34 HOST=c1ioo DCUID=34
GDS server NODE=36 HOST=c1sus DCUID=36
GDS server NODE=38 HOST=c1sus DCUID=38
GDS server NODE=39 HOST=c1sus DCUID=39
GDS server NODE=40 HOST=c1lsc DCUID=40
GDS server NODE=42 HOST=c1lsc DCUID=42
GDS server NODE=45 HOST=c1iscex DCUID=45
GDS server NODE=46 HOST=c1iscey DCUID=46
GDS server NODE=47 HOST=c1iscey DCUID=47
GDS server NODE=48 HOST=c1lsc DCUID=48
GDS server NODE=50 HOST=c1lsc DCUID=50
GDS server NODE=51 HOST=c1ioo DCUID=51
GDS server NODE=60 HOST=c1lsc DCUID=60
GDS server NODE=61 HOST=c1iscex DCUID=61
GDS server NODE=62 HOST=c1sus DCUID=62
Unable to find GDS node 90 system c1x00 in INI files
GDS server NODE=91 HOST=c1lsc DCUID=60
Unable to find GDS node 92 system c1tst2 in INI files
Unable to find GDS node 95 system c1x10 in INI files
TP: node = 19, host = c1iscex, dup = 0, prog = 0x31002013, vers = 1
Initialized TP interface node=19, host=c1iscex
TP: node = 20, host = c1sus, dup = 0, prog = 0x31002014, vers = 1
Initialized TP interface node=20, host=c1sus
TP: node = 21, host = c1sus, dup = 0, prog = 0x31002015, vers = 1
Initialized TP interface node=21, host=c1sus
TP: node = 22, host = c1lsc, dup = 0, prog = 0x31002016, vers = 1
Initialized TP interface node=22, host=c1lsc
TP: node = 25, host = c1iscex, dup = 0, prog = 0x31002019, vers = 1
Initialized TP interface node=25, host=c1iscex
TP: node = 28, host = c1ioo, dup = 0, prog = 0x3100201c, vers = 1
Initialized TP interface node=28, host=c1ioo
TP: node = 33, host = c1ioo, dup = 0, prog = 0x31002021, vers = 1
Initialized TP interface node=33, host=c1ioo
TP: node = 34, host = c1ioo, dup = 0, prog = 0x31002022, vers = 1
Initialized TP interface node=34, host=c1ioo
TP: node = 36, host = c1sus, dup = 0, prog = 0x31002024, vers = 1
Initialized TP interface node=36, host=c1sus
TP: node = 38, host = c1sus, dup = 0, prog = 0x31002026, vers = 1
Initialized TP interface node=38, host=c1sus
TP: node = 39, host = c1sus, dup = 0, prog = 0x31002027, vers = 1
Initialized TP interface node=39, host=c1sus
TP: node = 40, host = c1lsc, dup = 0, prog = 0x31002028, vers = 1
Initialized TP interface node=40, host=c1lsc
TP: node = 42, host = c1lsc, dup = 0, prog = 0x3100202a, vers = 1
Initialized TP interface node=42, host=c1lsc
TP: node = 45, host = c1iscex, dup = 0, prog = 0x3100202d, vers = 1
Initialized TP interface node=45, host=c1iscex
TP: node = 46, host = c1iscey, dup = 0, prog = 0x3100202e, vers = 1
Initialized TP interface node=46, host=c1iscey
TP: node = 47, host = c1iscey, dup = 0, prog = 0x3100202f, vers = 1
Initialized TP interface node=47, host=c1iscey
TP: node = 48, host = c1lsc, dup = 0, prog = 0x31002030, vers = 1
Initialized TP interface node=48, host=c1lsc
TP: node = 50, host = c1lsc, dup = 0, prog = 0x31002032, vers = 1
Initialized TP interface node=50, host=c1lsc
TP: node = 51, host = c1ioo, dup = 0, prog = 0x31002033, vers = 1
Initialized TP interface node=51, host=c1ioo
TP: node = 60, host = c1lsc, dup = 0, prog = 0x3100203c, vers = 1
Initialized TP interface node=60, host=c1lsc
TP: node = 61, host = c1iscex, dup = 0, prog = 0x3100203d, vers = 1
Initialized TP interface node=61, host=c1iscex
TP: node = 62, host = c1sus, dup = 0, prog = 0x3100203e, vers = 1
Initialized TP interface node=62, host=c1sus
TP: node = 91, host = c1lsc, dup = 0, prog = 0x3100205b, vers = 1
Initialized TP interface node=91, host=c1lsc
[Mon Aug  6 20:15:30 2012] ->5: tpconfig "/opt/rtcds/caltech/c1/target/gds/param/testpoint.par"
[Mon Aug  6 20:15:30 2012] ->5: set gps_leaps = 820108813
[Mon Aug  6 20:15:30 2012] ->5: set detector_name="CIT"
[Mon Aug  6 20:15:30 2012] ->5: set detector_prefix="C1"
[Mon Aug  6 20:15:30 2012] ->5: set detector_longitude=-90.7742403889
[Mon Aug  6 20:15:30 2012] ->5: set detector_latitude=30.5628943337
[Mon Aug  6 20:15:30 2012] ->5: set detector_elevation=.0
[Mon Aug  6 20:15:30 2012] ->5: set detector_azimuths=1.1,4.7123889804
[Mon Aug  6 20:15:30 2012] ->5: set detector_altitudes=1.0,2.0
[Mon Aug  6 20:15:30 2012] ->5: set detector_midpoints=2000.0, 2000.0
[Mon Aug  6 20:15:30 2012] ->5: set num_dirs = 10
[Mon Aug  6 20:15:30 2012] ->5: set frames_per_dir=225
[Mon Aug  6 20:15:30 2012] ->5: set full_frames_per_file=1
[Mon Aug  6 20:15:30 2012] ->5: set full_frames_blocks_per_frame=16
[Mon Aug  6 20:15:30 2012] ->5: set frame_dir="/frames/full", "C-R-", ".gwf"
[Mon Aug  6 20:15:30 2012] ->5: set trend_num_dirs=10
[Mon Aug  6 20:15:30 2012] ->5: set trend_frames_per_dir=1440
[Mon Aug  6 20:15:30 2012] ->5: set trend_frame_dir= "/frames/trend/second", "C-T-", ".gwf"
[Mon Aug  6 20:15:30 2012] ->5: set raw-minute-trend-dir="/frames/trend/minute_raw"
[Mon Aug  6 20:15:30 2012] ->5: set nds-jobs-dir="/opt/rtcds/caltech/c1/target/fb"
[Mon Aug  6 20:15:30 2012] ->5: set minute-trend-num-dirs=10
[Mon Aug  6 20:15:30 2012] ->5: set minute-trend-frames-per-dir=24
[Mon Aug  6 20:15:30 2012] ->5: set minute-trend-frame-dir="/frames/trend/minute", "C-M-", ".gwf"
[Mon Aug  6 20:15:30 2012] ->5: start main 10
Allocated move buffer size 11616356 bytes
[Mon Aug  6 20:15:32 2012] main started
[Mon Aug  6 20:15:32 2012] ->5: start profiler
[Mon Aug  6 20:15:32 2012] ->5: # comment out this block to stop saving data
[Mon Aug  6 20:15:32 2012] frame saver started
[Mon Aug  6 20:15:32 2012] ->5: start frame-saver
[Mon Aug  6 20:15:33 2012] ->5: sync frame-saver
[Mon Aug  6 20:15:33 2012] ->5: start trender
[Mon Aug  6 20:15:33 2012] trender started
[Mon Aug  6 20:15:33 2012] trend frame saver started
[Mon Aug  6 20:15:33 2012] ->5: start trend-frame-saver
[Mon Aug  6 20:15:34 2012] ->5: sync trend-frame-saver
[Mon Aug  6 20:15:34 2012] minute trend frame saver started
[Mon Aug  6 20:15:34 2012] ->5: start minute-trend-frame-saver
[Mon Aug  6 20:15:34 2012] Done creating ADC structures
[Mon Aug  6 20:15:35 2012] ->5: sync minute-trend-frame-saver
[Mon Aug  6 20:15:35 2012] raw minute trend frame saver started
[Mon Aug  6 20:15:35 2012] ->5: start raw_minute_trend_saver
[Mon Aug  6 20:15:35 2012] ->5: #frame-writer "225.225.225.1" broadcast="131.215.113.0" all
[Mon Aug  6 20:15:35 2012] ->5: #sleep 5
[Mon Aug  6 20:15:35 2012] producer started
[Mon Aug  6 20:15:35 2012] ->5: start producer
[Mon Aug  6 20:15:35 2012] ->5: start epics dcu
[Mon Aug  6 20:15:35 2012] MX receiver thread started
[Mon Aug  6 20:15:35 2012] edcu started
[Mon Aug  6 20:15:35 2012] ->5: start epics server "C0:DAQ-DC0_" "C1:DAQ-DC0_"
[Mon Aug  6 20:15:35 2012] epics server started
[Mon Aug  6 20:15:35 2012] ->5: start listener 8087
[Mon Aug  6 20:15:35 2012] ->5: start listener 8088 1
[Mon Aug  6 20:15:35 2012] ->5: sleep 60
Creating C1:DAQ-DC0_PEM_SLOW_STATUS
Creating C1:DAQ-DC0_PEM_SLOW_CRC_CPS
Creating C1:DAQ-DC0_PEM_SLOW_CRC_SUM
Creating C1:DAQ-DC0_C1X01_STATUS
Creating C1:DAQ-DC0_C1X01_CRC_CPS
Creating C1:DAQ-DC0_C1X01_CRC_SUM
Creating C1:DAQ-DC0_C1X02_STATUS
Creating C1:DAQ-DC0_C1X02_CRC_CPS
Creating C1:DAQ-DC0_C1X02_CRC_SUM
Creating C1:DAQ-DC0_C1SUS_STATUS
Creating C1:DAQ-DC0_C1SUS_CRC_CPS
Creating C1:DAQ-DC0_C1SUS_CRC_SUM
Creating C1:DAQ-DC0_C1OAF_STATUS
Creating C1:DAQ-DC0_C1OAF_CRC_CPS
Creating C1:DAQ-DC0_C1OAF_CRC_SUM
Creating C1:DAQ-DC0_C1ALS_STATUS
Creating C1:DAQ-DC0_C1ALS_CRC_CPS
Creating C1:DAQ-DC0_C1ALS_CRC_SUM
Creating C1:DAQ-DC0_C1X03_STATUS
Creating C1:DAQ-DC0_C1X03_CRC_CPS
Creating C1:DAQ-DC0_C1X03_CRC_SUM
Creating C1:DAQ-DC0_C1IOO_STATUS
Creating C1:DAQ-DC0_C1IOO_CRC_CPS
Creating C1:DAQ-DC0_C1IOO_CRC_SUM
Creating C1:DAQ-DC0_C1MCS_STATUS
Creating C1:DAQ-DC0_C1MCS_CRC_CPS
Creating C1:DAQ-DC0_C1MCS_CRC_SUM
Creating C1:DAQ-DC0_C1RFM_STATUS
Creating C1:DAQ-DC0_C1RFM_CRC_CPS
Creating C1:DAQ-DC0_C1RFM_CRC_SUM
Creating C1:DAQ-DC0_C1PEM_STATUS
Creating C1:DAQ-DC0_C1PEM_CRC_CPS
Creating C1:DAQ-DC0_C1PEM_CRC_SUM
Creating C1:DAQ-DC0_C1X04_STATUS
Creating C1:DAQ-DC0_C1X04_CRC_CPS
Creating C1:DAQ-DC0_C1X04_CRC_SUM
Creating C1:DAQ-DC0_C1LSC_STATUS
Creating C1:DAQ-DC0_C1LSC_CRC_CPS
Creating C1:DAQ-DC0_C1LSC_CRC_SUM
Creating C1:DAQ-DC0_C1SCX_STATUS
Creating C1:DAQ-DC0_C1SCX_CRC_CPS
Creating C1:DAQ-DC0_C1SCX_CRC_SUM
Creating C1:DAQ-DC0_C1X05_STATUS
Creating C1:DAQ-DC0_C1X05_CRC_CPS
Creating C1:DAQ-DC0_C1X05_CRC_SUM
Creating C1:DAQ-DC0_C1SCY_STATUS
Creating C1:DAQ-DC0_C1SCY_CRC_CPS
Creating C1:DAQ-DC0_C1SCY_CRC_SUM
Creating C1:DAQ-DC0_C1ASS_STATUS
Creating C1:DAQ-DC0_C1ASS_CRC_CPS
Creating C1:DAQ-DC0_C1ASS_CRC_SUM
Creating C1:DAQ-DC0_C1CAL_STATUS
Creating C1:DAQ-DC0_C1CAL_CRC_CPS
Creating C1:DAQ-DC0_C1CAL_CRC_SUM
Creating C1:DAQ-DC0_C1MCC_STATUS
Creating C1:DAQ-DC0_C1MCC_CRC_CPS
Creating C1:DAQ-DC0_C1MCC_CRC_SUM
Creating C1:DAQ-DC0_C1MCP_STATUS
Creating C1:DAQ-DC0_C1MCP_CRC_CPS
Creating C1:DAQ-DC0_C1MCP_CRC_SUM
Creating C1:DAQ-DC0_C1LSP_STATUS
Creating C1:DAQ-DC0_C1LSP_CRC_CPS
Creating C1:DAQ-DC0_C1LSP_CRC_SUM
Creating C1:DAQ-DC0_C1SPX_STATUS
Creating C1:DAQ-DC0_C1SPX_CRC_CPS
Creating C1:DAQ-DC0_C1SPX_CRC_SUM
Creating C1:DAQ-DC0_C1SUP_STATUS
Creating C1:DAQ-DC0_C1SUP_CRC_CPS
Creating C1:DAQ-DC0_C1SUP_CRC_SUM
[Mon Aug  6 20:15:35 2012] Epics server started
[Mon Aug  6 20:15:35 2012] EDCU has 2553 channels configured; first=0

Symmetricom status: LOCKED
Starting at gps 1028344552 prev_gps 1028344552 frac 312500000 f 314094022
[Mon Aug  6 20:15:38 2012] Minute trender made GPS time correction; gps=1028344552; gps%60=52
Segmentation fault (core dumped)

  7101   Tue Aug 7 11:46:24 2012 JamieUpdateCDSAlex working on daqd

Alex is apparently working on daqd (remotely).  I'll report back when I find out more.

  7102   Tue Aug 7 14:17:07 2012 JamieUpdateCDSdaqd running again; related to c1sup issue

So daqd's problem was apparently the bad/non-running c1sup model.  The c1sup model, which I reported on attempting to get running in 7097, was not running because there were no available CPUs on the c1sus FE machine.  This was due to my stupid undercounting of the number of CPUs.  Anyway, for reasons I don't understand, this was causing daqd to segfault.  Removing c1sup from c1sus "fixed" the problem.

Alex agreed that daqd should definitely not be segfaulting in this circumstance.  It's still unclear exactly what daqd was looking at that was causing it to crash.

I'm going to move c1sup to c1iscex, which has a lot of spare CPUs.

  7103   Tue Aug 7 14:34:01 2012 JamieUpdateCDSjk. daqd still segfaulting

Quote:

So daqd's problem was apparently the bad/non-running c1sup model.  The c1sup model, which I reported on attempting to get running in 7097, was not running because there were no available CPUs on the c1sus FE machine.  This was due to my stupid undercounting of the number of CPUs.  Anyway, for reasons I don't understand, this was causing daqd to segfault.  Removing c1sup from c1sus "fixed" the problem.

Alex agreed that daqd should definitely not be segfaulting in this circumstance.  It's still unclear exactly what daqd was looking at that was causing it to crash.

I'm going to move c1sup to c1iscex, which has a lot of spare CPUs.

I spoke too soon.  It's still segfaulting, but at a different place. Alex and I are looking into it.

But another mystery solved is the cause of all the network slowness: the daqd core dump.  When daqd segfaults it dumps its core, which can typically be >4G, to /opt/rtcds/caltech/c1/target/fb/core.  This is of course an NFS mount from linux1, so it's dumping 4G on the network, which not surprisingly clogs the network.

  7105   Tue Aug 7 15:04:23 2012 JamieUpdateCDSdaqd problem was root-owned files and directories

Apparently the last problem was because of root-owned frame directories that daqd was trying to write to.  During debugging Alex had run daqd as root, but it's supposed to run as controls.  All the /frames directories are supposed to be owned by controls.  When daqd was run as root, it created new frame directories owned by root, which controls couldn't write to when I restarted daqd the proper way.  Once we chown'd the directories daqd started running again.
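A quick way to spot this in the future would be to walk the frame tree and list any directory whose owner is not the expected user. This is a minimal sketch, not something that was run at the time; dirs_not_owned_by is a made-up name.

```python
import os


def dirs_not_owned_by(root, uid):
    """Yield every directory under root (root included) whose owner is not uid."""
    for dirpath, _dirnames, _filenames in os.walk(root):
        if os.stat(dirpath).st_uid != uid:
            yield dirpath
```

On fb this could be called as list(dirs_not_owned_by("/frames", pwd.getpwnam("controls").pw_uid)), after importing pwd. Anything it returns would need the same chown that fixed the problem here.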

Alex also put in a "fix" for the core dump problem.  He touched an empty core file owned by root:

-rw-r--r-- 1 root root 0 Aug  7 14:38 /opt/rtcds/caltech/c1/target/fb/core

This will prevent any dying daqd process owned by controls from dumping its core at that location.  Personally I think this is a horribly hacky "solution" that doesn't actually fix any of the issues that were causing the segfaults to begin with, but it might prevent some of the network slowdown we see when the core does dump.  It's mostly just masking the problem, though, so I'm tempted to remove it so we all feel the pain when daqd starts shitting all over the network again.

  7161   Mon Aug 13 16:58:07 2012 jamieUpdateCDSmysterious stuck test points on c1spx model

We were not able to open up any test points in the revived c1spx model (dcuid 61).

Looking at the GDS_TP screen we found that every test point was being held open (C1:FEC-61_GDS_MON_?).  Tried closing all test points, awg and otherwise, with the diag command line (diag -l), but it would crash when we attempted to look at the test points for node 61.

Rebuild, install, restart of the model had no effect.  As soon as awgtpman came back up all the testpoints were full again.

I called Alex and he said he had seen this issue before as a problem with the mbuf kernel module.  Somehow the mbuf module was holding those memory locations open and not freeing them.

He suggested we reboot the machine or restart mbuf.  I used the following procedure to restart mbuf:

  • log into c1iscex as controls
  • sudo /etc/init.d/monit stop (needed so that monit doesn't auto-restart the awgtpman processes)
  • rtcds stop all
  • sudo /etc/init.d/mx_stream stop
  • sudo rmmod mbuf
  • sudo modprobe mbuf
  • sudo /etc/init.d/mx_stream start
  • sudo /etc/init.d/monit start
  • rtcds start all

Once this was done, all the test points were cleared.
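The procedure above could be scripted so the order doesn't get mixed up next time. Below is a sketch only: the commands are exactly the ones listed above, to be run on c1iscex as controls, while the wrapper restart_mbuf and its dry_run flag are made up. It defaults to a dry run that just prints the steps.

```python
import subprocess

# Steps in the order used above; monit is stopped first so that it
# doesn't auto-restart the awgtpman processes.
MBUF_RESTART_STEPS = [
    "sudo /etc/init.d/monit stop",
    "rtcds stop all",
    "sudo /etc/init.d/mx_stream stop",
    "sudo rmmod mbuf",
    "sudo modprobe mbuf",
    "sudo /etc/init.d/mx_stream start",
    "sudo /etc/init.d/monit start",
    "rtcds start all",
]


def restart_mbuf(dry_run=True):
    """Print each step; run it too unless dry_run is set. Stops on first failure."""
    for step in MBUF_RESTART_STEPS:
        print(step)
        if not dry_run:
            subprocess.run(step.split(), check=True)
    return list(MBUF_RESTART_STEPS)
```

With check=True a failing step raises an exception, so the module is not reloaded if stopping the models or mx_stream did not succeed.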

Alex seems to think this issue is fixed in a newer version of mbuf.  I should probably rebuild and install the updated mbuf kernel module at some point soon to prevent this happening again.

Unfortunately this isn't the end of the story, though.  While the test points were cleared, the channels were still not available from c1spx.

I looked in the framebuilder logs to see if I could see anything suspicious.  Grep'ing for the DCUID (61), I found something that looked a little problematic:

...
GDS server NODE=25 HOST=c1iscex DCUID=61
GDS server NODE=28 HOST=c1ioo DCUID=28
GDS server NODE=33 HOST=c1ioo DCUID=33
GDS server NODE=34 HOST=c1ioo DCUID=34
GDS server NODE=36 HOST=c1sus DCUID=36
GDS server NODE=38 HOST=c1sus DCUID=38
GDS server NODE=39 HOST=c1sus DCUID=39
GDS server NODE=40 HOST=c1lsc DCUID=40
GDS server NODE=42 HOST=c1lsc DCUID=42
GDS server NODE=45 HOST=c1iscex DCUID=45
GDS server NODE=46 HOST=c1iscey DCUID=46
GDS server NODE=47 HOST=c1iscey DCUID=47
GDS server NODE=48 HOST=c1lsc DCUID=48
GDS server NODE=50 HOST=c1lsc DCUID=50
GDS server NODE=60 HOST=c1lsc DCUID=60
GDS server NODE=61 HOST=c1iscex DCUID=61
...

Note that two nodes, 25 and 61, are associated with the same dcuid.  25 was the old dcuid of c1spx, before I renumbered it.  I tracked this down to the target/gds/param/testpoint.par file which had the following:

[C-node25]
hostname=c1iscex
system=c1spx
...
[C-node61]
hostname=c1iscex
system=c1spx

It appears that new dcuids are simply appended to this file, so a dcuid change can leave a duplicate entry for the same system.  I removed the offending old stanza and tried restarting fb again...
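Stale stanzas like this could be caught with a quick check of the file. Below is a rough sketch that treats testpoint.par as INI-style text, which matches the excerpt above but is an assumption about the rest of the file, and reports any system listed under more than one node. duplicate_systems is a made-up name.

```python
import configparser
from collections import defaultdict


def duplicate_systems(par_text):
    """Map each system that appears under several nodes to those node names."""
    parser = configparser.ConfigParser()
    parser.read_string(par_text)
    nodes_by_system = defaultdict(list)
    for section in parser.sections():
        system = parser.get(section, "system", fallback=None)
        if system is not None:
            nodes_by_system[system].append(section)
    return {system: nodes
            for system, nodes in nodes_by_system.items() if len(nodes) > 1}


# The two stanzas quoted above.
sample = """\
[C-node25]
hostname=c1iscex
system=c1spx
[C-node61]
hostname=c1iscex
system=c1spx
"""
print(duplicate_systems(sample))  # {'c1spx': ['C-node25', 'C-node61']}
```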

Unfortunately this didn't fix the issue either.  We're still not seeing any channels for c1spx.

  7162   Mon Aug 13 17:31:19 2012 jamieUpdateCDSmysterious stuck test points on c1spx model

Quote:

Unfortunately this didn't fix the issue either.  We're still not seeing any channels for c1spx.

So I was wrong, the channels are showing up.  I had forgotten that they are showing up under C1SUP, not C1SPX.

  7165   Mon Aug 13 20:12:29 2012 jamieUpdateCDSc1sup model moved to c1lsc machine

I moved the c1sup simplant model to the c1lsc machine, where there was one remaining available processor.  This requires changing a bunch of IPC routing in the c1sus and c1lsp models.  I have rebuilt and installed the models, and have restarted c1sup, but have not restarted c1sus and c1lsp since they're currently in use.  I'll restart them first thing tomorrow.

  7173   Tue Aug 14 11:33:14 2012 Jamie Alex DenUpdateCDSAI and AA filters

When signals are transmitted between the models running at different rates, no AI or AA filters are automatically applied. We need to fix our models.
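To illustrate why this matters, here is a small sketch of what happens when a fast signal is passed to a slower model with no AA filter. The rates (16384 Hz and 2048 Hz) and the 1800 Hz tone are assumptions chosen for the example, not values taken from our models.

```python
import math

FAST_RATE = 16384  # Hz, assumed rate of the sending model
SLOW_RATE = 2048   # Hz, assumed rate of the receiving model
TONE = 1800        # Hz, above the slow model's Nyquist frequency (1024 Hz)


def alias_frequency(f_signal, f_sample):
    """Frequency at which a tone shows up when sampled at f_sample unfiltered."""
    return abs(f_signal - f_sample * round(f_signal / f_sample))


def decimate_without_filter(samples, factor):
    """Keep every factor-th sample: a rate change with no AA filter."""
    return samples[::factor]


fast = [math.cos(2 * math.pi * TONE * n / FAST_RATE) for n in range(256)]
slow = decimate_without_filter(fast, FAST_RATE // SLOW_RATE)

alias = alias_frequency(TONE, SLOW_RATE)
expected = [math.cos(2 * math.pi * alias * k / SLOW_RATE) for k in range(len(slow))]
max_diff = max(abs(a - b) for a, b in zip(slow, expected))

print(alias)            # 248
print(max_diff < 1e-9)  # True
```

The decimated samples are indistinguishable from a 248 Hz tone, so the slow model sees a signal that was never there. An AA filter before the rate change would attenuate the 1800 Hz tone instead.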

ai.png
