Entry  Mon Aug 13 16:58:07 2012, jamie, Update, CDS, mysterious stuck test points on c1spx model 
    Reply  Mon Aug 13 17:31:19 2012, jamie, Update, CDS, mysterious stuck test points on c1spx model 
Message ID: 7161     Entry time: Mon Aug 13 16:58:07 2012     Reply to this: 7162
Author: jamie 
Type: Update 
Category: CDS 
Subject: mysterious stuck test points on c1spx model 

We were not able to open up any test points in the revived c1spx model (dcuid 61).

Looking at the GDS_TP screen we found that every test point was being held open (C1:FEC-61_GDS_MON_?).  We tried closing all test points, awg and otherwise, with the diag command line (diag -l), but it crashed whenever we attempted to look at the test points for node 61.

Rebuilding, installing, and restarting the model had no effect.  As soon as awgtpman came back up, all the test points were full again.

I called Alex and he said he had seen this issue before as a problem with the mbuf kernel module.  Somehow the mbuf module was holding those memory locations open and not freeing them.

He suggested we reboot the machine or restart mbuf.  I used the following procedure to restart mbuf:

  • log into c1iscex as controls
  • sudo /etc/init.d/monit stop (needed so that monit doesn't auto-restart the awgtpman processes)
  • rtcds stop all
  • sudo /etc/init.d/mx_stream stop
  • sudo rmmod mbuf
  • sudo modprobe mbuf
  • sudo /etc/init.d/mx_stream start
  • sudo /etc/init.d/monit start
  • rtcds start all

Once this was done, all the test points were cleared.
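
For future reference, the restart sequence above could be wrapped in a small script.  This is only a sketch: the commands and their order are the ones listed above, but the run helper and the DRY_RUN switch are my own additions for illustration (not part of rtcds or any CDS tooling), and it assumes you are already logged into the front end as controls.  By default it only prints what it would do.

```shell
#!/bin/bash
# Sketch of the mbuf restart sequence listed above.
# DRY_RUN is a hypothetical switch added for illustration: when it is 1
# (the default here) each command is printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

restart_mbuf() {
    # Stop monit first so it does not auto-restart the awgtpman processes.
    run sudo /etc/init.d/monit stop || return 1
    run rtcds stop all || return 1
    run sudo /etc/init.d/mx_stream stop || return 1
    # Reload the kernel module.
    run sudo rmmod mbuf || return 1
    run sudo modprobe mbuf || return 1
    # Bring everything back up in the order used above.
    run sudo /etc/init.d/mx_stream start || return 1
    run sudo /etc/init.d/monit start || return 1
    run rtcds start all
}
```

Calling restart_mbuf as is just lists the eight commands; setting DRY_RUN=0 before calling it would actually run them, stopping at the first failure.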

Alex seems to think this issue is fixed in a newer version of mbuf.  I should probably rebuild and install the updated mbuf kernel module at some point soon to prevent this happening again.

Unfortunately this isn't the end of the story, though.  While the test points were cleared, the channels were still not available from c1spx.

I looked in the framebuilder logs to see if I could see anything suspicious.  Grep'ing for the DCUID (61), I found something that looked a little problematic:

...
GDS server NODE=25 HOST=c1iscex DCUID=61
GDS server NODE=28 HOST=c1ioo DCUID=28
GDS server NODE=33 HOST=c1ioo DCUID=33
GDS server NODE=34 HOST=c1ioo DCUID=34
GDS server NODE=36 HOST=c1sus DCUID=36
GDS server NODE=38 HOST=c1sus DCUID=38
GDS server NODE=39 HOST=c1sus DCUID=39
GDS server NODE=40 HOST=c1lsc DCUID=40
GDS server NODE=42 HOST=c1lsc DCUID=42
GDS server NODE=45 HOST=c1iscex DCUID=45
GDS server NODE=46 HOST=c1iscey DCUID=46
GDS server NODE=47 HOST=c1iscey DCUID=47
GDS server NODE=48 HOST=c1lsc DCUID=48
GDS server NODE=50 HOST=c1lsc DCUID=50
GDS server NODE=60 HOST=c1lsc DCUID=60
GDS server NODE=61 HOST=c1iscex DCUID=61
...
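
A repeated DCUID in a list like this is easy to miss by eye.  Here is a rough sketch of a check that reads "GDS server" lines on stdin and reports any DCUID claimed by more than one node.  The function name find_dup_dcuids is made up for illustration, and it assumes the log lines have exactly the NODE=/HOST=/DCUID= layout shown above.

```shell
# find_dup_dcuids (hypothetical helper): read "GDS server NODE=.. HOST=.. DCUID=.."
# lines on stdin and print every DCUID that is claimed by more than one node.
find_dup_dcuids() {
    awk '/GDS server NODE=/ {
        node = ""; dcuid = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^NODE=/)  { node  = substr($i, 6) }
            if ($i ~ /^DCUID=/) { dcuid = substr($i, 7) }
        }
        if (node == "" || dcuid == "") next
        count[dcuid]++
        # Collect the node numbers seen for this DCUID, in order of appearance.
        nodes[dcuid] = (count[dcuid] > 1) ? nodes[dcuid] "," node : node
    }
    END {
        for (d in count)
            if (count[d] > 1) print "DCUID " d " nodes " nodes[d]
    }'
}
```

Piping the grep output from the framebuilder log into find_dup_dcuids should print one line per duplicated DCUID and nothing if every DCUID is unique.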

Note that two nodes, 25 and 61, are associated with the same dcuid.  25 was the old dcuid of c1spx, before I renumbered it.  I tracked this down to the target/gds/param/testpoint.par file which had the following:

[C-node25]
hostname=c1iscex
system=c1spx
...
[C-node61]
hostname=c1iscex
system=c1spx

It appears that new dcuids are simply appended to this file, so a renumbered model ends up with duplicate stanzas.  I removed the offending old stanza and tried restarting fb again...
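
To catch stale stanzas like this after any future dcuid renumbering, something along these lines could scan testpoint.par for systems that appear under more than one node.  It is a sketch only: find_dup_systems is a hypothetical helper name, and it assumes the file uses just the [C-nodeNN] / hostname= / system= layout shown above.

```shell
# find_dup_systems (hypothetical helper): print every system= value that
# appears under more than one [stanza] header in the given .par file.
find_dup_systems() {
    awk '
    # Remember the most recent stanza header, without the brackets.
    /^\[.*\]$/ { stanza = substr($0, 2, length($0) - 2); next }
    /^system=/ {
        sys = substr($0, 8)
        count[sys]++
        where[sys] = (count[sys] > 1) ? where[sys] "," stanza : stanza
    }
    END {
        for (s in count)
            if (count[s] > 1) print s " in " where[s]
    }' "$1"
}
```

Run against the file as it was before my edit, this would be expected to flag c1spx under both C-node25 and C-node61.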

Unfortunately this didn't fix the issue either.  We're still not seeing any channels for c1spx.
