We were not able to open up any test points in the revived c1spx model (dcuid 61).
Looking at the GDS_TP screen, we found that every test point was being held open (C1:FEC-61_GDS_MON_?). We tried closing all test points, awg and otherwise, with the diag command line (diag -l), but it would crash when we attempted to look at the test points for node 61.
A rebuild, install, and restart of the model had no effect. As soon as awgtpman came back up, all the test points were full again.
I called Alex and he said he had seen this issue before as a problem with the mbuf kernel module. Somehow the mbuf module was holding those memory locations open and not freeing them.
He suggested we reboot the machine or restart mbuf. I used the following procedure to restart mbuf:
- log into c1iscex as controls
- sudo /etc/init.d/monit stop (needed so that monit doesn't auto-restart the awgtpman processes)
- rtcds stop all
- sudo /etc/init.d/mx_stream stop
- sudo rmmod mbuf
- sudo modprobe mbuf
- sudo /etc/init.d/mx_stream start
- sudo /etc/init.d/monit start
- rtcds start all
Once this was done, all the test points were cleared.
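For future reference, the steps above could be wrapped in a small script so the ordering is not left to memory. This is only a sketch: the commands are copied verbatim from the list above, but the function name and the dry-run flag are my own illustrative choices, and it assumes it is run on c1iscex as controls with sudo rights.

```python
import subprocess

# Commands exactly as listed in the procedure above, in the same order.
MBUF_RESTART_STEPS = [
    "sudo /etc/init.d/monit stop",       # keep monit from auto-restarting awgtpman
    "rtcds stop all",
    "sudo /etc/init.d/mx_stream stop",
    "sudo rmmod mbuf",
    "sudo modprobe mbuf",
    "sudo /etc/init.d/mx_stream start",
    "sudo /etc/init.d/monit start",
    "rtcds start all",
]

def restart_mbuf(dry_run=True):
    """Handle each step in order and return the list of commands handled.

    With dry_run=True (the default) the commands are only printed.
    With dry_run=False each command is executed, and the first failure
    raises CalledProcessError so later steps are not attempted.
    """
    handled = []
    for cmd in MBUF_RESTART_STEPS:
        if dry_run:
            print("would run:", cmd)
        else:
            subprocess.run(cmd.split(), check=True)
        handled.append(cmd)
    return handled

# Dry run only; nothing is stopped or unloaded here.
restart_mbuf()
```

Defaulting to a dry run is deliberate, since unloading mbuf takes down every model on the front end.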
Alex seems to think this issue is fixed in a newer version of mbuf. I should probably rebuild and install the updated mbuf kernel module at some point soon to prevent this happening again.
Unfortunately this isn't the end of the story, though. While the test points were cleared, the channels were still not available from c1spx.
I looked in the framebuilder logs to see if I could see anything suspicious. Grep'ing for the DCUID (61), I found something that looked a little problematic:
...
GDS server NODE=25 HOST=c1iscex DCUID=61
GDS server NODE=28 HOST=c1ioo DCUID=28
GDS server NODE=33 HOST=c1ioo DCUID=33
GDS server NODE=34 HOST=c1ioo DCUID=34
GDS server NODE=36 HOST=c1sus DCUID=36
GDS server NODE=38 HOST=c1sus DCUID=38
GDS server NODE=39 HOST=c1sus DCUID=39
GDS server NODE=40 HOST=c1lsc DCUID=40
GDS server NODE=42 HOST=c1lsc DCUID=42
GDS server NODE=45 HOST=c1iscex DCUID=45
GDS server NODE=46 HOST=c1iscey DCUID=46
GDS server NODE=47 HOST=c1iscey DCUID=47
GDS server NODE=48 HOST=c1lsc DCUID=48
GDS server NODE=50 HOST=c1lsc DCUID=50
GDS server NODE=60 HOST=c1lsc DCUID=60
GDS server NODE=61 HOST=c1iscex DCUID=61
...
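A listing like the one above can also be checked mechanically for a DCUID claimed by more than one node. The following is a minimal sketch that assumes the log lines have exactly the "GDS server NODE=... HOST=... DCUID=..." form shown; the function name is hypothetical and the sample is a three-line subset of the excerpt.

```python
import re
from collections import defaultdict

# Matches lines of the form shown in the log excerpt, e.g.
# "GDS server NODE=25 HOST=c1iscex DCUID=61"
GDS_LINE = re.compile(r"GDS server NODE=(\d+) HOST=(\S+) DCUID=(\d+)")

def duplicate_dcuids(log_text):
    """Return {dcuid: [node, ...]} for every DCUID claimed by more than one node."""
    nodes_by_dcuid = defaultdict(list)
    for match in GDS_LINE.finditer(log_text):
        node, _host, dcuid = match.groups()
        nodes_by_dcuid[int(dcuid)].append(int(node))
    return {d: n for d, n in nodes_by_dcuid.items() if len(n) > 1}

# Subset of the lines from the framebuilder log above.
sample = """\
GDS server NODE=25 HOST=c1iscex DCUID=61
GDS server NODE=45 HOST=c1iscex DCUID=45
GDS server NODE=61 HOST=c1iscex DCUID=61
"""
print(duplicate_dcuids(sample))  # {61: [25, 61]}
```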
Note that two nodes, 25 and 61, are associated with the same dcuid. 25 was the old dcuid of c1spx, before I renumbered it. I tracked this down to the target/gds/param/testpoint.par file which had the following:
[C-node25]
hostname=c1iscex
system=c1spx
...
[C-node61]
hostname=c1iscex
system=c1spx
It appears that new dcuids are simply appended to this file, so a dcuid change can leave a duplicate entry behind. I removed the offending old stanza and tried restarting fb again...
Unfortunately this didn't fix the issue either. We're still not seeing any channels for c1spx.
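To catch leftover stanzas like this after a future dcuid change, testpoint.par could be scanned for systems that appear under more than one node. This is a rough sketch that assumes the file contains only "[C-nodeN]" headers followed by key=value lines, as in the excerpt above; I have not checked it against the full file, and the function name is made up for illustration.

```python
import re
from collections import defaultdict

def duplicate_systems(par_text):
    """Return {system: [node, ...]} for systems listed under more than one node stanza."""
    nodes_by_system = defaultdict(list)
    current_node = None
    for raw in par_text.splitlines():
        line = raw.strip()
        # Stanza header such as "[C-node25]"; remember which node we are in.
        header = re.fullmatch(r"\[\w-node(\d+)\]", line)
        if header:
            current_node = int(header.group(1))
        elif line.startswith("system=") and current_node is not None:
            system = line.split("=", 1)[1].strip()
            nodes_by_system[system].append(current_node)
    return {s: n for s, n in nodes_by_system.items() if len(n) > 1}

# The two stanzas quoted from testpoint.par above.
sample_par = """\
[C-node25]
hostname=c1iscex
system=c1spx
[C-node61]
hostname=c1iscex
system=c1spx
"""
print(duplicate_systems(sample_par))  # {'c1spx': [25, 61]}
```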