40m QIL Cryo_Lab CTN SUS_Lab TCS_Lab OMC_Lab CRIME_Lab FEA ENG_Labs OptContFac Mariner WBEEShop
  40m Log  Not logged in ELOG logo
Entry  Tue Jun 30 11:33:00 2015, Jamie, Summary, CDS, prepping for CDS upgrade 
    Reply  Wed Jul 1 19:16:21 2015, Jamie, Summary, CDS, CDS upgrade in progress 
       Reply  Tue Jul 7 18:27:54 2015, Jamie, Summary, CDS, CDS upgrade: progress! 2.9-RTS-OK.pngC1X04_GDS_TP.png
          Reply  Wed Jul 8 20:37:02 2015, Jamie, Summary, CDS, CDS upgrade: one step forward, two steps back 
             Reply  Wed Jul 8 21:02:02 2015, Jamie, Summary, CDS, CDS upgrade: another step forward, so we're back to where we started (plus a bit?) 
                Reply  Thu Jul 9 13:26:47 2015, Jamie, Summary, CDS, CDS upgrade: new mx 1.2.16 installed 
                   Reply  Thu Jul 9 16:50:13 2015, Jamie, Summary, CDS, CDS upgrade: if all else fails try throwing metal at the problem 
                      Reply  Mon Jul 13 01:11:14 2015, Jamie, Summary, CDS, CDS upgrade: current assessment 
                         Reply  Mon Jul 13 18:12:50 2015, Jamie, Summary, CDS, CDS upgrade: left running in semi-stable configuration 
                            Reply  Tue Jul 14 09:08:37 2015, Jamie, Summary, CDS, CDS upgrade: left running in semi-stable configuration 
                               Reply  Tue Jul 14 10:28:02 2015, ericq, Summary, CDS, CDS upgrade: left running in semi-stable configuration 
                                  Reply  Tue Jul 14 11:57:27 2015, jamie, Summary, CDS, CDS upgrade: left running in semi-stable configuration 
                            Reply  Tue Jul 14 16:51:01 2015, Jamie, Summary, CDS, CDS upgrade: problem is not disk access 
                               Reply  Wed Jul 15 13:19:14 2015, Jamie, Summary, CDS, CDS upgrade: reducing mx end-points as last ditch effort 
                                  Reply  Wed Jul 15 18:19:12 2015, Jamie, Summary, CDS, CDS upgrade: tentative stabilty? 
                                     Reply  Sat Jul 18 15:37:19 2015, Jamie, Summary, CDS, CDS upgrade: current status cds-good.pngsus-damped.png
Message ID: 11396     Entry time: Wed Jul 8 20:37:02 2015     In reply to: 11393     Reply to this: 11397
Author: Jamie 
Type: Summary 
Category: CDS 
Subject: CDS upgrade: one step forward, two steps back 

After determining yesterday that all the daqd issues were coming from the frame writing, I started to dig into it more today.  I also spoke to Keith Thorne, and got some good suggestions from Gerrit Kuhn at GEO.

I  realized that it probably wasn't the trend writing per se, but that turning on more writing to disk was causing increased load on daqd, and consequently on the system itself.  With more frame writing turned on the memory consuption increased to the point of maxing out the physical RAM.  The system the probably starting swaping, which certainly would have choked daqd.

I noticed that fb only had 4G of RAM, which Keith suggested was just not enough.  Even if the memory consumption of daqd has increased significantly, it still seems like 4G would not be enough.  I opened up fb only to find that fb actually had 8G of RAM installed!  Not sure what happend to the other 4G, but somehow they were not visible to the system.  Koji and I eventually determined, via some frankenstein operations with megatron, that the RAM was just dead.  We then pulled 4G of RAM from megatron and replaced the bad RAM in fb, so that fb now has a full 8G of RAM cool.

Unfortunately, when we got fb fully back up and running we found that fb is not able to see any of the other hosts on the data concentrator network sad.  mx_info, which displays the card and network status for the myricom myrinet fiber card, shows:

MX Version: 1.2.16
MX Build: controls@fb:/opt/src/mx-1.2.16 Tue May 21 10:58:40 PDT 2013
1 Myrinet board installed.
The MX driver is configured to support a maximum of:
    8 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
===================================================================
Instance #0:  299.8 MHz LANai, PCI-E x8, 2 MB SRAM, on NUMA node 0
    Status:        Running, P0: Wrong Network
    Network:    Myrinet 10G

    MAC Address:    00:60:dd:46:ea:ec
    Product code:    10G-PCIE-8AL-S
    Part number:    09-03916
    Serial number:    352143
    Mapper:        00:60:dd:46:ea:ec, version = 0x63e745ee, configured
    Mapped hosts:    1

                                                        ROUTE COUNT
INDEX    MAC ADDRESS     HOST NAME                        P0
-----    -----------     ---------                        ---
   0) 00:60:dd:46:ea:ec fb:0                            D 0,0

Note that all front end machines should be listed in the table at the bottom, and they're not.   Also note the "Wrong Network" note in the Status line above.  It appears that the card has maybe been initialized in a bad state?  Or Koji and I somehow disturbed the network when we were cleaning up things in the rack.  "sudo /etc/init.d/mx restart" on fb doesn't solve the problem.  We even rebooted fb and it didn't seem to help.

In any event, we're back to no data flow.  I'll pick up again tomorrow.

ELOG V3.1.3-