40m QIL Cryo_Lab CTN SUS_Lab TCS_Lab OMC_Lab CRIME_Lab FEA ENG_Labs OptContFac Mariner WBEEShop
  40m Log  Not logged in ELOG logo
Entry  Tue Jun 30 11:33:00 2015, Jamie, Summary, CDS, prepping for CDS upgrade 
    Reply  Wed Jul 1 19:16:21 2015, Jamie, Summary, CDS, CDS upgrade in progress 
       Reply  Tue Jul 7 18:27:54 2015, Jamie, Summary, CDS, CDS upgrade: progress! 2.9-RTS-OK.pngC1X04_GDS_TP.png
          Reply  Wed Jul 8 20:37:02 2015, Jamie, Summary, CDS, CDS upgrade: one step forward, two steps back 
             Reply  Wed Jul 8 21:02:02 2015, Jamie, Summary, CDS, CDS upgrade: another step forward, so we're back to where we started (plus a bit?) 
                Reply  Thu Jul 9 13:26:47 2015, Jamie, Summary, CDS, CDS upgrade: new mx 1.2.16 installed 
                   Reply  Thu Jul 9 16:50:13 2015, Jamie, Summary, CDS, CDS upgrade: if all else fails try throwing metal at the problem 
                      Reply  Mon Jul 13 01:11:14 2015, Jamie, Summary, CDS, CDS upgrade: current assessment 
                         Reply  Mon Jul 13 18:12:50 2015, Jamie, Summary, CDS, CDS upgrade: left running in semi-stable configuration 
                            Reply  Tue Jul 14 09:08:37 2015, Jamie, Summary, CDS, CDS upgrade: left running in semi-stable configuration 
                               Reply  Tue Jul 14 10:28:02 2015, ericq, Summary, CDS, CDS upgrade: left running in semi-stable configuration 
                                  Reply  Tue Jul 14 11:57:27 2015, jamie, Summary, CDS, CDS upgrade: left running in semi-stable configuration 
                            Reply  Tue Jul 14 16:51:01 2015, Jamie, Summary, CDS, CDS upgrade: problem is not disk access 
                               Reply  Wed Jul 15 13:19:14 2015, Jamie, Summary, CDS, CDS upgrade: reducing mx end-points as last ditch effort 
                                  Reply  Wed Jul 15 18:19:12 2015, Jamie, Summary, CDS, CDS upgrade: tentative stabilty? 
                                     Reply  Sat Jul 18 15:37:19 2015, Jamie, Summary, CDS, CDS upgrade: current status cds-good.pngsus-damped.png
Message ID: 11397     Entry time: Wed Jul 8 21:02:02 2015     In reply to: 11396     Reply to this: 11398
Author: Jamie 
Type: Summary 
Category: CDS 
Subject: CDS upgrade: another step forward, so we're back to where we started (plus a bit?) 

Koji did a bit of googling to determine that 'Wrong Network' status message could be explained by the fb myrinet  operating in the wrong mode:
(This was the useful link to track down the issue (KA))
 

    Network:    Myrinet 10G

I didn't notice it before, but we should in fact be operating in "Ethernet" mode, since that's the fabric we're using for the DC network.  Digging a bit deeper we found that the new version of mx (1.2.16) had indeed been configured with a different compile option than the 1.2.15 version had:

controls@fb ~ 0$ grep '$ ./configure' /opt/src/mx-1.2.15/config.log          
  $ ./configure --enable-ether-mode --prefix=/opt/mx
controls@fb ~ 0$ grep '$ ./configure' /opt/src/mx-1.2.16/config.log
  $ ./configure --enable-mx-wire --prefix=/opt/mx-1.2.16
controls@fb ~ 0$

So that would entirely explain the problem.  I re-linked mx to the older version (1.2.15), reloaded the mx drivers, and everything showed up correctly:

controls@fb ~ 0$ /opt/mx/bin/mx_info
MX Version: 1.2.12
MX Build: root@fb:/root/mx-1.2.12 Mon Nov  1 13:34:38 PDT 2010
1 Myrinet board installed.
The MX driver is configured to support a maximum of:
    8 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
===================================================================
Instance #0:  299.8 MHz LANai, PCI-E x8, 2 MB SRAM, on NUMA node 0
    Status:        Running, P0: Link Up
    Network:    Ethernet 10G

    MAC Address:    00:60:dd:46:ea:ec
    Product code:    10G-PCIE-8AL-S
    Part number:    09-03916
    Serial number:    352143
    Mapper:        00:60:dd:46:ea:ec, version = 0x00000000, configured
    Mapped hosts:    6

                                                        ROUTE COUNT
INDEX    MAC ADDRESS     HOST NAME                        P0
-----    -----------     ---------                        ---
   0) 00:60:dd:46:ea:ec fb:0                              1,0
   1) 00:25:90:0d:75:bb c1sus:0                           1,0
   2) 00:30:48:be:11:5d c1iscex:0                         1,0
   3) 00:30:48:d6:11:17 c1iscey:0                         1,0
   4) 00:30:48:bf:69:4f c1lsc:0                           1,0
   5) 00:14:4f:40:64:25 c1ioo:0                           1,0
controls@fb ~ 0$

The front end hosts are also showing good omx info (even though they had been previously as well):

controls@c1lsc ~ 0$ /opt/open-mx/bin/omx_info
Open-MX version 1.5.2
 build: controls@fb:/opt/src/open-mx-1.5.2 Tue May 21 11:03:54 PDT 2013

Found 1 boards (32 max) supporting 32 endpoints each:
 c1lsc:0 (board #0 name eth1 addr 00:30:48:bf:69:4f)
   managed by driver 'igb'

Peer table is ready, mapper is 00:30:48:d6:11:17
================================================
  0) 00:30:48:bf:69:4f c1lsc:0
  1) 00:60:dd:46:ea:ec fb:0
  2) 00:25:90:0d:75:bb c1sus:0
  3) 00:30:48:be:11:5d c1iscex:0
  4) 00:30:48:d6:11:17 c1iscey:0
  5) 00:14:4f:40:64:25 c1ioo:0
controls@c1lsc ~ 0$

This got all the mx_stream connections back up and running.

Unfortunately, daqd is back to being a bit flaky.  With all frame writing enabled we saw daqd crash again.  I then shut off all trend frame writing and we're back to a marginally stable state: we have data flowing from all front ends, and full frames are being written, but not trends.

I'll pick up on this again tomorrow, and maybe try to rebuild the new version of mx with the proper flags.

ELOG V3.1.3-