Koji did a bit of googling and determined that the 'Wrong Network' status message could be explained by the Myrinet board on fb operating in the wrong mode:
(This was the useful link to track down the issue (KA))
Network: Myrinet 10G
I didn't notice it before, but we should in fact be operating in "Ethernet" mode, since that's the fabric we're using for the DC network. Digging a bit deeper, we found that the new version of mx (1.2.16) had indeed been configured with a different compile option than the 1.2.15 version:
controls@fb ~ 0$ grep '$ ./configure' /opt/src/mx-1.2.15/config.log
$ ./configure --enable-ether-mode --prefix=/opt/mx
controls@fb ~ 0$ grep '$ ./configure' /opt/src/mx-1.2.16/config.log
$ ./configure --enable-mx-wire --prefix=/opt/mx-1.2.16
controls@fb ~ 0$
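For future upgrades, the same comparison could be scripted so that a changed configure option stands out before the new build is linked in. This is only a sketch: the helper name configure_opts is made up here, and it assumes each config.log records its invocation on a line containing '$ ./configure', as the two greps above show.

```shell
# Hypothetical helper (not part of mx): print the options recorded on the
# '$ ./configure' line of an autoconf config.log, one per line, sorted.
configure_opts() {
    grep -F '$ ./configure' "$1" | head -n 1 \
        | sed 's/^.*\.\/configure *//' \
        | tr -s ' ' '\n' \
        | sort
}

# Possible usage in bash, with the paths from the greps above:
#   diff <(configure_opts /opt/src/mx-1.2.15/config.log) \
#        <(configure_opts /opt/src/mx-1.2.16/config.log)
```

With the two logs above, the diff should show --enable-ether-mode only on the 1.2.15 side and --enable-mx-wire only on the 1.2.16 side, along with the different --prefix values.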
So that would entirely explain the problem. I re-linked mx to the older version (1.2.15), reloaded the mx drivers, and everything showed up correctly:
controls@fb ~ 0$ /opt/mx/bin/mx_info
MX Version: 1.2.12
MX Build: root@fb:/root/mx-1.2.12 Mon Nov 1 13:34:38 PDT 2010
1 Myrinet board installed.
The MX driver is configured to support a maximum of:
8 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
===================================================================
Instance #0: 299.8 MHz LANai, PCI-E x8, 2 MB SRAM, on NUMA node 0
Status: Running, P0: Link Up
Network: Ethernet 10G
MAC Address: 00:60:dd:46:ea:ec
Product code: 10G-PCIE-8AL-S
Part number: 09-03916
Serial number: 352143
Mapper: 00:60:dd:46:ea:ec, version = 0x00000000, configured
Mapped hosts: 6
ROUTE COUNT
INDEX MAC ADDRESS HOST NAME P0
----- ----------- --------- ---
0) 00:60:dd:46:ea:ec fb:0 1,0
1) 00:25:90:0d:75:bb c1sus:0 1,0
2) 00:30:48:be:11:5d c1iscex:0 1,0
3) 00:30:48:d6:11:17 c1iscey:0 1,0
4) 00:30:48:bf:69:4f c1lsc:0 1,0
5) 00:14:4f:40:64:25 c1ioo:0 1,0
controls@fb ~ 0$
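Since the "Network:" line is what gave the problem away, a small check on it could be run after any mx rebuild or driver reload. The sketch below is an assumption-laden helper, not something installed on fb: the function names are invented, and it relies only on the "Network: ..." line format seen in the mx_info output above.

```shell
# Hypothetical helpers: read mx_info output on stdin.
# Print the value of the first "Network:" line, e.g. "Ethernet 10G".
mx_network_mode() {
    awk -F': *' '/^[[:space:]]*Network:/ { print $2; exit }'
}

# Succeed only if the board reports Ethernet mode.
mx_is_ether_mode() {
    case "$(mx_network_mode)" in
        Ethernet*) return 0 ;;
        *)         return 1 ;;
    esac
}

# Possible usage:
#   /opt/mx/bin/mx_info | mx_is_ether_mode || echo "fb is not in Ethernet mode"
```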
The front-end hosts are also showing good omx info (as they had been before the fix as well):
controls@c1lsc ~ 0$ /opt/open-mx/bin/omx_info
Open-MX version 1.5.2
build: controls@fb:/opt/src/open-mx-1.5.2 Tue May 21 11:03:54 PDT 2013
Found 1 boards (32 max) supporting 32 endpoints each:
c1lsc:0 (board #0 name eth1 addr 00:30:48:bf:69:4f)
managed by driver 'igb'
Peer table is ready, mapper is 00:30:48:d6:11:17
================================================
0) 00:30:48:bf:69:4f c1lsc:0
1) 00:60:dd:46:ea:ec fb:0
2) 00:25:90:0d:75:bb c1sus:0
3) 00:30:48:be:11:5d c1iscex:0
4) 00:30:48:d6:11:17 c1iscey:0
5) 00:14:4f:40:64:25 c1ioo:0
controls@c1lsc ~ 0$
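To confirm that fb and a front end agree on the set of peers, the two tables can be reduced to sorted "MAC host" pairs and compared. This is a rough sketch with an invented helper name (peer_list); it assumes the table rows look like the ones above, i.e. an index followed by ")", then a MAC address, then the host name.

```shell
# Hypothetical helper: read mx_info or omx_info output on stdin and print
# "MAC hostname" for each peer-table row, sorted so two tables can be diffed.
peer_list() {
    awk '$1 ~ /^[0-9]+\)$/ && length($2) == 17 && $2 ~ /^[0-9a-f:]+$/ { print $2, $3 }' \
        | sort
}

# Possible usage, saving each list on its own host and comparing afterwards:
#   /opt/mx/bin/mx_info       | peer_list > fb.peers      (on fb)
#   /opt/open-mx/bin/omx_info | peer_list > c1lsc.peers   (on c1lsc)
#   diff fb.peers c1lsc.peers
```

For the two outputs shown above, both lists should contain the same six hosts, so the diff should be empty.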
This got all the mx_stream connections back up and running.
Unfortunately, daqd is back to being a bit flaky. With all frame writing enabled we saw daqd crash again. I then shut off all trend frame writing and we're back to a marginally stable state: we have data flowing from all front ends, and full frames are being written, but not trends.
I'll pick up on this again tomorrow, and maybe try to rebuild the new version of mx with the proper flags.