After determining yesterday that all the daqd issues were coming from the frame writing, I started to dig into it more today. I also spoke to Keith Thorne, and got some good suggestions from Gerrit Kuhn at GEO.
I realized that it probably wasn't the trend writing per se, but that turning on more writing to disk was causing increased load on daqd, and consequently on the system itself. With more frame writing turned on, the memory consumption increased to the point of maxing out the physical RAM. The system then probably started swapping, which certainly would have choked daqd.
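One way to confirm the RAM-exhaustion and swapping hypothesis is to watch /proc/meminfo while frame writing is enabled. The following is a minimal sketch, not something that was run on fb: the function names parse_meminfo and memory_pressure are made up for illustration, and it assumes a Linux host exposing the standard MemTotal, MemFree, SwapTotal and SwapFree fields.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def memory_pressure(info):
    """Return (fraction of RAM in use, swap in use in kB).

    MemAvailable is missing on older kernels, so fall back to MemFree.
    MemFree does not count reclaimable page cache, so with the fallback
    the "in use" fraction is an overestimate.
    """
    total = info["MemTotal"]
    available = info.get("MemAvailable", info["MemFree"])
    swap_used = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
    return (total - available) / total, swap_used


# Hypothetical usage on the host under test:
#   with open("/proc/meminfo") as f:
#       used_frac, swap_kb = memory_pressure(parse_meminfo(f.read()))
#   print(f"RAM in use: {used_frac:.0%}, swap in use: {swap_kb} kB")
```

A swap-in-use figure that keeps growing while daqd is writing frames would support the swapping explanation; a steady value near zero would argue against it.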
I noticed that fb only had 4G of RAM, which Keith suggested was just not enough. Even if the memory consumption of daqd has increased significantly, it still seems like 4G would not be enough. I opened up fb only to find that fb actually had 8G of RAM installed! Not sure what happened to the other 4G, but somehow it was not visible to the system. Koji and I eventually determined, via some frankenstein operations with megatron, that the RAM was just dead. We then pulled 4G of RAM from megatron and replaced the bad RAM in fb, so that fb now has a full 8G of RAM.
Unfortunately, when we got fb fully back up and running we found that fb is not able to see any of the other hosts on the data concentrator network. mx_info, which displays the card and network status for the myricom myrinet fiber card, shows:
MX Version: 1.2.16
MX Build: controls@fb:/opt/src/mx-1.2.16 Tue May 21 10:58:40 PDT 2013
1 Myrinet board installed.
The MX driver is configured to support a maximum of:
8 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
===================================================================
Instance #0: 299.8 MHz LANai, PCI-E x8, 2 MB SRAM, on NUMA node 0
Status: Running, P0: Wrong Network
Network: Myrinet 10G
MAC Address: 00:60:dd:46:ea:ec
Product code: 10G-PCIE-8AL-S
Part number: 09-03916
Serial number: 352143
Mapper: 00:60:dd:46:ea:ec, version = 0x63e745ee, configured
Mapped hosts: 1
ROUTE COUNT
INDEX MAC ADDRESS HOST NAME P0
----- ----------- --------- ---
0) 00:60:dd:46:ea:ec fb:0 D 0,0
Note that all front end machines should be listed in the table at the bottom, and they're not. Also note the "Wrong Network" note in the Status line above. Maybe the card was initialized in a bad state, or Koji and I somehow disturbed the network when we were cleaning things up in the rack. "sudo /etc/init.d/mx restart" on fb doesn't solve the problem. We even rebooted fb, and that didn't help either.
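The two symptoms above (the "Wrong Network" status and the missing front ends in the host table) can be checked automatically after each restart attempt. This is only a sketch: the parsing is based solely on the mx_info output pasted above and may not hold for other MX versions, the function names are made up, and "frontend1:0" is a placeholder, not a real host name.

```python
import re

# Host table rows look like: "0) 00:60:dd:46:ea:ec fb:0 D 0,0"
HOST_ROW = re.compile(
    r"^\s*\d+\)\s+((?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2})\s+(\S+)"
)


def summarize_mx_info(text):
    """Extract the Status line, mapped-host count and host names."""
    status = None
    mapped = None
    hosts = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Status:"):
            status = stripped[len("Status:"):].strip()
        elif stripped.startswith("Mapped hosts:"):
            mapped = int(stripped.split(":", 1)[1])
        else:
            match = HOST_ROW.match(line)
            if match:
                hosts.append(match.group(2))
    return {"status": status, "mapped_hosts": mapped, "hosts": hosts}


def looks_healthy(summary, expected_hosts):
    """Return (ok, missing): ok is False on 'Wrong Network' or missing hosts."""
    missing = sorted(set(expected_hosts) - set(summary["hosts"]))
    status = summary["status"]
    bad_status = status is None or "Wrong Network" in status
    return (not bad_status and not missing), missing


# Hypothetical usage, with the expected list filled in by hand:
#   import subprocess
#   out = subprocess.run(["mx_info"], capture_output=True, text=True).stdout
#   print(looks_healthy(summarize_mx_info(out), ["fb:0", "frontend1:0"]))
```

Applied to the output above, this would report an unhealthy state: the status contains "Wrong Network" and only fb:0 is mapped.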
In any event, we're back to no data flow. I'll pick up again tomorrow.