Message ID: 11400     Entry time: Thu Jul 9 16:50:13 2015     In reply to: 11398     Reply to this: 11402
Author: Jamie 
Type: Summary 
Category: CDS 
Subject: CDS upgrade: if all else fails try throwing metal at the problem 

I roped Rolf into coming over and adding his eyes to the problem.  After much discussion we couldn't come up with any reasonable explanation for the problems we've been seeing other than daqd just needing a lot more resources than it did before.  He said he had some old Sun Fire X4600s from which we could pilfer memory.  I went over to Downs and ripped all the CPU/memory cards out of one of his machines and stuffed them into fb:

fb now has 8 CPUs and 16G of RAM

Unfortunately, this is still not enough.  Or at least it didn't solve the problem; daqd is showing the same instabilities, falling over a couple of minutes after I turn on trend frame writing.  As always, before daqd fails it starts spitting out the following to the logs:

[Thu Jul  9 16:37:09 2015] main profiler warning: 0 empty blocks in the buffer

followed by lines like:

[Thu Jul  9 16:37:27 2015] GPS MISS dcu 44 (ASX); dcu_gps=1120520264 gps=1120519812

right before it dies.
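
For anyone combing the logs later, here is a minimal Python sketch that pulls the DCU id, model name, and the difference between the two GPS fields out of a "GPS MISS" line. The regex is based only on the single line quoted above, so treat the format as an assumption; parse_gps_miss is a hypothetical helper, not part of daqd, and it only computes dcu_gps minus gps without claiming what daqd means by either field.

```python
import re

# Pattern matching the "GPS MISS" line format quoted in this entry.
# Assumption: other GPS MISS lines in the daqd log follow the same layout.
GPS_MISS_RE = re.compile(
    r"GPS MISS dcu (?P<dcu>\d+) \((?P<name>\w+)\); "
    r"dcu_gps=(?P<dcu_gps>\d+) gps=(?P<gps>\d+)"
)


def parse_gps_miss(line):
    """Return (dcu id, model name, dcu_gps - gps in seconds), or None if no match."""
    m = GPS_MISS_RE.search(line)
    if m is None:
        return None
    return int(m["dcu"]), m["name"], int(m["dcu_gps"]) - int(m["gps"])


line = ("[Thu Jul  9 16:37:27 2015] GPS MISS dcu 44 (ASX); "
        "dcu_gps=1120520264 gps=1120519812")
print(parse_gps_miss(line))
```

For the line quoted above, the two GPS values differ by 452 seconds.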

I'm no longer convinced that this is a resource issue, though, judging by the resource usage right before the crash:

top - 16:47:32 up 48 min,  5 users,  load average: 0.91, 0.62, 0.61
Tasks:   2 total,   0 running,   2 sleeping,   0 stopped,   0 zombie
Cpu(s):  8.9%us,  0.9%sy,  0.0%ni, 89.1%id,  0.9%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:  15952104k total, 13063468k used,  2888636k free,   138648k buffers
Swap:  1023996k total,        0k used,  1023996k free,  7672292k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12016 controls  20   0 8098m 4.4g 104m S  106 29.1   6:45.79 daqd
 4953 controls  20   0 53580 6092 5096 S    0  0.0   0:00.04 nds

Load average is less than 1 per CPU, there is plenty of free memory (~3G free, 0 swap), no waiting on IO (0.9%wa), etc.  daqd is using lots of threads, which should be spread across many CPUs, so even the >100% CPU should be ok.  I'm at a loss...
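
The arithmetic behind that reading of the top output can be sketched in Python as follows. summarize_top is a hypothetical helper written against the header lines pasted in this entry (an older procps layout with "k" suffixes); the CPU count of 8 comes from the fb upgrade described above, and top's "k" is taken to mean KiB.

```python
import re

# Header lines copied from the top output in this entry.
HEADER = """\
top - 16:47:32 up 48 min,  5 users,  load average: 0.91, 0.62, 0.61
Cpu(s):  8.9%us,  0.9%sy,  0.0%ni, 89.1%id,  0.9%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:  15952104k total, 13063468k used,  2888636k free,   138648k buffers
Swap:  1023996k total,        0k used,  1023996k free,  7672292k cached
"""


def summarize_top(header, ncpu):
    """Extract 1-min load per CPU, free/cached memory (GiB), and IO wait (%)."""
    load1 = float(re.search(r"load average:\s*([\d.]+)", header).group(1))
    free_k = int(re.search(r"Mem:.*?(\d+)k free", header).group(1))
    cached_k = int(re.search(r"(\d+)k cached", header).group(1))
    iowait = float(re.search(r"([\d.]+)%wa", header).group(1))
    return {
        "load_per_cpu": load1 / ncpu,
        "free_gib": free_k / 2**20,      # KiB -> GiB
        "cached_gib": cached_k / 2**20,  # page cache, reclaimable if needed
        "iowait_pct": iowait,
    }


summary = summarize_top(HEADER, ncpu=8)
for key, value in summary.items():
    print(f"{key}: {value:.2f}")
```

The cached figure is included because page cache can be reclaimed, so the memory actually available to daqd is larger than the "free" column alone suggests.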
