I tried one last thing, suggested by Keith and Gerrit. I tried reducing the number of mx end-points on fb to zero, which should reduce the total number of fb threads, in the hope that the extra threads were causing the chokes.
On Tue, Jul 14 2015, Keith Thorne <kthorne@ligo-la.caltech.edu> wrote:
> Assumptions
> 1) Before the upgrade (from RCG 2.6?), the DAQ had been working, reading out front-ends, writing frames trends
> 2) In upgrading to RCG 2.9, the mx start-up on the frame builder was modified to use multiple end-points
> (i.e. /etc/init.d/mx has a line like
> # 1 10G card - X2
> MX_MODULE_PARAMS="mx_max_instance=1 mx_max_endpoints=16 $MX_MODULE_PARAMS"
> (This can be confirmed by the daqd log file with lines at the top like
> 263596
> MX has 16 maximum end-points configured
> 2 MX NICs available
> [Fri Jul 10 16:12:50 2015] ->4: set thread_stack_size=10240
> [Fri Jul 10 16:12:50 2015] new threads will be created with the stack of size 10
> 240K
>
> If this is the case, the problem may be that the additional thread on the frame-builder (one per end-point) take up so many slots on the 8-core
> frame-builder that they interrupt the frame-writing thread, thus preventing the main buffer from being emptied.
>
> One could go back to a single end-point. This only helps keep restart of front-end A from hiccuping DAQ for front-end B.
>
> You would have to remove code on front-ends (/etc/init.d/mx_stream) that chooses endpoints. i.e.
> # find line number in rtsystab. Use that to mx_stream slot on card (0-15)
> line_num=`grep -v ^# /etc/rtsystab | grep --perl-regexp -n "^${hostname}\s" | se
> d 's/^\([0-9]*\):.*/\1/g'`
> line_off=$(expr $line_num - 1)
> epnum=$(expr $line_off % 2)
> cnum=$(expr $line_off / 2)
>
> start-stop-daemon --start --quiet -b -m --pidfile /var/log/mx_stream0.pid --exec /opt/rtcds/tst/x2/target/x2daqdc0/mx_stream -- -e 0 -r "$epnum" -W 0 -w 0 -s "$sys" -d x2daqdc0:$cnum -l /opt/rtcds/tst/x2/target/x2daqdc0/mx_stream_logs/$hostname.log
As per Keith's suggestion, I modified the mx startup script to only initialize a single endpoint, and I modified the mx_stream startup to point them all to endpoint 0. I verified that indeed daqd was a single MX end-point:
MX has 1 maximum end-points configured
It didn't help. After 5-10 minutes daqd crashes with the same "0 empty blocks" messages.
I should also mention that I'm pretty sure the start of these messages does not seem coincident with any frame writing to disk; further evidence that it's not a disk IO issue.
Keith is looking at the system now, so we if he can see anything obvious. If not, I will start reverting to 2.5. |