  12155   Tue Jun 7 20:49:50 2016   jamie   Update   CDS   DAQD work ongoing

Summary: new daqd code running overnight test on fb1.  Stability issues persist.

The code is from Keith's "tests/advLigoRTS-40m" branch, which is a branch of the current trunk.  It's supposed to include patches to fix the crashes when multiple frame types are written to disk at the same time.  However, the issue is not fixed:

2016-06-07_20:38:55 about to write frame @ 1149392336
2016-06-07_20:38:55 Begin Full WriteFrame()
2016-06-07_20:38:57 full frame write done in 2seconds
2016-06-07_20:39:11 about to write frame @ 1149392352
2016-06-07_20:39:11 Begin Full WriteFrame()
2016-06-07_20:39:13 full frame write done in 2seconds
2016-06-07_20:39:27 about to write frame @ 1149392368
2016-06-07_20:39:27 Begin Full WriteFrame()
2016-06-07_20:39:29 full frame write done in 2seconds
2016-06-07_20:39:43 about to write second trend frame @ 1149391800
2016-06-07_20:39:43 Begin second trend WriteFrame()
2016-06-07_20:39:43 about to write frame @ 1149392384
2016-06-07_20:39:43 Begin Full WriteFrame()
2016-06-07_20:39:44 full frame write done in 1seconds
2016-06-07_20:39:59 about to write frame @ 1149392400
2016-06-07_20:40:04 Begin Full WriteFrame()
2016-06-07_20:40:04 Second trend frame write done in 21 seconds
2016-06-07_20:40:14 [Tue Jun  7 20:40:14 2016] main profiler warning: 1 empty blocks in the buffer
2016-06-07_20:40:15 [Tue Jun  7 20:40:15 2016] main profiler warning: 0 empty blocks in the buffer
2016-06-07_20:40:16 [Tue Jun  7 20:40:16 2016] main profiler warning: 0 empty blocks in the buffer
2016-06-07_20:40:17 [Tue Jun  7 20:40:17 2016] main profiler warning: 0 empty blocks in the buffer
2016-06-07_20:40:18 [Tue Jun  7 20:40:18 2016] main profiler warning: 0 empty blocks in the buffer
2016-06-07_20:40:19 [Tue Jun  7 20:40:19 2016] main profiler warning: 0 empty blocks in the buffer
2016-06-07_20:40:20 [Tue Jun  7 20:40:20 2016] main profiler warning: 0 empty blocks in the buffer
2016-06-07_20:40:21 [Tue Jun  7 20:40:21 2016] main profiler warning: 0 empty blocks in the buffer
2016-06-07_20:40:22 [Tue Jun  7 20:40:22 2016] main profiler warning: 0 empty blocks in the buffer
2016-06-07_20:40:23 [Tue Jun  7 20:40:23 2016] main profiler warning: 0 empty blocks in the buffer

This failure happens when a full frame (1149392384+16) is written to disk at the same time as a second-trend frame (1149391800+600).  It seems like daqd crashes every time this coincidence occurs.

I have seen other stability issues as well, maybe caused by mx flakiness, or by some sort of GPS time synchronization issue due to our lack of IRIG-B cards.  I'm going to see if I can get the GPS issue taken care of so we can take that out of the picture.

For the last couple of hours I've only seen issues with the frame writing every 20 minutes, when the full and second trend frames happen to be written at the same time.  Running overnight to gather more statistics.
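
For reference, the write cadences explain the 20-minute period: full frames span 16 s and second-trend frames span 600 s (the GPS stamps above step by 16 and 600 respectively), so the two writes land on the same GPS second every lcm(16, 600) = 1200 s.  A minimal shell sketch of that arithmetic:

# How often do full-frame (16 s) and second-trend (600 s) writes coincide?
full=16; trend=600
gcd() { local a=$1 b=$2 t; while [ "$b" -ne 0 ]; do t=$b; b=$((a % b)); a=$t; done; echo "$a"; }
lcm=$(( full * trend / $(gcd $full $trend) ))
echo "writes coincide every ${lcm} s ($(( lcm / 60 )) min)"   # -> 1200 s = 20 min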

  11868   Wed Dec 9 19:01:45 2015   jamie   Update   CDS   back to fb1

I spent this afternoon trying to debug fb1, with very little to show for it.  We're back to running from fb.

The first thing I did was to recompile EPICS from source, so that all the libraries needed by daqd were compiled for the system at hand.  I compiled epics-3.14-12-2_long from source and installed it at /opt/rtapps/epics on the local disk, not on the /opt/rtapps network mount.  I then recompiled daqd against that, along with framecpp, gds, etc. from the LSCSoft packages.  So everything has been compiled for this version of the OS.  The compilation goes smoothly.

There are two things that I see while running this new daqd on fb1:

instability with mx_streams

The mx stream connection between the front ends and daqd is flaky.  Everything will run fine for a while, then spontaneously one or all of the mx_stream processes on the front ends will die.  More often than not, all of the mx_stream processes die at the same time.  It's unclear if this is some sort of chain reaction, or if something in daqd or in the network itself is causing them all to die simultaneously.  It is independent of whether we're using multiple mx "end points" (i.e. a different one for each front end and separate receiver threads in daqd) or just a single one (all front ends connecting to a single mx receiver thread in daqd).
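
As a quick way to see which front ends have lost their mx_stream at any given moment, something like the following could be run from the frame builder (a sketch; it assumes passwordless ssh to the front-end hostnames that show up in these logs):

# Check whether mx_stream is running on each front end
for fe in c1sus c1lsc c1ioo c1iscex c1iscey; do
    printf '%-10s ' "$fe"
    ssh "$fe" 'pgrep -x mx_stream >/dev/null && echo running || echo NOT running'
done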

Frequently daqd will recover from this: the monit processes on the front ends restart the mx_stream processes and everything recovers.  However, occasionally, possibly when the mx_streams do not recover fast enough (which seems to be related to how quickly the receiver threads in daqd can clear themselves), daqd will start to choke and will begin spitting out the "empty blocks" messages that are the harbinger of doom:

Aborted 2 send requests due to remote peer 00:30:48:be:11:5d (c1iscex:0) disconnected
00:30:48:d6:11:17 (c1iscey:0) disconnected
mx_wait failed in rcvr eid=005, reqn=182; wait did not complete; status code is Remote endpoint is closed
disconnected from the sender on endpoint 005
mx_wait failed in rcvr eid=001, reqn=24; wait did not complete; status code is Remote endpoint is closed
disconnected from the sender on endpoint 001
[Wed Dec  9 18:40:14 2015] main profiler warning: 1 empty blocks in the buffer
[Wed Dec  9 18:40:15 2015] main profiler warning: 0 empty blocks in the buffer
[Wed Dec  9 18:40:16 2015] main profiler warning: 0 empty blocks in the buffer

My suspicion is that this type of failure is tied to the mx stream failures, so we should be looking at the mx connections and the network to solve this problem.

frame writing troubles

There's possibly a separate issue associated with writing the second or minute trend files to disk.  With fair regularity daqd will die soon after it starts to write out the trend frames, producing similar "empty blocks" messages.

  11664   Sun Oct 4 14:28:03 2015   jamie   Update   DAQ   more failed attempts at getting new fb working

I tried to look at fb1 again today, but still haven't made any progress.

The one thing I did notice, though, is that every hour on the hour the fb1 daqd process dies in an identical manner to how the fb daqd dies, with these:

[Sun Oct  4 12:02:56 2015] main profiler warning: 0 empty blocks in the buffer

errors right as/after it tries to write out the minute trend frames.

This makes me think that this new hardware isn't actually going to fix the problem we've been seeing with the fb daqd, even if we do get daqd "working" on fb1 as well as it's currently working on fb.

  11655   Thu Oct 1 19:49:52 2015   jamie   Update   DAQ   more failed attempts at getting new fb working

Summary

I've not really been able to make additional progress with the new 'fb1' DAQ.  It's still flaky as hell.  Therefore we're still using old 'fb'.

Issues

mx_stream

The mx_stream processes on the front ends initially run fine, connecting to daqd and transferring data, with both the DAQ-..._STATUS and FE-..._FB_NET_STATUS indicators green.  Then after about two minutes all the mx_stream processes on all the front ends die.  Monit eventually restarts them all, at which point they come up green for a while until they crash again ~2 minutes later.  This is essentially the same situation as reported previously.

In the daqd logs when the mx_streams die:

Aborted 2 send requests due to remote peer 00:30:48:be:11:5d (c1iscex:0) disconnected
Aborted 2 send requests due to remote peer 00:14:4f:40:64:25 (c1ioo:0) disconnected
Aborted 2 send requests due to remote peer 00:30:48:d6:11:17 (c1iscey:0) disconnected
Aborted 2 send requests due to remote peer 00:25:90:0d:75:bb (c1sus:0) disconnected
Aborted 1 send requests due to remote peer 00:30:48:bf:69:4f (c1lsc:0) disconnected
mx_wait failed in rcvr eid=000, reqn=176; wait did not complete; status code is Remote endpoint is closed
disconnected from the sender on endpoint 000
mx_wait failed in rcvr eid=000, reqn=177; wait did not complete; status code is Connectivity is broken between the source and the destination
disconnected from the sender on endpoint 000
mx_wait failed in rcvr eid=000, reqn=178; wait did not complete; status code is Connectivity is broken between the source and the destination
disconnected from the sender on endpoint 000
mx_wait failed in rcvr eid=000, reqn=179; wait did not complete; status code is Connectivity is broken between the source and the destination
disconnected from the sender on endpoint 000
mx_wait failed in rcvr eid=000, reqn=180; wait did not complete; status code is Connectivity is broken between the source and the destination
disconnected from the sender on endpoint 000
[Thu Oct  1 19:00:09 2015] GPS MISS dcu 39 (PEM); dcu_gps=1127786407 gps=1127786425

[Thu Oct  1 19:00:09 2015] GPS MISS dcu 39 (PEM); dcu_gps=1127786408 gps=1127786426

[Thu Oct  1 19:00:09 2015] GPS MISS dcu 39 (PEM); dcu_gps=1127786408 gps=1127786426

In the mx_stream logs:

controls@c1iscey ~ 0$ /opt/rtcds/caltech/c1/target/fb/mx_stream -r 0 -W 0 -w 0 -s 'c1x05 c1scy c1tst' -d fb1:0
mmapped address is 0x7f0df23a6000
mmapped address is 0x7f0dee3a6000
mmapped address is 0x7f0dea3a6000
send len = 263596
Connection Made
isendxxx failed with status Remote Endpoint Unreachable
disconnected from the sender

daqd

While the mx_stream processes are running daqd seems to write out data just fine.  At least for the full frames.  I manually verified that there is indeed data in the frames that are written.

Eventually, though, daqd itself crashes with the same error that we've been seeing:

main profiler warning: 0 empty blocks in the buffer

I'm not exactly sure what the crashes are coincident with, but it looks like they are also coincident with the writing out of the minute and/or second trend files.  It's unclear how it's related to the mx_stream crashes, if at all.  The mx_stream crashes happen every couple of minutes, whereas the daqd itself crashes much less frequently.

The new daqd can't handle EDCU files.  If an EDCU file is specified (e.g. C0EDCU.ini in our case), the daqd will segfault very soon after startup.  This was an issue with the current daqd on fb, but was "fixed" by moving where the EDCU file was specified in the master file.

Conclusion

There are a number of differences between the fb1 and fb configurations:

  • newer OS (Debian 7 vs. ancient gentoo)
  • newer advLigoRTS (trunk vs. 2.9.4)
  • newer framecpp library installed from LSCSoft Debian repo (2.4.1-1+deb7u0 vs. 1.19.32-p1)

It's possible those differences could account for the problems (/opt/rtapps/epics incompatible with this Debian install, for instance).  Somehow I doubt it.  I wonder if all the weird network issues we've been seeing are somehow involved.  If the NFS mount of chiara is problematic for some reason that would affect everything that mounts it, which includes all the front ends and fb/fb1.
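
A quick way to get a feel for whether the chiara NFS mount is misbehaving would be to time a trivial operation on it from fb1 and from a front end, and to look at the client-side NFS retransmission counts (a sketch; the mount point shown is a placeholder, substitute wherever chiara is actually mounted):

# Rough NFS health check: time a metadata operation on the chiara mount
NFS_MOUNT=/opt/rtcds        # placeholder; use the real chiara mount point
time ls "$NFS_MOUNT" > /dev/null
# Client-side NFS statistics (call counts, retransmissions), if nfs-utils is installed
nfsstat -c | head -20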

There are two things to try:

  • Fix the weird network problem.  Try removing EVERYTHING from the network except for chiara, fb/fb1, and the front ends and see if that helps.
  • Rebuild fb1 with Ubuntu and daqd as prescribed by Keith Thorne.

  11412   Tue Jul 14 16:51:01 2015   Jamie   Summary   CDS   CDS upgrade: problem is not disk access

I think I have now determined once and for all that the daqd problems are NOT due to disk IO contention.

I have mounted a tmpfs at /frames/tmp and have told daqd to write frames there.  The tmpfs exists entirely in RAM.  There is essentially zero IO wait for such a filesystem, so daqd should never have trouble writing out the frames.
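
For the record, setting up such a RAM-backed scratch area is a one-liner (a sketch; the 8 GB size is an arbitrary choice here, not necessarily what was used):

# Mount a tmpfs for frames; everything in it is lost on unmount or reboot
sudo mkdir -p /frames/tmp
sudo mount -t tmpfs -o size=8g tmpfs /frames/tmp
df -h /frames/tmp    # confirm the mount and its size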

And yet daqd continues to fail with the "0 empty blocks in the buffer" warnings.  I've been down a rabbit hole.

  11400   Thu Jul 9 16:50:13 2015   Jamie   Summary   CDS   CDS upgrade: if all else fails try throwing metal at the problem

I roped Rolf into coming over and adding his eyes to the problem.  After much discussion we couldn't come up with any reasonable explanation for the problems we've been seeing other than daqd just needing a lot more resources than it did before.  He said he had some old Sun SunFire X4600s from which we could pilfer memory.  I went over to Downs and ripped all the CPU/memory cards out of one of his machines and stuffed them into fb:

fb now has 8 CPUs and 16 GB of RAM

Unfortunately, this is still not enough.  Or at least it didn't solve the problem; daqd is showing the same instabilities, falling over a couple of minutes after I turn on trend frame writing.  As always, before daqd fails it starts spitting out the following to the logs:

[Thu Jul  9 16:37:09 2015] main profiler warning: 0 empty blocks in the buffer

followed by lines like:

[Thu Jul  9 16:37:27 2015] GPS MISS dcu 44 (ASX); dcu_gps=1120520264 gps=1120519812

right before it dies.

I'm no longer convinced that this is a resource issue, though, judging by the resource usage right before the crash:

top - 16:47:32 up 48 min,  5 users,  load average: 0.91, 0.62, 0.61
Tasks:   2 total,   0 running,   2 sleeping,   0 stopped,   0 zombie
Cpu(s):  8.9%us,  0.9%sy,  0.0%ni, 89.1%id,  0.9%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:  15952104k total, 13063468k used,  2888636k free,   138648k buffers
Swap:  1023996k total,        0k used,  1023996k free,  7672292k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12016 controls  20   0 8098m 4.4g 104m S  106 29.1   6:45.79 daqd
 4953 controls  20   0 53580 6092 5096 S    0  0.0   0:00.04 nds

Load average less than 1 per CPU, plenty of free memory (~3 GB free, 0 swap), no waiting for IO (0.9%wa), etc.  daqd is utilizing lots of threads, which should be spread across many CPUs, so even the >100% CPU figure should be OK.  I'm at a loss...
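
One extra sanity check on the memory numbers: Linux counts page cache as "used", so the memory actually available is larger than the "free" column suggests.  A quick back-of-the-envelope from the top snapshot above:

# free + buffers + cached (kB) is roughly what's available/reclaimable
echo $(( (2888636 + 138648 + 7672292) / 1024 )) MB    # ~10448 MB, i.e. ~10 GB effectively available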

  11301   Mon May 18 16:28:18 2015   ericq   Update   General   some status
Quote:

4) Noticed that DAQD is restarting once per hour on the hour. Why?

It looks like daqd isn't being restarted, but is in fact crashing every hour.

Going into the logs in target/fb/logs/old, it looks like at 10 seconds past the hour, every hour, daqd starts spitting out:

[Mon May 18 12:00:10 2015] main profiler warning: 1 empty blocks in the buffer                                     
[Mon May 18 12:00:11 2015] main profiler warning: 0 empty blocks in the buffer                                     
[Mon May 18 12:00:12 2015] main profiler warning: 0 empty blocks in the buffer                                     
[Mon May 18 12:00:13 2015] main profiler warning: 0 empty blocks in the buffer
...
***CRASH***
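
To confirm the pattern, the archived logs can be scanned for the first "empty blocks" warning in each file; the timestamps should all land a few seconds past the top of an hour (a sketch, assuming the full path to the log directory mentioned above):

# Print the first "empty blocks" warning from each old daqd log, with the filename
grep -m1 -H 'empty blocks in the buffer' /opt/rtcds/caltech/c1/target/fb/logs/old/*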

An ELOG search on this kind of phrase will get you a lot of talk about FB transfer problems. 

I noticed the framebuilder had 100% usage on its internal, non-RAID, non /frames/, HDD, which hosts the root filesystem (OS files, home directory, diskless boot files, etc), largely due to a ~110GB directory of frames from our first RF lock that had been copied over to the home directory. The HDD only has 135GB capacity. I thought that maybe this was somehow a bottleneck for files moving around, but after deleting the huge directory, daqd still died at 4PM. 

The offsite LDAS rsync happens at ten minutes past the hour, so is unlikely to be the culprit. I don't have any other clues at this point. 

  10632   Wed Oct 22 21:06:33 2014   Chris   Update   DAQ   40m frames onto the cluster

Quote:

 Dan Kozak is rsync transferring /frames from NODUS over to the LDAS grid. He's doing this without a BW limit, but even so its going to take a couple weeks. If nodus seems pokey or the net connection to the outside world is too tight, then please let me and him know so that he can throttle the pipe a little.

The recently observed daqd flakiness looks related to this transfer. It appears to still be ongoing:

nodus:~>ps -ef | grep rsync
controls 29089   382  5 13:39:20 pts/1   13:55 rsync -a --inplace --delete --exclude lost+found --exclude .*.gwf /frames/trend
controls 29100   382  2 13:39:43 pts/1    9:15 rsync -a --delete --exclude lost+found --exclude .*.gwf /frames/full/10975 131.
controls 29109   382  3 13:39:43 pts/1    9:10 rsync -a --delete --exclude lost+found --exclude .*.gwf /frames/full/10978 131.
controls 29103   382  3 13:39:43 pts/1    9:14 rsync -a --delete --exclude lost+found --exclude .*.gwf /frames/full/10976 131.
controls 29112   382  3 13:39:43 pts/1    9:18 rsync -a --delete --exclude lost+found --exclude .*.gwf /frames/full/10979 131.
controls 29099   382  2 13:39:43 pts/1    9:14 rsync -a --delete --exclude lost+found --exclude .*.gwf /frames/full/10974 131.
controls 29106   382  3 13:39:43 pts/1    9:13 rsync -a --delete --exclude lost+found --exclude .*.gwf /frames/full/10977 131.
controls 29620 29603  0 20:40:48 pts/3    0:00 grep rsync

Diagnosing the problem:

I logged into fb and ran "top".  It said that fb was waiting for disk I/O ~60% of the time (according to the "%wa" number in the header).  There were 8 nfsd (network file server) processes running, with several of them listed in status "D" (waiting for disk).  The daqd logs were ending with errors like the following, suggesting that it couldn't keep up with the flow of data:

[Wed Oct 22 18:58:35 2014] main profiler warning: 1 empty blocks in the buffer
[Wed Oct 22 18:58:36 2014] main profiler warning: 0 empty blocks in the buffer
GPS time jumped from 1098064730 to 1098064731

This all pointed to the possibility that the file transfer load was too heavy.
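
For future reference, the processes stuck waiting on disk can be listed directly, which is quicker than eyeballing top (a sketch):

# List processes currently in uninterruptible disk sleep (state D), plus the header row
ps -eo pid,stat,wchan:30,comm | awk 'NR==1 || $2 ~ /^D/'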

Reducing the load:

The following configuration changes were applied on fb.

Edited /etc/conf.d/nfs to reduce the number of nfsd processes from 8 to 1:

OPTS_RPC_NFSD="1"

(was "8")

Ran "ionice" to raise the priority of the framebuilder process (daqd):

controls@fb /opt/rtcds/rtscore/trunk/src/daqd 0$ sudo ionice -c 1 -p 10964

And to reduce the priority of the nfsd process:

controls@fb /opt/rtcds/rtscore/trunk/src/daqd 0$ sudo ionice -c 2 -p 11198

I also tried punishing nfsd with an even lower priority ("-c 3"), but that was causing the workstations to lag noticeably.
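
For context on the flags used above (my notes, not from the original commands): ionice class 1 is realtime, class 2 is best-effort (the default, with -n 0..7 setting priority within the class), and class 3 is idle, which only gets disk time when nothing else wants it, hence the workstation lag when nfsd was demoted that far.  A sketch of re-applying the settings after a restart, since the PIDs above won't survive one:

# Re-apply I/O priorities after a daqd/nfsd restart; pgrep -o picks the oldest matching PID
sudo ionice -c 1 -p "$(pgrep -o daqd)"        # realtime for the frame builder
sudo ionice -c 2 -n 7 -p "$(pgrep -o nfsd)"   # lowest best-effort for the file server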

After these changes the %wa value went from ~60% to ~20%, and daqd seems to die less often, but some further throttling may still be in order.

  10617   Thu Oct 16 12:22:43 2014   ericq   Update   CDS   Daqd segfaulting again

I've been trying to figure out why daqd keeps crashing, but nothing is fixed yet. 

I commented out the line in /etc/inittab that runs daqd automatically, so I could run it manually.  Each time I run it (with ./daqd -c ./daqdrc from c1/target/fb), it churns along fine for a little while, but eventually spits out something like:

[Thu Oct 16 12:07:23 2014] main profiler warning: 1 empty blocks in the buffer
[Thu Oct 16 12:07:24 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 12:07:25 2014] main profiler warning: 0 empty blocks in the buffer
GPS time jumped from 1097521658 to 1097521660
Segmentation fault
 
Or:
 
[Thu Oct 16 11:43:54 2014] main profiler warning: 1 empty blocks in the buffer
[Thu Oct 16 11:43:55 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:43:56 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:43:57 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:43:58 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:43:59 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:44:00 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:44:01 2014] main profiler warning: 0 empty blocks in the buffer
[Thu Oct 16 11:44:02 2014] main profiler warning: 0 empty blocks in the buffer
GPS time jumped from 1097520250 to 1097520257
FATAL: exception not rethrown
Aborted

I looked for time disagreements between the FB and the frontends, but they all seem fine. Running ntpdate only corrected things by 5ms. However, looking through /var/log/messages on FB, I found that ntp claims to have corrected the FB's time by ~111600 seconds (~31 hours) when I rebooted it on Monday.

Maybe this has something to do with the timing that the FB is getting?  The FE IOPs seem happy with their sync status, but I'm not currently familiar with how the FB timing is set up.


Addendum:

On Monday, Jamie suggested checking out the situation with FB's RAID. Searching the elog for "empty blocks in the buffer" also brought up posts that mentioned problems with the RAID. 

I went to the JetStor RAID web interface at http://192.168.113.119, and it reports everything as healthy; no major errors in the log. Looking at the SMART status of a few of the drives shows nothing out of the ordinary. The RAID is not mounted in read-only mode either, as was the problem mentioned in previous elogs. 
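
A quick way to confirm that from the command line, rather than the RAID web interface, is to check the mount options directly (a sketch; it assumes the RAID is mounted at /frames, as in the other entries):

# Show the mount options for /frames; look for rw vs ro in the options field
grep ' /frames ' /proc/mounts
# Or simply try to create and remove a scratch file on it
touch /frames/.rw_test && rm /frames/.rw_test && echo "read-write OK"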

  8278   Tue Mar 12 12:06:22 2013   Jamie   Update   Computers   FB recovered, RAID power supply #1 dead

The framebuilder RAID is back online.  The disk had been mounted read-only (see below) so daqd couldn't write frames, which was in turn causing it to segfault immediately, so it was constantly restarting.

The JetStor RAID unit itself has a dead power supply.  This is not fatal: it has three supplies precisely so it can continue to function if one fails.  I have removed the bad supply and given it to Steve so he can get a suitable replacement.

Some recovery had to be done on fb to get everything back up and running again.  I ran into issues trying to do it on the fly, so I eventually just rebooted.  It seemed to come back ok, except for something going on with daqd.  It was reporting the following error upon restart:

[Tue Mar 12 11:43:54 2013] main profiler warning: 0 empty blocks in the buffer

It was spitting out this message about once a second, until eventually daqd died.  When it restarted it seemed to come back up fine.  I'm not exactly clear what those messages were about, but I think it has something to do with not being able to dump its data buffers to disk.  I'm guessing that this was a residual problem from the unmounted /frames, which somehow cleared on its own.  Everything seems to be OK now.

Quote:

Manasa just went inside to recenter the AS beam on the camera after our Yarm spot centering exercises of the evening, and heard a loud beeping.  We determined that it is the RAID attached to the framebuilder, which holds all of our frame data that is beeping incessantly.  The top center power switch on the back (there are FOUR power switches, and 3 power cables, btw.  That's a lot) had a red light next to it, so I power cycled the box.  After the box came back up, it started beeping again, with the same front panel message:

H/W monitor power #1 failed.

DO NOT DO THIS.  This is what caused all the problems.  The unit has three redundant power supplies for just this reason.  It was probably continuing to function fine.  The beeping was just to tell you that there was something that needed attention.  Rebooting the device does nothing to solve the problem.  Rebooting in an attempt to silence beeping is not a solution.  Shutting off the RAID unit is basically the equivalent of ripping out a mounted external USB drive.  You can damage the filesystem that way.  The disk was still functioning properly.  As far as I understand it, the only problem was the beeping, and there were no other issues.  After you hard-rebooted the device, fb lost its mounted disk and then went into emergency mode, which was to remount the disk read-only.  It didn't understand what was going on, only that the disk seemed to disappear and then reappear.  This was then what caused the problems.  It was not the beeping; it was the restarting of the RAID that was mounted on fb.

Computers are not like regular pieces of hardware.  You can't just yank the power on them.  Worse yet is yanking the power on a device that is connected to a computer.  DON'T DO THIS UNLESS YOU KNOW WHAT YOU'RE DOING.  If the device is a disk drive, then doing this is a sure-fire way to damage data on disk.

 

  6555   Sat Apr 21 20:40:28 2012   Jamie   Update   CDS   framebuilder frame writing working again

Quote:

A big one is that the framebuilder daqd started going haywire when I told it to start writing frames.  After restart the logs started showing this:

[Fri Apr 20 17:23:40 2012] main profiler warning: 0 empty blocks in the buffer
[Fri Apr 20 17:23:41 2012] main profiler warning: 0 empty blocks in the buffer
[Fri Apr 20 17:23:42 2012] main profiler warning: 0 empty blocks in the buffer
[Fri Apr 20 17:23:43 2012] main profiler warning: 0 empty blocks in the buffer
[Fri Apr 20 17:23:44 2012] main profiler warning: 0 empty blocks in the buffer
[Fri Apr 20 17:23:45 2012] main profiler warning: 0 empty blocks in the buffer
GPS time jumped from 1019002442 to 1019003041
FATAL: exception not rethrown
FATAL: exception not rethrown
FATAL: exception not rethrown

and the network seemed like it started to get really slow.  I wasn't able to figure out what was going on, so I shut the frame writing off again.  I'll have to work with Rolf on that next week.

So the framebuilder/daqd frame writing issue seems to have been a transient one.  Alex said he tried enabling frame writing manually and it worked fine, so I tried re-enabling the frame writing config lines and sure enough it worked fine.  So it's a mystery.  Alex said the "main profiler warning" lines tend to show up when data is backed up in the buffer, although he didn't explain why exactly that would have been the issue here.

daqdrc frame writing directives:

start frame-saver;
sync frame-saver;
start trender;
start trend-frame-saver;
sync trend-frame-saver;
start minute-trend-frame-saver;
sync minute-trend-frame-saver;
start raw_minute_trend_saver;

  6552   Fri Apr 20 19:54:57 2012   Jamie   Update   CDS   CDS upgrade problems

I ran into a couple of snags today.

A big one is that the framebuilder daqd started going haywire when I told it to start writing frames.  After restart the logs started showing this:

[Fri Apr 20 17:23:40 2012] main profiler warning: 0 empty blocks in the buffer
[Fri Apr 20 17:23:41 2012] main profiler warning: 0 empty blocks in the buffer
[Fri Apr 20 17:23:42 2012] main profiler warning: 0 empty blocks in the buffer
[Fri Apr 20 17:23:43 2012] main profiler warning: 0 empty blocks in the buffer
[Fri Apr 20 17:23:44 2012] main profiler warning: 0 empty blocks in the buffer
[Fri Apr 20 17:23:45 2012] main profiler warning: 0 empty blocks in the buffer
GPS time jumped from 1019002442 to 1019003041
FATAL: exception not rethrown
FATAL: exception not rethrown
FATAL: exception not rethrown

and the network seemed like it started to get really slow.  I wasn't able to figure out what was going on, so I shut the frame writing off again.  I'll have to work with Rolf on that next week.

Another big problem is the workstation application upgrades.  The NDS protocol version has been incremented, which means that all the NDS client applications have to be upgraded.  The new dataviewer is working fine (on pianosa), but dtt is not:

controls@pianosa:~ 0$ diaggui
diaggui: symbol lookup error: /ligo/apps/linux-x86_64/gds-2.15.1/lib/libligogui.so.0: undefined symbol: _ZN18TGScrollBarElement11ShowMembersER16TMemberInspector
controls@pianosa:~ 127$ 

I don't know what's going on here.  All the library paths are ok.  Hopefully I'll be able to figure this out soon.  The old version of dtt definitely does not work with the new setup.
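
A couple of quick checks might narrow this down: the missing symbol is a mangled ROOT method name, so demangling it and seeing which ROOT libraries libligogui actually resolves at runtime would show whether gds was built against a different ROOT than the one being loaded (a sketch; the library path is taken from the error above):

# Demangle the missing symbol; it is TGScrollBarElement::ShowMembers(TMemberInspector&), a ROOT GUI method
echo '_ZN18TGScrollBarElement11ShowMembersER16TMemberInspector' | c++filt
# See which ROOT libraries (if any) fail to resolve for libligogui
ldd /ligo/apps/linux-x86_64/gds-2.15.1/lib/libligogui.so.0 | grep -i -e root -e 'not found'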

I might go ahead and upgrade some more of the workstations to Ubuntu in the next couple of days as well, so everything is more on the same page.

I also tried to clean up the front-end boot process, which has its own problems (models won't auto-start).  I haven't figured that out yet either.  It really needs to be completely overhauled.

  5896   Tue Nov 15 15:56:23 2011   jamie   Update   CDS   dataviewer doesn't run

Quote:

Dataviewer is not able to access to fb somehow.

I restarted daqd on fb but it didn't help.

Also the status screen is showing a blank white form in all the realtime models. Something bad is happening.

 So something very strange was happening to the framebuilder (fb).  I logged on to fb and found this being spewed to the logs once a second:

[Tue Nov 15 15:28:51 2011] going down on signal 11
sh: /bin/gcore: No such file or directory
[Tue Nov 15 15:28:51 2011] going down on signal 11
sh: /bin/gcore: No such file or directory
[Tue Nov 15 15:28:51 2011] going down on signal 11
sh: /bin/gcore: No such file or directory
[Tue Nov 15 15:28:51 2011] going down on signal 11
sh: /bin/gcore: No such file or directory
[Tue Nov 15 15:28:51 2011] going down on signal 11
sh: /bin/gcore: No such file or directory
[Tue Nov 15 15:28:51 2011] going down on signal 11
sh: /bin/gcore: No such file or directory
[Tue Nov 15 15:28:51 2011] going down on signal 11
sh: /bin/gcore: No such file or directory
[Tue Nov 15 15:28:51 2011] going down on signal 11
sh: /bin/gcore: No such file or directory
[Tue Nov 15 15:28:51 2011] going down on signal 11
sh: /bin/gcore: No such file or directory

Apparently some daqd subprocess or thread was trying to call /bin/gcore, and failing since that file doesn't exist.  This apparently started at around 5:52 AM last night:

[Tue Nov 15 05:46:52 2011] main profiler warning: 1 empty blocks in the buffer
[Tue Nov 15 05:46:53 2011] main profiler warning: 0 empty blocks in the buffer
[Tue Nov 15 05:46:54 2011] main profiler warning: 0 empty blocks in the buffer
[Tue Nov 15 05:46:55 2011] main profiler warning: 0 empty blocks in the buffer
[Tue Nov 15 05:46:56 2011] main profiler warning: 0 empty blocks in the buffer
...
[Tue Nov 15 05:52:43 2011] main profiler warning: 0 empty blocks in the buffer
[Tue Nov 15 05:52:44 2011] main profiler warning: 0 empty blocks in the buffer
[Tue Nov 15 05:52:45 2011] main profiler warning: 0 empty blocks in the buffer
GPS time jumped from 1005400026 to 1005400379
[Tue Nov 15 05:52:46 2011] going down on signal 11
sh: /bin/gcore: No such file or directory
[Tue Nov 15 05:52:46 2011] going down on signal 11
sh: /bin/gcore: No such file or directory

The gcore it's looking for is, I believe, a debugging tool that can retrieve images of running processes.  I'm guessing that something caused something in the fb to eat crap, and it was stuck trying to debug itself.  I can't tell what exactly happened, though.  I'll ping the CDS guys about it.  The daqd process was continuing to run, but it was not responding to anything, which is why it could not be restarted via the normal means, and maybe why the various FB0_*_STATUS channels were seemingly dead.
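
For what it's worth, gcore ships with the gdb package, so a quick check of whether it's installed (and where) would tell us if daqd could at least get the core dumps it's trying to take (a sketch; the symlink step is hypothetical and should only be done if daqd really does hard-code the /bin/gcore path):

# gcore is provided by gdb; see whether it exists and where it lives
command -v gcore || echo "gcore not found -- install gdb to get it"
# If daqd hard-codes /bin/gcore (as the log suggests), a symlink would satisfy it:
# sudo ln -s "$(command -v gcore)" /bin/gcore    # hypothetical; verify the path daqd calls first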

I manually killed the daqd process, and monit seemed to bring up a new process with no problem.  I'll keep an eye on it.
