All front ends and models are (mostly) running now.

All suspensions are damped:

It should be possible at this point to do more recovery, like locking the MC.
Some details on the restore process:
- all models were recompiled with the new RCG version 3.0.3
- the new RCG does stricter Simulink drawing checks, and was complaining about unterminated outputs in some of the SUS models. Terminated all the outputs it flagged and saved the models.
- RCG 3.0 requires a new directory for doing better filter module diagnostics: /opt/rtcds/caltech/c1/chans/tmp
- had to reset the slow machines c1susaux, c1auxex, c1auxey
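For reference, the directory step above can be sketched as a small idempotent helper. This is a minimal sketch, not the exact commands that were run: the function name ensure_rcg_tmp_dir is hypothetical, and the required ownership and permissions of the directory are not covered here and would need to be checked on the real system.

```shell
#!/bin/sh
# Sketch only: make sure the scratch directory RCG 3.0 expects is present.
# The default path is the one quoted in this entry; the function name is
# hypothetical, and ownership/permissions are deliberately left untouched.
ensure_rcg_tmp_dir() {
    dir="${1:-/opt/rtcds/caltech/c1/chans/tmp}"
    if [ -d "$dir" ]; then
        echo "exists: $dir"
    else
        # -p also creates missing parents and is harmless on a re-run
        mkdir -p "$dir" && echo "created: $dir"
    fi
}
```

Calling ensure_rcg_tmp_dir with no argument uses the path from this entry; passing a path lets it be tried in a scratch location first.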
The daqd is not yet running. This is the next task.
I have been taking copious notes and will fully document the restore process once complete.
c1ioo issues
c1ioo has been giving us a little bit of trouble. The c1ioo model kept crashing and taking down the whole c1ioo host. We found a red light on one of the ADCs (ADC1). We pulled the card and replaced it with a spare from the CDS cabinet. That seemed to fix the problem and c1ioo became more stable.
We've still been seeing a lot of glitching in c1ioo, though, with CPU cycle times frequently (every couple of seconds) running above threshold for all models, up to 200 us. I tried unloading every kernel module I could and shutting down every non-critical process, but nothing seemed to help.
We eventually tried stopping the c1ioo model altogether and that seemed to help quite a bit, dropping the long cycle rate down to something like one every 30 seconds or so. Not sure what that means. We should look into the BIOS again, to see if there could be something interacting with the newer kernel.
So currently the c1ioo model is not running (which is why it's all white in the CDS overview snapshot above). The fact that c1ioo is not running and the remaining models are still occasionally glitching is also causing various IPC errors on auxiliary models (see c1mcs, c1rfm, c1ass, c1asx).
RCG compile warnings
The new RCG tries to do more checks on custom C code, but it seems to be having trouble finding our custom "ccodeio.h" files that live with the C definitions in USERAPPS/*/common/src/. It is not yet clear why. This is causing the RCG to spit out warnings like the following:
Cannot verify the number of ins/outs for C function BLRMS.
File is /opt/rtcds/userapps/release/cds/c1/src/BLRMSFILTER.c
Please add file and function to CDS_SRC or CDS_IFO_SRC ccodeio.h file.
These are just warnings and will not prevent the model from compiling or running. We'll figure out what the problem is to make these go away, but they can be ignored for the time being.
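To keep track of which C functions trigger this warning while the cause is being worked out, a saved build log could be summarized along these lines. This is only a sketch: it assumes the warning is always printed in the three-line form quoted above, starting at the beginning of a line, and the function name summarize_ccode_warnings is hypothetical.

```shell
#!/bin/sh
# Sketch only: list "function<TAB>source file" for each ccodeio.h warning.
# Assumes the exact wording quoted in this entry; reads the named log
# files, or standard input if none are given.
summarize_ccode_warnings() {
    awk '
        /^Cannot verify the number of ins\/outs for C function / {
            fn = $NF            # last word is the function name
            sub(/\.$/, "", fn)  # drop the trailing period
            next
        }
        /^File is / && fn != "" {
            print fn "\t" $3    # third word is the source path
            fn = ""
        }
    ' "$@"
}
```

Piping the output through sort -u gives one line per offending function across several model builds.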
model unload instability
Probably the worst problem we're facing right now is an instability that will occasionally, but not always, cause the entire front end host to freeze up upon unloading an RTS kernel module. This is a known issue with the newer Linux kernels (we're using kernel version 3.2.35), and is being looked into.
This is particularly annoying with the machines on the dolphin network, since if one of the dolphin hosts goes down it manages to crash all the models reading from the dolphin network. Since half the time they can't be cleanly restarted, this tends to cause a boot fest with c1sus, c1lsc, and c1ioo. If this happens, just restart those machines, wait till they've all fully booted, then restart all the models on all hosts with "rtcds start all".
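The wait-then-restart part of that recovery could be scripted roughly as follows. This is a sketch under assumptions, not a tested procedure: it treats "accepts a non-interactive ssh login" as a stand-in for "fully booted", assumes passwordless ssh to the front ends, and the helper names (host_is_up, wait_for_hosts, start_all_models) and the POLL_SECONDS and REMOTE_SHELL variables are hypothetical. Only the "rtcds start all" command comes from this entry, and the reboot itself is still done by hand.

```shell
#!/bin/sh
# Sketch only: wait for rebooted front ends, then restart models everywhere.

# Assumed readiness probe: the host accepts a non-interactive ssh login.
host_is_up() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true >/dev/null 2>&1
}

# Block until every host named in the arguments passes the probe.
wait_for_hosts() {
    for h in "$@"; do
        until host_is_up "$h"; do
            sleep "${POLL_SECONDS:-10}"
        done
        echo "$h is up"
    done
}

# Run "rtcds start all" on each host; REMOTE_SHELL defaults to ssh and
# can be set to echo for a dry run.
start_all_models() {
    for h in "$@"; do
        ${REMOTE_SHELL:-ssh} "$h" rtcds start all
    done
}
```

A typical call would be wait_for_hosts followed by start_all_models, each given the full list of front end hosts, so that no models are started until every rebooted machine is back.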