Rolf and Alex came back over with a replacement machine for c1sus. We removed the old machine, pulled its timing, dolphin, and PCIe extension cards, and put them in the new machine. We then installed the new machine and booted it, and it came up fine. The BIOS in this machine is slightly different, and it doesn't have the failure-to-boot-with-no-COM issue that the previous one had. The COM ports are turned off on this machine (as is the USB interface).
Unfortunately the problem we were experiencing with the old machine, that unloading certain models was causing others to twitch and that dolphin IPC writes were being dropped, is still there. So the problem doesn't seem to have anything to do with hardware settings...
After some playing, Rolf and Alex determined that for some reason the c1rfm model comes up in a strange state when it is started during boot. It runs faster, but the IPC errors are there. If instead all models are stopped, the c1rfm model is started first, and then the rest of the models are started, the c1rfm model runs ok. They don't have an explanation for this, and I'm not sure how we can work around it other than knowing the problem is there and doing manual restarts after boot. I'll try to think of something more robust.
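For reference, the manual restart order described above (stop everything, start c1rfm, then start the rest) can be sketched as a small helper that only builds the ordered command list. This is a rough sketch, not a tested procedure: the "kill<model>" and "start<model>" command names are assumptions standing in for whatever stop/start scripts we actually use, and the model list in the example is illustrative only.

```python
def restart_plan(models, first="c1rfm",
                 stop_cmd="kill{}", start_cmd="start{}"):
    """Return the ordered commands for the manual restart workaround.

    stop_cmd and start_cmd are hypothetical command templates; replace
    them with the real stop/start scripts for the front-end models.
    """
    if first not in models:
        raise ValueError(f"{first} is not in the model list")

    # 1. Stop every model, in the order given.
    plan = [stop_cmd.format(m) for m in models]

    # 2. Start the model that has to come up first (c1rfm here).
    plan.append(start_cmd.format(first))

    # 3. Start the remaining models, preserving their original order.
    plan.extend(start_cmd.format(m) for m in models if m != first)
    return plan


if __name__ == "__main__":
    # Illustrative model list only, not the real set running on c1sus.
    for cmd in restart_plan(["c1x", "c1rfm", "c1y"]):
        print(cmd)
```

The function deliberately does not run anything; it just prints the order, so the list can be checked by eye before any commands are run by hand or wrapped in a post-boot script.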
A better "fix" to the problems is to clean up all of our IPC routing, a bunch of which we're currently doing very inefficiently. We're routing things through c1rfm that don't need to go through it, which is introducing delays. In particular, things that can communicate directly over RFM or dolphin should just do so. We should also figure out if we can put the c1oaf and c1pem models on the same machine, so that they can communicate directly over shared memory (SHMEM). That should cut down on overhead quite a bit. I'll start to look at a plan to do that.