Problem:
The 40m computers were responding sluggishly yesterday, to the point of being unusable.
Cause:
The mx_stream code running on c1iscex (the X end suspension control computer) began writing continuously to a log file, /cvs/cds/rtcds/caltech/c1/target/fb/192.168.113.80.log, for reasons that were not immediately clear. Over the past 24 hours this file had grown to approximately 1 TB. The computer had been turned back on yesterday after its IO chassis was reconnected. The chassis had been moved around last week for testing purposes, specifically so that the c1ioo IO chassis could be plugged into c1iscex to confirm it had timing problems.
Current Status:
The mx_stream code was killed on c1iscex and the 1 TB file was removed.
Computers are now more usable.
We still need to investigate exactly what caused the code to start writing to the log file non-stop.
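To catch a runaway log like this one earlier, a periodic size check on the log directory could help. The following is only a minimal sketch, not something currently running at the 40m: the function name find_large_files and the size limit are hypothetical choices for illustration.

```python
import os


def find_large_files(directory, limit_bytes):
    """Return (name, size) for regular files in `directory` larger than `limit_bytes`.

    Hypothetical helper: results are sorted largest first. Subdirectories
    are not searched and symlinks are not followed.
    """
    large = []
    for entry in os.scandir(directory):
        if entry.is_file(follow_symlinks=False):
            size = entry.stat(follow_symlinks=False).st_size
            if size > limit_bytes:
                large.append((entry.name, size))
    return sorted(large, key=lambda item: item[1], reverse=True)
```

For example, find_large_files("/cvs/cds/rtcds/caltech/c1/target/fb", 10 * 1024**3) would list any file in that directory above 10 GiB. The 10 GiB limit is an arbitrary assumption and should be set to whatever is normal for these logs.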
Update Edit:
Alex believes this was due to a missing entry in the /diskless/root/etc/hosts file on the fb machine: it did not list the IP address and hostname for the c1iscex machine. I have now added it. c1iscex had already been added to the /etc/dhcp/dhcpd.conf file on fb, which is why it was able to boot at all. Because Alex added automatic startup of mx_streams in the past week, the code started on boot, but without the correct IP address in the hosts file it was confused about where it was running and wrote errors constantly.
Future:
When adding a new FE machine, add its IP address and its hostname to the /diskless/root/etc/hosts file on the fb machine.
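A quick way to verify this step is to check the hosts file for the expected hostnames before booting the new machine. The sketch below assumes the standard hosts file layout (IP address followed by one or more names, with "#" starting a comment). The function names parse_hosts and missing_hosts are hypothetical, and the list of front-end names to check has to be supplied by hand.

```python
def parse_hosts(text):
    """Map each hostname or alias in hosts-file text to its IP address."""
    entries = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        for name in names:
            entries[name] = ip
    return entries


def missing_hosts(hosts_text, required):
    """Return the names in `required` that have no entry in the hosts text."""
    entries = parse_hosts(hosts_text)
    return [name for name in required if name not in entries]
```

On fb this could be used as missing_hosts(open("/diskless/root/etc/hosts").read(), ["c1iscex", "c1ioo"]). An empty list means every listed machine has an entry. It only checks that a name is present, not that the IP address matches the one in /etc/dhcp/dhcpd.conf.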