ID |
Date |
Author |
Type |
Category |
Subject |
352
|
Mon Mar 3 13:58:10 2008 |
steve | Update | Computers | RFM Network are down |
The CODAQ_RFNETWORK are down, except C1SUSVME & AWG |
353
|
Mon Mar 3 19:34:40 2008 |
rana | Update | Computers | RFM Network are down |
Quote: | The CODAQ_RFNETWORK are down, except C1SUSVME & AWG |
All of the FE machines were found to be down this afternoon. I called Alex and he suggested several
things which didn't work (restart EPICS tasks, power cycle RFM switch, etc.).
Then he suggested that I go around and power cycle every crate!!! And that sometimes the order of this
matters!!! I think he was just recording this conversation so that he could have a laugh by playing it on Youtube.
However, power cycling all of the FE crates seemed to work. Alex's theory is that 'something goes bad in the
RFM cards sometimes'.
It's all green now. |
354
|
Tue Mar 4 00:42:51 2008 |
rana | Update | Computers | FB0 still down ? |
The framebuilder is still down. I tried restarting the daqd task and resetting the RFM
switch like it says in the Wiki but it still doesn't work right. The computer itself is
running (I can ssh to it) and the daqd process is running but there's a red light for
it on the RFM screen and dataviewer won't connect to it.
If Alex isn't over by ~10 AM, we should call him and ask for help. |
355
|
Tue Mar 4 10:08:21 2008 |
rob | Update | Computers | green lights unreliable when c0daqctrl down |
So far I've tried powering off the framebuilder, power-cycling the RAID (it was showing an error message about bad IDE channel #4), and rebooting the LSC (just for fun). When I reset the LSC, its green light on the RFM_NETWORK screen did not turn red, making all these lights suspect. The iscepics40m process is what controls these red/green lights, so maybe it's gone wonky. It appears to be running however, on c1dcuepics, and it also seems to be functioning correctly in other ways (it's communicating correctly with the LSC).
Update: Alex and Jay came by. The solution was to reset the c0daqctrl processor, which apparently was not done in Rana's rebooting spree. Or maybe it needed to be done last. |
358
|
Tue Mar 4 23:22:32 2008 |
rob | DAQ | Computers | c1susvme1&2 rebooted |
I found that some channels from c1susvme1 and c1susvme2 were not being recorded by the DAQ (and were not showing up in DV). I rebooted these processors, which fixed the problem. If you see other cases of this (signal exactly zero, but not a testpoint problem), just reboot the corresponding processor. |
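The "signal exactly zero" symptom can be checked mechanically once a stretch of samples is in hand. A minimal sketch, assuming the samples have already been fetched into a plain Python sequence (the fetching itself is not shown, and `is_dead_channel` is a hypothetical helper name, not part of any DAQ tool):

```python
def is_dead_channel(samples):
    """Return True if every sample is exactly zero.

    A channel that is being recorded usually shows at least some noise,
    so a long run of exact zeros hints that no data is arriving at all.
    """
    samples = list(samples)
    # An empty record tells us nothing, so do not flag it.
    if not samples:
        return False
    return all(s == 0 for s in samples)


if __name__ == "__main__":
    print(is_dead_channel([0.0, 0.0, 0.0]))     # every sample exactly zero
    print(is_dead_channel([0.0, 1e-9, -2e-9]))  # tiny but nonzero samples
```

This only flags the symptom; it cannot tell a dead processor from a testpoint problem, so that distinction still has to be made by hand.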
364
|
Fri Mar 7 17:10:01 2008 |
Max Jones | Update | Computers | Noise Budget work |
Noise budget has been moved to the svn system. A checked out copy is in the directory caltech. From now on, I will try to use the work cycle as outlined in the svn manual. Changes made today include the following:
getNoiseBudget
/matlab/noise/NoiseBudget
Details of the modifications made may be found on the svn system. Please let me know if anyone has a suggestion or concern. Thank you - Max. |
382
|
Fri Mar 14 16:56:03 2008 |
Dmass | Bureaucracy | Computers | New 40m control machine. |
I priced out a new control machine from Dell and had Steve buy it.
GigE cards (jumbo packet capable) will be coming separately.
Specs:
Quad core (2+GHz)
4 Gigs @ 800MHz RAM
24" LCD
low end video card (Nvidia 8300 - analog + digital output for dual head config)
No floppy drive on this one (yet?) |
399
|
Mon Mar 24 20:15:03 2008 |
John | Summary | Computers | c1susvme2 |
c1susvme2 isn't behaving itself. It keeps getting out of sync and/or giving a red status light.
After going through the usual restart procedures a few times (unsuccessfully) we power cycled the c1susvme & c1sosvme crates. We think everything came back okay.
We still can't get the status and CRC (cyclic redundancy check) to return to normal on c1susvme2. If Alex is around tomorrow please ask him to take a look. |
400
|
Tue Mar 25 10:44:24 2008 |
rob | Update | Computers | c1susvme2 |
Quote: | c1susvme2 isn't behaving itself. It keeps getting out of sync and/or giving a red status light.
After going through the usual restart procedures a few times (unsuccessfully) we power cycled the c1susvme & c1sosvme crates. We think everything came back okay.
We still can't get the status and CRC (cyclic redundancy check) to return to normal on c1susvme2. If Alex is around tomorrow please ask him to take a look. |
I rebooted it again this morning. The ASS machine is currently not running its process, for whatever reason (did someone turn it off?). Let's leave it like this for a day and see how c1susvme2 does. The other recent change is Steve's install of a cooling fan--maybe that's causing the problem. |
401
|
Tue Mar 25 13:21:25 2008 |
Andrey | Update | Computers | c1susvme2 is not behaving itself again |
|
403
|
Tue Mar 25 16:34:47 2008 |
rob | Update | Computers | c1susvme2 |
Quote: |
Quote: | c1susvme2 isn't behaving itself. It keeps getting out of sync and/or giving a red status light.
After going through the usual restart procedures a few times (unsuccessfully) we power cycled the c1susvme & c1sosvme crates. We think everything came back okay.
We still can't get the status and CRC (cyclic redundancy check) to return to normal on c1susvme2. If Alex is around tomorrow please ask him to take a look. |
I rebooted it again this morning. The ASS machine is currently not running its process, for whatever reason (someone turn it off?). Let's leave it like this for a day and see how the c1susvme2 does. The other recent change is Steve's install of a cooling fan--maybe that's causing the problem. |
Now c1susvme1 is joining the action. Since leaving the ASS off doesn't change anything, we can probably absolve it of blame. I now suspect the 4-pin LEMO cables going from the CLK DRIVER modules to the clock fanout modules. These cables are being squeezed/shaken by Steve's new fan setup, and may have been the culprit all along. John will do some testing to see if they are indeed the problem. |
405
|
Wed Mar 26 22:26:15 2008 |
John | Update | Computers | c1susvme |
I removed the fan and tweaked the timing cables to see if they were the source of our problems. I saw no effect. I'm leaving the fan off for the moment to see if that helps. It is on top of the filing cabinet next to my desk. |
406
|
Fri Mar 28 16:18:18 2008 |
rob | Update | Computers | c1susvme2 status |
c1susvme2 is getting worse and worse. it won't run for more than ~45 minutes without fatally de-syncing. for now I've turned off c1iovme (which sends the MCL signal) to see if that's causing the problem. next I'll swap the boards for c1susvme1 and c1susvme2 to see if it's the cpu (or maybe the RFM card) itself, rather than the timing/pentek systems. |
408
|
Mon Mar 31 14:14:16 2008 |
rob | Update | Computers | c1susvme2 status |
Quote: | c1susvme2 is getting worse and worse. it won't run for more than ~45 minutes without fatally de-syncing. for now I've turned off c1iovme (which sends the MCL signal) to see if that's causing the problem. next I'll swap the boards for c1susvme1 and c1susvme2 to see if it's the cpu (or maybe the RFM card) itself, rather than the timing/pentek systems. |
I swapped the processors for c1susvme1 and c1susvme2. So for now, to startup, you should ssh into c1susvme1 and run the startup.cmd for c1susvme2, and vice versa. |
412
|
Thu Apr 3 18:46:04 2008 |
Andrey | Configuration | Computers | "Network switch board" and "c1pem1 crate" were touched |
While working with the weather station, I did two things that could (with very small probability) disturb other processors/computers.
I did the following on Wednesday, April 2nd, between 1 PM and 3 PM.
(1) To reboot processor 'c1pem1', I turned the key switch on the rack holding 'c1pem1' off for several seconds and then back to its initial position. I repeated this off/on cycle several times.
(2) I gently pulled the whole "Network-Switch Board" towards me in order to replace an ethernet Cat 5 cable going into the board from the processor 'c1pem1'. Some connections of other ethernet cables might be flimsy, so other people at the 40m might see problems with computers other than 'c1pem1'. It should not happen, but if any other computer in our lab behaves strangely, people should check the connectors on the network-switch board. It is located near the middle of the Y-arm. See picture. |
Attachment 1: Computer_Rack.JPG
|
|
416
|
Mon Apr 7 16:42:56 2008 |
rana | Update | Computers | eLog intermittent |
Phil Ehrens restarted Dziban again this morning. Looks like it's still crashing each Monday around 8 AM.
Here is the latest suspect:
http://open.itworld.com/5040/reboot-unix-nlsunix-071225/page_1.html |
419
|
Tue Apr 15 18:44:25 2008 |
rana | Configuration | Computers | Rosalba |
There is a new computer in the control room -- it's called Rosalba,
in keeping with our naming convention. It's a quad-core machine that
Dmass found for cheap somewhere; we've installed the CentOS on it
that Alex recommended.
It's a 64-bit Linux and so that's going to cause some problems. Alex has done this
before and so we have some confidence that we can get our regular tools (DTT, Dataviewer)
to run on it.
I have made a new apps tree for all of our future 64-bit Linux machines. So far, there is
a 64-bit firefox and a 64-bit matlab in there. As we start using this machine some more, we
will be forced to install more 64-bit Linux stuff.
We also didn't have enough network cables to run to both linux3 and rosalba. Andrey has decided that we
should not ditch linux3 and so he will run another cable for it tomorrow. |
421
|
Wed Apr 16 10:20:01 2008 |
Andrey | Update | Computers | Rosalba and linux3 |
Quote: | There is a new computer in the control room -- its called Rosalba,
in keeping with our naming convention. Its a quad-core machine that
Dmass found for cheap somewhere; we've installed the CentOS on it
that Alex recommended.
Its a 64-bit Linux and so that may cause some problems. Alex has done this
before and so we have some confidence that we can get our regular tools (DTT, Dataviewer)
to run on it.
I have made a new apps tree for all of our future 64-bit Linux machines. So far, there is
a 64-bit firefox and a 64-bit matlab in there. As we start using this machine some more, we
will be forced to install more 64-bit Linux stuff.
We also didn't have enough network cables to run to both linux3 and rosalba. Andrey has decided that we
should not ditch linux3 and so he will run another cable for it tomorrow. |
The ethernet cable for linux3 was installed on Wednesday morning. Now linux3 has Internet connection again. |
444
|
Thu Apr 24 22:06:47 2008 |
Andrey | Summary | Computers | Ethernet Cables and Hubs |
This morning (between 8:30 AM and noon) Joe and I worked on figuring out which ethernet cables connect the processors controlling equipment in the interferometer room to the Internet hub in the computer room.
First, we unplugged the blue ethernet cables from the router located near ETMX several times, trying to work out which port on the hub serves that processor.
Second, we worked on reviving the connection to the computer that controls the vacuum in the interferometer.
Later in the day (around 2 PM) Joe continued working on the ethernet cables without me. We plan to continue the cable work on Friday morning. A better and more detailed elog will appear then. |
447
|
Fri Apr 25 11:33:40 2008 |
Andrey | Configuration | Computers | Computer controlling vacuum equipment |
The old computer (located at the south end of the interferometer room) that was almost unable to fulfill its duty of controlling the vacuum equipment has been replaced by "Linux-3". MEDM runs on "Linux-3".
Later that day we checked together with Steve Vass that the vacuum equipment (such as the vacuum valves) can indeed be controlled from the MEDM screen 'VacControl.adl'.
The unused flat LCD monitor, keyboard and mouse (parts of the former LINUX-3 computer) were put on the second shelf of the computer rack in the computer room near the HP printer. |
449
|
Fri Apr 25 13:53:11 2008 |
josephb | Summary | Computers | Network setup |
This is the promised more detailed summary following Andrey's log ID 444.
What we did was go around to each hub, one at a time, unplug the network connection, and figure out which light on which hub went out. We then went back to the control room, confirmed that we were still able to talk to the devices connected to the hub, and if not, rebooted them. This process was repeated for each hub.
As it stands, the hubs located at the ends of arms (in racks 1X4 and 1Y9) are connected to the really old 24 port 10 Base T hub located in 1Y7. In addition, the 5 port SMC hub is plugged into the 8 port SMC switch in 1Y5 (which actually has enough ports to simply move all the connections over to it, so I'm not sure why there are two...).
All other hubs/switches are connected back to the control room 24 port switch.
Attached is a simple diagram of the network connections for the 40m lab. |
Attachment 1: 40m_network_90.pdf
|
|
453
|
Sat Apr 26 11:21:15 2008 |
ajw | Omnistructure | Computers | backup of /cvs/cds restarted |
The backup of /cvs/cds (which runs as a cron job on fb40m; see /cvs/cds/caltech/scripts/backup/000README.txt)
has been down since fb40m was rebooted on March 3.
I was unable to start it because of conflicting ssh keys in /home/controls/.ssh .
With help from Dan Kozak, we got it to work with both sets of keys
( id_rsa, which allows one to ssh between computers in our 113 network without typing a password,
and backup2PB which allows the cron job to push the backup files to the archive in Powell-Booth).
It still goes down every time one reboots fb40m, and I don't have a solution.
A simple solution is for the script to send an email whenever it can't connect via ssh keys
(requiring a restart of ssh-agent with a passphrase), but email doesn't seem to work on fb40m.
I'll see if I can get help on how to have sendmail run on fb40m. |
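One way the backup script could notice that key-based login has stopped working is to attempt a non-interactive ssh connection before starting the transfer. A minimal sketch, assuming OpenSSH is the client in use; `archive.example` is a placeholder host name, and `keys_work` is a hypothetical helper, not something that exists in the backup scripts:

```python
import subprocess


def ssh_check_command(host, timeout_s=10):
    """Build an ssh command that fails instead of prompting for a passphrase."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                # never prompt; fail if keys are unusable
        "-o", f"ConnectTimeout={timeout_s}",  # give up on an unreachable host
        host,
        "true",                               # trivial remote command
    ]


def keys_work(host, run=subprocess.run):
    """Return True if a key-based ssh login to `host` succeeds.

    `run` is injectable so the check can be exercised without a network.
    """
    try:
        result = run(ssh_check_command(host), capture_output=True)
    except OSError:
        # ssh itself could not be started.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Placeholder host; substitute the real archive machine.
    if not keys_work("archive.example"):
        print("ssh keys unusable: restart ssh-agent before the next backup")
```

Since email reportedly does not work on fb40m, the failure branch here only prints a message; where that message should go (a log file, an EPICS channel, mail once sendmail works) is left open.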
456
|
Sun Apr 27 18:11:58 2008 |
rob | DAQ | Computers | br40m? |
The testpoint manager (which runs on fb40m) crashed this afternoon. Upon re-starting it, I found there was a rogue dtt process on op440m and also a daqd daemon running on br40m. One or both of these caused the tpman to crash. br40m is the frame broadcaster, which is never used here as we don't run DMT. I killed the daqd process there.
The way to find if there is a rogue process is to watch the output to the console from the tpman when you start it:
Allocate new TP handle 56 by 131.215.113.203
Allocate new TP handle 57 by 131.215.113.203
Allocate new TP handle 58 by 131.215.113.203
Allocate new TP handle 59 by 131.215.113.203
Allocate new TP handle 60 by 131.215.113.203
Allocate new TP handle 61 by 131.215.113.203
Allocate new TP handle 62 by 131.215.113.203
Allocate new TP handle 63 by 131.215.113.203
Allocate new TP handle 64 by 131.215.113.203
Allocate new TP handle 65 by 131.215.113.203
Allocate new TP handle 66 by 131.215.113.203
Allocate new TP handle 67 by 131.215.113.203
Allocate new TP handle 68 by 131.215.113.203
If you see something like this, with a new TP handle being allocated every few seconds, you need to log in to the corresponding host and kill whatever process has run away. |
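Watching the console by eye works, but the same check can be scripted. A minimal sketch, assuming the tpman console output has been captured as lines of text in the format shown above; the threshold of 10 is an arbitrary illustration value, not a tpman setting:

```python
import re
from collections import Counter

# Matches lines such as "Allocate new TP handle 56 by 131.215.113.203".
TP_LINE = re.compile(r"Allocate new TP handle (\d+) by (\d+\.\d+\.\d+\.\d+)")


def count_allocations(lines):
    """Count TP handle allocations per requesting host IP."""
    counts = Counter()
    for line in lines:
        match = TP_LINE.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


def suspects(lines, threshold=10):
    """Return host IPs with at least `threshold` allocations in `lines`."""
    return [ip for ip, n in count_allocations(lines).items() if n >= threshold]


if __name__ == "__main__":
    # Rebuild the thirteen lines quoted above (handles 56 through 68).
    log = [f"Allocate new TP handle {n} by 131.215.113.203" for n in range(56, 69)]
    print(count_allocations(log))
    print(suspects(log))
```

The count alone does not capture the "every few seconds" part of the symptom; the lines shown carry no timestamps, so the rate would have to be judged from how quickly the captured output grows.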
457
|
Sun Apr 27 22:57:15 2008 |
ajw | DAQ | Computers | br40m? |
Quote: |
The testpoint manager (which runs on fb40m) crashed this afternoon. Upon re-starting it, I found there was a rogue dtt process on op440m and also a daqd daemon running on br40m. One or both of these caused the tpman to crash. br40m is the frame broadcaster, which is never used here as we don't run DMT. I killed the daqd process there.
The way to find if there is a rogue process is to watch the output to the console from the tpman when you start it:
Allocate new TP handle 56 by 131.215.113.203
Allocate new TP handle 57 by 131.215.113.203
Allocate new TP handle 58 by 131.215.113.203
Allocate new TP handle 59 by 131.215.113.203
Allocate new TP handle 60 by 131.215.113.203
Allocate new TP handle 61 by 131.215.113.203
Allocate new TP handle 62 by 131.215.113.203
Allocate new TP handle 63 by 131.215.113.203
Allocate new TP handle 64 by 131.215.113.203
Allocate new TP handle 65 by 131.215.113.203
Allocate new TP handle 66 by 131.215.113.203
Allocate new TP handle 67 by 131.215.113.203
Allocate new TP handle 68 by 131.215.113.203
If you see something like this, with a new TP handle being allocated every few seconds, you need to log in to the corresponding host and kill whatever process has run away. |
I *think* Alex is responsible for the daqd daemon running on br40m (he set up some new stuff recently, a data concentrator and broadcaster); I'll make sure he sees this post. |
463
|
Thu May 1 12:46:02 2008 |
josephb | Configuration | Computers | Nodus gateway is up |
The computer Nodus is now acting as a gateway machine between the GC network and the martian network in the 40m. It has the same passwords as the rana gateway machine.
Its name on the GC side is nodus (ip: 131.215.115.52) and on the martian side is nodus113 (ip: 131.215.113.200). Will need to update the hosts file on the control room machines so you can just use the name nodus113 rather than the full ip.
Software is still being added to the computer, and it will remain in parallel with the rana gateway machine until everything has been working properly for a week or so. |
464
|
Mon May 5 11:04:30 2008 |
rob | Omnistructure | Computers | Network setup |
Mafalda was not connected to the network, and so our DMF-based seisBLRMS has not been running for ~1 week. I traced this to a broken ethernet cable connecting mafalda to the network switch in the rack next to the B&W printer. This cable has a broken connector at the switch side, which means it can't stay connected if there's any tension. It needs to be replaced. |
473
|
Fri May 9 10:15:36 2008 |
josephb | Update | Computers | Nodus has moved |
Steve and I moved Nodus from under the table in the control room to just above the Rana computer in the control room rack. |
475
|
Tue May 13 10:38:28 2008 |
steve | Update | Computers | rfm network is down |
The RFM network went down yesterday around 5pm.
Only c1susvme1 is alive but its timing is off.
Andrey is bringing the network up.
Andrey adds that our situation is very similar to the one described by Rana in his elog entry #353 (March 3). All of the rectangular boxes are red, except for SUS1-c1susvme1 and AWG (only these two rectangles are green). |
476
|
Wed May 14 13:14:19 2008 |
Andrey | Summary | Computers | Reflective Memory Network is restored |
Reflective Memory Network is restored, all watchdogs and oplevs are returned to the "enabled" state.
In order to revive the computers, several things were done.
1) Following Mr. Adhikari's elog entry #353, I walked around the interferometer room, and switched off the power keys in all crates with computers whose names are contained in the MEDM Reflective Memory screen, including the rack with the framebuilder. By the way, it was nontrivial to find the switch in the 1Y4 crate that would shut off/on processors "c1susvme1" and "c1susvme2": the switch turned out to be located at the rear side of the crate, and it is not a key but it is a button.
2) I tried to follow the wiki-40 computer restart procedures, but every time I tried to run "startup.cmd" from the corresponding target subdirectory, I got the error message "Device or resource busy".
By the way, one more thing was learned: if you first open burtgooey in a terminal, select the snap file, then reboot the processor, and then try to burt-restore it, you will get the message "Status Not OK". To actually burt-restore a processor that was recently rebooted, you need to close the terminal running burtgooey and start burtgooey in a new terminal window opened after the reboot.
Since following the wiki-40 procedures was not reviving the computers, I invited Alex Ivanov.
3) Alex tried to touch the memory card in "c1iovme" in rack 1Y2, because once before this card failed causing network problems, but this did not help.
4) We shut off and restarted (by pressing the power button) the black Linux machine "c1dcuepics" (located at the very bottom, below the framebuilder). Alex says that this machine is responsible for all EPICS. It had not been restarted for 182 days, and probably some process there went wrong.
After restarting "c1dcuepics" we were able to follow the wiki-40 procedures for restarting all the other computers (whose names are on the MEDM RFM network screen). We ran the corresponding "startup.cmd" files and burt-restored them without error messages.
Now all the computers work and communicate in a proper way.
Mr. Joseph Betzwieser helped me with all these activities (we decided that this is more important than cameras for now), thanks to him. But our joint skills turned out to be insufficient, so Alex Ivanov's contribution was the most important. |
477
|
Wed May 14 14:05:40 2008 |
Andrey | Update | Computers | Computer Linux-2, MEDM screen "Watchdogs" |
Computer "Linux-2", MEDM screen "C1SUS_Watchdogs.adl": there is no indication for ETMY watchdogs, everything is white. There is information on that screen "C1SUS_Watchdogs.adl" about all other systems (MC, ETMX,...), but something is wrong with indicators for ETMY on that particular control computer. |
486
|
Sun May 18 18:59:15 2008 |
rana | Configuration | Computers | cron and hosts |
I added rosalba to the hosts file for the control room machines (131.215.113.103).
I also removed the updatedb cron from our op440m crontab because it was running at 5 PM
even though I had set it to run at 5:57 AM. If it still runs then, it must be because of
another crontab. |
492
|
Thu May 22 11:25:19 2008 |
josephb | Configuration | Computers | |
One of the new Netgear Prosafe 24 port switches was mounted in the 1X4 rack, roughly in the middle, away from the top and bottom rack mounted electronics. At the moment, its IP has been set to 131.215.113.250, gateway 131.215.113.2 (which is what I saw as the only listed gateway on linux1 using route -n) and mask 255.255.255.0.
I'm planning to set the next three IP addresses for the switches as *.251, *.252 and *.253, which don't look to have been used yet. |
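The address plan can be sanity-checked with Python's standard `ipaddress` module. A minimal sketch using only the numbers quoted in this entry; `check_plan` is a hypothetical helper, and it checks internal consistency only (it cannot tell whether an address is already in use on the live network):

```python
import ipaddress

# Values taken from the entry above.
network = ipaddress.ip_network("131.215.113.0/255.255.255.0")
gateway = ipaddress.ip_address("131.215.113.2")
switches = [ipaddress.ip_address(f"131.215.113.{host}") for host in (250, 251, 252, 253)]


def check_plan(network, gateway, addresses):
    """Return a list of problems with the address plan (empty if none)."""
    problems = []
    if gateway not in network:
        problems.append(f"gateway {gateway} is outside {network}")
    seen = set()
    for addr in addresses:
        if addr not in network:
            problems.append(f"{addr} is outside {network}")
        if addr in (network.network_address, network.broadcast_address):
            problems.append(f"{addr} is reserved")
        if addr in seen or addr == gateway:
            problems.append(f"{addr} is already taken in this plan")
        seen.add(addr)
    return problems


if __name__ == "__main__":
    print(check_plan(network, gateway, switches))
```

An empty list means the four switch addresses sit inside the subnet implied by the mask, avoid the network and broadcast addresses, and do not collide with each other or with the gateway.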
495
|
Sun May 25 16:20:27 2008 |
rana | Configuration | Computers | joinPDF |
I have installed joinPDF 2.1 on rosalba. Since it's written in Java, I didn't have to tinker with it at all to work on a 64-bit machine. Now Caryn can put all of her plots into 1 file. |
498
|
Sun May 25 21:14:14 2008 |
tobin | Configuration | Computers | EPICS proxy server |
I set up an EPICS gateway server on Nodus so that we can look at 40m MEDM screens from off-site.
The gateway is set up to allow read access to all channels and write access to none of them.
The executable is /cvs/cds/epics/extensions/gateway; it was already installed. A script to start
up the gateway is in target/epics-gateway. For the time being, I haven't set it up to start itself
on boot or anything like that.
To make it work, you have to set the environment variable EPICS_CA_ADDR_LIST to the IP address of
Nodus. For instance, something like this should work:
setenv EPICS_CA_ADDR_LIST 131.215.115.52
On Windows you can set up environment variables in the "System" Control Panel. On one of the tabs
there's a button that lets you set up environment variables that will be visible to all programs.
On Andrey's machine I installed the Windows EPICS extensions, i.e. MEDM and its friends. I also
installed the cool Tortoise SVN client which lets you interact with SVN repositories through
the windows explorer shell. (The right-click menu now contains SVN options.) I checked out
the MEDM directory from the 40m SVN onto the desktop. You should be able to just right-click in
that window and choose "SVN Update" to get all the newest screens that have been contributed to
SVN; however, there are currently some problems with the 40m SVN that make that not go smoothly.
At the moment on Andrey's (Windows) machine you can go into the MEDM folder and double-click on
any screen and it will just work, with the exception that not all the screens are installed
due to SVN difficulties. |
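For scripts that launch EPICS clients, the same variable can be set per process rather than system-wide. A minimal sketch, assuming the client is started from Python; `epics_client_env` is a hypothetical helper, and the address is the Nodus one given above:

```python
import os

NODUS_GC_IP = "131.215.115.52"  # Nodus address on the GC side, from the entry above


def epics_client_env(base_env=None, gateway_ip=NODUS_GC_IP):
    """Return a copy of the environment with EPICS_CA_ADDR_LIST set.

    The caller's environment is copied, not modified in place.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["EPICS_CA_ADDR_LIST"] = gateway_ip
    return env


if __name__ == "__main__":
    env = epics_client_env()
    print(env["EPICS_CA_ADDR_LIST"])
    # The dictionary can then be passed to a child process, for example:
    #   subprocess.run(["medm", "some_screen.adl"], env=env)
    # "some_screen.adl" is a placeholder, not a file from the 40m tree.
```

Because the gateway grants read access only, a client started this way can display channels but any attempt to write should be refused.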
501
|
Wed May 28 12:51:32 2008 |
josephb | Configuration | Computers | Two more switches mounted |
Two more Prosafe 24 port switches have been mounted in the racks, one in 1Y9 and one in 1Y6. (The first one was placed in 1X4).
The one in 1Y9 has been set to an IP address of 131.215.113.251, while the one in 1Y6 is set to 131.215.113.252, and these have been labeled as such. |
508
|
Fri May 30 21:30:15 2008 |
tobin | Configuration | Computers | svn on solaris |
I installed svn on op440m. This involved installing the following packages from sunfreeware:
apache-2.2.6-sol9-sparc-local libiconv-1.11-sol9-sparc-local subversion-1.4.5-sol9-sparc-local
db-4.2.52.NC-sol9-sparc-local libxml2-2.6.31-sol9-sparc-local swig-1.3.29-sol9-sparc-local
expat-2.0.1-sol9-sparc-local neon-0.25.5-sol9-sparc-local zlib-1.2.3-sol9-sparc-local
gdbm-1.8.3-sol9-sparc-local openssl-0.9.8g-sol9-sparc-local
The packages are located in /cvs/cds/caltech/apps/solaris/packages. The command line to install
a package is "pkgadd -d " followed by the package name. This can be repeated on nodus to get
svn over there. (Kind of egregious to require an apache installation for the svn _client_, I
know.) |
509
|
Sun Jun 1 19:25:10 2008 |
rana | Configuration | Computers | new monitor on op440m |
I installed the new 24" flat screen on op440m. I increased the screen resolution from 1280x1024 to 1900x1200 using
the obscure 'fbconfig' command. You can Google it if you want.
The old monitor is on the surplus cart. If you are reading this and think you might walk from the 40 over to
Bridge, please wheel the cart full of old computer equipment (on the north side of the control room) over to Larry.
I also copied over all the images on the D40 to a folder on Kirk's computer and deleted the originals.
Dan Busby also visited us last week to help us move the drill press from the Y arm down into the sub basement
of W Bridge. |
Attachment 1: Andrey-440.jpg
|
|
Attachment 2: Busby08.jpg
|
|
510
|
Sun Jun 1 19:39:35 2008 |
tobin | Configuration | Computers | elog, etc |
Phil Ehrens gave me a DVD of the 40m elog, apache, and (Jamie's) SVN archive.
I copied it to nodus:/home/controls/dvd-from-ehrens. Once we get the elog
running on nodus, we can copy the datafile over again from dziban (so that
we don't lose any elog entries) and switch over. |
513
|
Tue Jun 3 10:19:45 2008 |
tobin | Configuration | Computers | big machine |
Several of us transported the big new awesome Sun box from Bridge over to
the 40m last week. If I recall correctly, it's a SunFire X4600 with
something like sixteen 64-bit AMD processor cores at 2.8 GHz. It sounds
like a jet engine when it starts up (before the cooling fans are throttled
back) and has four power supplies (each with its own connection
to the wall). It has slick removable hard disks and fan units too. Our
working name for it is "megatron".
Anyway. It came with two hard disks, one with Solaris 10 installed. I took
the other hard disk over to Alex, who copied a Realtime Linux installation
onto it. Alex says it boots and runs fine.
It remains for you guys to install the machine onto rails and install the
whole thing into a rack. Before it goes into service as a realtime control
machine, you might as well install Matlab on it and do some heavy-duty
computation.
 |
514
|
Tue Jun 3 10:40:27 2008 |
tobin | Configuration | Computers | new dataviewer |
Alex let me know the secret location of the latest dataviewer executable for Linux. It is:
http://www.ligo.caltech.edu/~aivanov/upload/dv/Control/dc3
If your linux dataviewer on linux2 has the "year field not filled in" bug, you should download this into /usr/local/bin/dc3 (after making a backup of that file).
It looks like there's no dataviewer installed on rosalba yet. We should figure out a better directory layout for the linux machines; currently dataviewer is installed locally on linux2. It should be in /cvs/cds/caltech/apps/linux/something so that all the linux machines see the same installation. |
538
|
Wed Jun 18 16:07:57 2008 |
rob | Summary | Computers | RFM network down |
The RFM network tripped off around noon today. It's still down. The problem appears to be with the EPICS interface (c1dcuepics). Trying to restart one of the end stations yields the error: No response from EPICS.
Possible causes include (but not limited to): busted RFM card on c1dcuepics, busted PMC bus on c1dcuepics, busted fiber from c1dcuepics to the RFM switch. We need Alex. |
544
|
Wed Jun 18 18:50:09 2008 |
rana | Update | Computers | It can only be attributable to human error. (HAL - 2001) |
There has been another one of "those" events and all of the front end machines are down.
We poked around and Rob determined that the FEs can't get the EPICS data from EPICS. The
dcuepics machine is hooked up and running and all of the epics binaries are running. We also
tried resetting its RFM switch as well as power cycling the box using the "poweroff" command.
Not a sausage.
Rob points out that although the Signal Detect lights are on on the cards, the 'Own Data' light
is not on on the dcuepics' card although it is on for some of the cards on the other boxes.
We have placed messages with the Russian. If anyone sees him, don't let him go without fixing things.
Also, make sure to follow him around with notepad and possibly a camera to record what it is that
he does. If he's muttering, maybe try to use a sensitive hidden sound recorder. |
545
|
Thu Jun 19 15:52:06 2008 |
Alberto | Configuration | Computers | Measure of the current absorbed by the new Megatron Computer |
Rich Abbot, Sam Abbot and I measured the current drawn by the new Megatron computer that we installed yesterday in the 1Y3 rack. The computer alone draws 8.1 A at startup and then settles to 5.9 A in steady state. The rest of the rack took 5.2 A without the computer, so the whole rack needs 13.3 A at startup and then 11.1 A in steady state.
We also measured the current for the 1Y6 rack, where another similar Sun machine has been installed as a temporary frame builder, and got 6.5 A.
Alberto, Rich and Sam Abbot |
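The rack totals are just the computer's draw added to the rest of the rack. A short sketch of that arithmetic, using the measured values quoted above; `Decimal` is used so that the sums come out as exact tenths of an ampere:

```python
from decimal import Decimal

# Currents in amperes, as quoted in the entry above.
MEGATRON_STARTUP = Decimal("8.1")
MEGATRON_STEADY = Decimal("5.9")
REST_OF_RACK = Decimal("5.2")


def rack_total(computer_current, rest_of_rack=REST_OF_RACK):
    """Total rack current: the computer plus everything else in the rack."""
    return computer_current + rest_of_rack


if __name__ == "__main__":
    print("startup:", rack_total(MEGATRON_STARTUP), "A")
    print("steady state:", rack_total(MEGATRON_STEADY), "A")
```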
586
|
Fri Jun 27 19:59:44 2008 |
John | Update | Computers | c1iovme |
C1susvme2 and C1iovme crashed which sent the optics swinging and tripped the watchdogs.
Koji and I were able to restore c1susvme2 without any trouble.
We have been unable to revive c1iovme. We have tried telneting in and running startup.cmd,
the process runs for a while then hangs with "DAQ init failed -- exiting".
Resetting the board doesn't help. I didn't try keying the whole crate.
All optics are back to normal with damping restored. |
587
|
Sat Jun 28 03:10:25 2008 |
rob | Update | Computers | c1iovme |
Quote: | C1susvme2 and C1iovme crashed which sent the optics swinging and tripped the watchdogs.
Koji and I were able to restore c1susvme2 without any trouble.
We have been unable to revive c1iovme. We have tried telneting in and running startup.cmd,
the process runs for a while then hangs with "DAQ init failed -- exiting".
Resetting the board doesn't help. I didn't try keying the whole crate.
All optics are back to normal with damping restored. |
I tried keying the crate, then keying the DAQ controller & AWG, then powering down & restarting the framebuilder.
On coming up, the framebuilder doesn't start a daqd process, and I can't get one to start by hand (it just prints "652", and then stops).
No error messages and daqd doesn't appear in the prstat.
I then tried keying the DAQ controller again (after the fb0 reboot), which blew the watchdogs on all the suspensions. So then I went around and keyed all the crates.
Now, the suspension controllers are back online. Still no c1iovme, and now the framebuilder/DAQ/AWG are also hosed. We can try keying all the crates again, in the order that Yoichi did last week.
After some more poking around, I found the daqd log file. It's now complaining about
Jun 28 03:00:39 fb daqd[546]: [ID 355684 user.info] Fatal error: channel `C1:PSL-FSS_MIXERM_F' is duplicated 126
This is the second error message like this. It first complained about C1:PSL-FSS_FAST_F, so I commented that out of C1IOOF.ini and rebooted the framebuilder (note that this is an actual reboot of the full Solaris machine). Eventually I discovered that C1IOOF.ini and C1IOO.ini are essentially identical. We will presumably keep getting these duplicate channel errors until one of them is completely removed.
C1IOO.ini has a modification time of seven PM on Friday night. Who did this and didn't elog it? I've now modified C1IOOF.ini, and I don't remember when it was last modified. |
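Rather than finding duplicated channels one daqd restart at a time, the .ini files can be scanned up front. The following is a minimal Python sketch, not an existing lab tool. It assumes the channel files use `[CHANNEL_NAME]` section headers, `#` for commented-out lines, and a `[default]` section that is not a channel; all function names are hypothetical. It also scans whatever files it is given, whereas daqd presumably only reads the files it is configured to load.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches an uncommented section header such as "[C1:PSL-FSS_FAST_F]".
SECTION_RE = re.compile(r"^\s*\[([^\]]+)\]")


def channels_in(text):
    """Return the channel (section) names found in one .ini file's text."""
    names = []
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # commented-out channel
        match = SECTION_RE.match(line)
        if match and match.group(1).lower() != "default":
            names.append(match.group(1))
    return names


def find_duplicates(ini_texts):
    """Map each channel defined more than once to the files that define it.

    ini_texts: dict of {file name: file contents}.
    """
    seen = defaultdict(list)
    for fname, text in ini_texts.items():
        for chan in channels_in(text):
            seen[chan].append(fname)
    return {chan: files for chan, files in seen.items() if len(files) > 1}


def find_duplicates_in_dir(directory):
    """Convenience wrapper: scan every *.ini file in a directory."""
    texts = {p.name: p.read_text() for p in sorted(Path(directory).glob("*.ini"))}
    return find_duplicates(texts)
```

Running `find_duplicates_in_dir` on the channel directory before restarting the framebuilder would list every clash at once, including the case where two files are near copies of each other.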
588
|
Sat Jun 28 14:56:44 2008 |
John | Update | Computers | ini files |
In short, I was editing the ini files yesterday evening and didn't e-log it. After some investigation this afternoon, it appears
that I am to blame for all the computer problems which followed.
I wanted to edit C0EDCU.ini and C1IOOF.ini to change C1:PSL-FSS_FAST to a fast channel as C1:PSL-FSS_FAST_F
was dead.
I opened these files and made backups. It appears this is where it all went awry. My backup for C1IOOF
is called C1IOO.ini.090627 i.e. missing the F.
Later c1susvme2 and c1iovme crashed. After failing to bring c1iovme back I wondered if my edits had
caused the problems so I restored the backup files. It appears that here I wrote over C1IOO with my backup of
C1IOOF (presumably because I had made a typo in the name).
To remedy the situation we could restore C1IOO from e.g. chans/archive/C1IOO_080618_160028.ini
No excuses for not e-logging this activity. |
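One way to avoid this kind of slip is to never type the backup name by hand. The sketch below is a hypothetical Python helper, not an existing 40m script: it derives the backup name from the source path and derives the restore target from the backup name, so a backup of C1IOOF.ini can only ever be restored onto C1IOOF.ini. The timestamp format is an assumption.

```python
import shutil
import time
from pathlib import Path


def backup_path(src, stamp=None):
    """Build the backup name from the source path itself, so it cannot be mistyped."""
    src = Path(src)
    stamp = stamp or time.strftime("%y%m%d_%H%M%S")
    return src.with_name(f"{src.name}.{stamp}")


def make_backup(src, stamp=None):
    """Copy src next to itself as '<name>.<stamp>' and return the backup path."""
    dst = backup_path(src, stamp)
    if dst.exists():
        raise FileExistsError(f"refusing to overwrite {dst}")
    shutil.copy2(src, dst)
    return dst


def restore_backup(backup):
    """Restore a backup onto the file it was made from (the name minus the stamp)."""
    backup = Path(backup)
    original = backup.with_suffix("")  # drop the trailing '.<stamp>'
    shutil.copy2(backup, original)
    return original
```

For example, `make_backup("C1IOOF.ini")` followed later by `restore_backup` on the returned path touches only C1IOOF.ini and its own backup.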
589
|
Sat Jun 28 23:23:50 2008 |
John | Update | Computers | Rebooting |
All of the computers are now showing green lights.
Remaining problems:
Alignment scripts are failing with "ERROR: LDS - NDS server error #13"
I think this is a server transmission error.
Dataviewer shows all channels as zero. |
592
|
Sun Jun 29 14:53:02 2008 |
rob | Update | Computers | Rebooting |
Quote: | All of the computers are now showing green lights.
Remaining problems:
Alignment scripts are failing with "ERROR: LDS - NDS server error #13"
I think this is a server transmission error.
Dataviewer shows all channels as zero. |
Fixed. Just started the testpoint manager on fb40m.
su
/usr/controls/tpman &
|
593
|
Sun Jun 29 18:58:43 2008 |
rana | Summary | Computers | 1e20 is too big for AWG and/or IOVME |
While testing out my matlab/awgstream based McWFS diagnostic script I accidentally put a
huge excitation into C1:IOO-WFS1_PIT_EXC . This went to 1e20 and then caused
some SUS to trip and c1susvme2 to go red. I tried booting it via the normal procedures
but it wouldn't come back, even after 2 crate power cycles. I also tried booting AWG
via the vmeBusReset, but that didn't do it. Then I booted c1iovme from the telnet prompt
and then I could restart c1susvme2 successfully.
The reason the excitation was so large is that the following filter command is unstable:
[b,a] = butter(4,[0.02 30]/1024);
The low-pass part is OK, but it looks like making a digital filter with such a low corner frequency
is not. Que lastima. On the bright side, the code now has some excitation amplitude
checking. |
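The amplitude checking mentioned above could look roughly like the following. This is a Python sketch of the idea only, not the check that was added to the MATLAB script; the limit value and the function name are hypothetical.

```python
import math

# Hypothetical ceiling on the excitation amplitude, in counts. The real limit
# depends on the channel and is not given in the entry.
MAX_EXC_COUNTS = 1000.0


def check_excitation(samples, limit=MAX_EXC_COUNTS):
    """Raise ValueError if any sample is non-finite or exceeds the limit."""
    for i, x in enumerate(samples):
        if not math.isfinite(x):
            raise ValueError(f"sample {i} is not finite: {x!r}")
        if abs(x) > limit:
            raise ValueError(f"sample {i} has amplitude {abs(x):g} > limit {limit:g}")
    return samples


# A waveform from an unstable filter grows without bound and is rejected.
runaway = [1.5 ** n for n in range(200)]
try:
    check_excitation(runaway)
    rejected = False
except ValueError:
    rejected = True
```

Calling such a check on the filtered waveform just before it is handed to awgstream would stop a runaway output like the 1e20 excitation from reaching the excitation channel.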
606
|
Mon Jun 30 16:00:02 2008 |
josephb, sam | Configuration | Computers | |
Sam and I set up a Cat6 cable from Megatron to the 1Y6 Switch (131.215.113.252) and also connected the 1Y6 Hub to the control room switch.
While I was at it, I checked the configurations of the two switches now connected to the martian network (one in 1X4 and one in 1Y6). For some reason, the 1X4 switch had changed to DHCP enabled and was using 131.215.113.105 as its IP address. I thought I had set it up correctly initially, so I am not sure what caused the change.
The easiest way I know of to check the setup is to use the smartwizard discovery program from the Netgear install CD (in the equipment manual file cabinet of the control room) on a Windows machine. The passwords have been set to the controls password.
Megatron should now see and be accessible through the martian network. |