ID |
Date |
Author |
Type |
Category |
Subject |
1911
|
Sat Aug 15 18:35:14 2009 |
Clara | Frogs | Computers | How far back is complete data really saved? (or, is the cake a lie?) | I was told that, as of last weekend, we now have the capability to save full data for a month, whereas before it was something like 3 days. However, my attempts to get the data from the accidentally-shorted EW2 channel in the Guralp box have all been epic failures. My other data is okay, despite my not saving it for several days after it was recorded. So, my question is, how long can the data actually be saved, and when did the saving capability change? |
1916
|
Mon Aug 17 02:12:53 2009 |
Yoichi | Summary | Computers | FE bootfest | Rana, Yoichi
All the FE computers went red this evening.
We power cycled all of them.
They are all green now.
Unrelated to this, the CRT display of op540m has not been working since Friday night.
We are not sure if it is the failure of the display or the graphics card.
Rana started alarm handler on the LCD display as a temporary measure. |
1934
|
Fri Aug 21 17:49:47 2009 |
Yoichi | Summary | Computers | Upgrade FE conceptual plan | I started to draw a conceptual diagram of the upgraded FE system.
http://nodus.ligo.caltech.edu:30888/FE/FE-Layout.html
(It takes some time to load the page for the first time)
In some places, tooltips pop up when you hover the cursor over an object.
You can also click on the LSC block to see inside.
This is just a starting point, so we should add more details.
The sources of the diagrams are in the svn:
https://nodus.ligo.caltech.edu:30889/svn/trunk/docs/upgrade08/FE/
I used yEd (http://www.yworks.com/en/products_yed_about.html) to make
the drawings. It is Java, so works on many platforms. |
Attachment 1: Picture_2.png
|
|
1936
|
Mon Aug 24 10:43:27 2009 |
Alberto | Omnistructure | Computers | RFM Network Failure | This morning I found all the front end computers down. A failure of the RFM network had brought all the computers down.
I was about to restart them all, but it wasn't necessary. After I power cycled and restarted C1SOSVME all the other computers and RFM network came back to their green status on the MEDM screen. After that I just had to reset and then restart C1SUSVME1/2. |
1939
|
Tue Aug 25 01:27:09 2009 |
rana | Configuration | Computers | Raid update to Framebuilder (not quite) |
Quote: |
The RAID array servicing the Frame builder was finally switched over to JetStor Sata 16 Bay raid array. Each bay contains a 1 TB drive. The raid is configured such that 13 TB is available, and the rest is used for fault protection.
The old Fibrenetix FX-606-U4, a 5 bay raid array which only had 1.5 TB space, has been moved over to linux1 and will be used to store /cvs/cds/.
This upgrade increases the lookback time from 3-4 days for all channels to about 30 days. Final copying of old data occurred on August 5th, 2009, and the switchover happened on that date.
|
Sadly, this was only true in theory and we didn't actually check to make sure anything happened.
We are not able to get lookback of more than ~3 days using our new huge disk. Doing a 'du -h' it seems that this is because we have not yet set the framebuilder to keep more than its old amount of frames. Whoever sees Joe or Alex next should ask them to fix us up. |
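One quick way to check the actual lookback is to compare the oldest and newest frame files on disk. The following is only a minimal sketch: it assumes the frame files follow the usual `<site>-<type>-<GPS start>-<duration>.gwf` naming, and the function name and example file names are hypothetical, not taken from the framebuilder setup.

```python
import re

# Assumed naming convention: <site>-<type>-<GPS start>-<duration>.gwf
FRAME_RE = re.compile(r"^(?P<site>[A-Z]+)-(?P<tag>\w+)-(?P<gps>\d+)-(?P<dur>\d+)\.gwf$")


def lookback_days(filenames):
    """Return the time span, in days, covered by the given frame file names."""
    spans = []
    for name in filenames:
        match = FRAME_RE.match(name)
        if match:
            start = int(match.group("gps"))
            spans.append((start, start + int(match.group("dur"))))
    if not spans:
        return 0.0
    oldest = min(start for start, _ in spans)
    newest = max(end for _, end in spans)
    return (newest - oldest) / 86400.0


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    names = ["C-R-935000000-16.gwf", "C-R-935259184-16.gwf"]
    print(f"lookback: {lookback_days(names):.1f} days")
```

Feeding it the directory listing of the frame area (e.g. via `os.listdir`) would show whether the framebuilder is still trimming at about 3 days.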
1944
|
Tue Aug 25 21:26:12 2009 |
Alberto | Update | Computers | elog restarted | I just found the elog down and I restarted it. |
1947
|
Tue Aug 25 23:16:09 2009 |
Alberto, rana | Configuration | Computers | elog moved in to the cvs path | In nodus, I moved the elog from /export to /cvs/cds/caltech. So now it is in the cvs path instead of a local directory on nodus.
For a while, I'll leave a copy of the old directory containing the logbook subdirectory where it was. If everything works fine, I'll delete that.
I also updated the reboot instructions in the wiki. Some of it is also now in the SVN. |
1971
|
Mon Sep 7 23:51:48 2009 |
rana | Configuration | Computers | matlab installed: 64-bit linux | I have wiped out the 2008a install of 64-bit linux matlab and installed 2009a in its place. Enjoy. |
1981
|
Thu Sep 10 15:55:44 2009 |
Jenne | Update | Computers | c1ass rebooted | c1ass had not been rebooted since before the filesystem change, so when I was sshed into c1ass I got an error saying that the NFS was stale. Sanjit and I went out into the cleanroom and powercycled the computer. It came back just fine. We followed the instructions on the wiki, restarting the front end code, the tpman, and did a burt restore of c1assepics. |
1982
|
Thu Sep 10 17:47:25 2009 |
Jenne | Update | Computers | changes to the startass scripts | [Rana, Jenne]
While I was mostly able to restart the c1ass computer earlier today, the filter banks were acting totally weird. They were showing input excitations when we weren't putting any, and they were showing that the outputs were all zero, even though the inputs were non-zero and the input and the output were both enabled. The solution to this ended up being to use the 2nd to last assfe.rtl backup file. Rana made a symbolic link from assfe.rtl to the 2nd to last backup, so that the startup.cmd script does not need to be changed whenever we alter the front end code.
The startup_ass script, in /caltech/target/gds/ which, among other things, starts the awgtpman was changed to match the instructions on the wiki Computer Restart page. We now start up the /opt/gds/awgtpman . This may or may not be a good idea though, since we are currently not able to get channels on DTT and Dataviewer for the C1:ASS-TOP_PEM channels. When we try to run the awgtpman that the script used to try to start ( /caltech/target/gds/bin/ ) we get a "Floating Exception". We should figure this out though, because the /opt/gds/awgtpman does not let us choose 2kHz as an option, which is the rate that the ASS_TOP stuff seems to run at.
The last fix made was to the screen snapshot buttons on the C1:ASS_TOP screen. When the screen was made, the buttons were copied from one of the other ASS screens, so the snapshots saved on the ASS_TOP screen were of the ASS_PIT screen. Not so helpful. Now the update snapshot button will actually update the ASS_TOP snapshot, and we can view past ASS_TOP shots. |
1989
|
Thu Sep 17 14:17:04 2009 |
rob | Update | Computers | awgtpman on c1omc failing to start | [root@c1omc controls]# /opt/gds/awgtpman -2 &
[1] 16618
[root@c1omc controls]# mmapped address is 0x55577000
32 kHz system
Spawn testpoint manager
no test point service registered
Test point manager startup failed; -1
[1]+ Exit 1 /opt/gds/awgtpman -2
|
1990
|
Thu Sep 17 15:05:47 2009 |
rob | Update | Computers | awgtpman on c1omc failing to start |
Quote: |
[root@c1omc controls]# /opt/gds/awgtpman -2 &
[1] 16618
[root@c1omc controls]# mmapped address is 0x55577000
32 kHz system
Spawn testpoint manager
no test point service registered
Test point manager startup failed; -1
[1]+ Exit 1 /opt/gds/awgtpman -2
|
This turned out to be fallout from the /cvs/cds transition. Remounting and restarting fixed it. |
1994
|
Wed Sep 23 17:32:37 2009 |
rob | AoG | Computers | Gremlins in the RFM | A cosmic ray struck the RFM in the framebuilder this afternoon, causing hours of consternation. The whole FE system is just now coming back up, and it appears the mode cleaner is not coming back to the same place (alignment).
rob, jenne |
1996
|
Wed Sep 23 20:02:11 2009 |
Jenne | AoG | Computers | Gremlins in the RFM |
Quote: |
A cosmic ray struck the RFM in the framebuilder this afternoon, causing hours of consternation. The whole FE system is just now coming back up, and it appears the mode cleaner is not coming back to the same place (alignment).
rob, jenne
|
Jenne, Rana, Koji
The mode cleaner has been realigned, using a combination of techniques. First, we used ezcaservo to look at C1:SUS-MC(1,3)_SUS(DOF)_INMON and drive C1:SUS-MC(1,3)_(DOF)_COMM, to put the MC1 and MC3 mirrors back to their DriftMon values. Then we looked at the MC_TRANS_SUM on dataviewer and adjusted the MC alignment sliders by hand to maximize the transmission. Once the transmission was reasonably good, we saw that the spot was still a little high, and the WFS QPDs weren't centered. So Koji and I went out and centered the WFS, and now the MC is back to where it used to be. The MC_TRANS QPD looks nice and centered, so the pointing is back to where it used to be. |
2005
|
Fri Sep 25 19:56:08 2009 |
rana | Configuration | Computers | NTPD restarted on c1dcuepics (to fix the MEDM screen times) | restarted ntp on op440m using this syntax
>su
>/etc/init.d/xntpd start -c /etc/inet/ntp.conf
Getting the time on scipe25 (for the MEDM screen time) working was tougher. The /etc/ntp.conf file was pointing
to the wrong server. Our NAT / Firewall settings require some of our internal machines to go through the gateway
to get NTPD to work. Curiously, some of the linux workstations don't have this issue.
The internal network machines should all have the same file as scipe25's /etc/ntp.conf:
server nodus
and here's how to check that it's working:
[root@c1dcuepics sbin]# ./ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 nodus.ligo.calt 0.0.0.0         16 u    -   64    0    0.000    0.000 4000.00
*nodus.ligo.calt usno.pa-x.dec.c  2 u   29   64  377    1.688  -65.616   6.647
-lime7.adamantsy clock.trit.net   3 u   32   64  377   37.448  -72.104   4.641
-montpelier.ilan .USNO.           1 u   19   64  377   18.122  -74.984   8.305
+spamd-0.gac.edu nss.nts.umn.edu  3 u   28   64  377   72.086  -66.787   0.540
-mighty.poclabs. time.nist.gov    2 u   30   64  377   71.202  -61.127   4.067
+monitor.xenscal clock.sjc.he.ne  2 u   16   64  377   11.855  -67.105   6.368
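For checking many machines, the ntpq billboard above can be parsed rather than read by eye. This is a rough sketch, assuming the standard ten-column `ntpq -p` layout; the function names are made up for illustration. The `*` tally marks the peer currently selected for synchronization, `reach` is an octal shift register (377 means the last eight polls all succeeded), and the delay, offset and jitter columns are in milliseconds.

```python
TALLY_CHARS = "*+-#"  # subset of ntpq tally codes handled in this sketch


def parse_ntpq(text):
    """Parse 'ntpq -p' output into a list of peer dictionaries."""
    peers = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("="):
            continue
        tally = line[0] if line[0] in TALLY_CHARS else " "
        fields = (line[1:] if tally != " " else line).split()
        # Skip the header, the shell prompt, and anything malformed.
        if len(fields) != 10 or fields[0] == "remote":
            continue
        peers.append({
            "tally": tally,
            "remote": fields[0],
            "refid": fields[1],
            "stratum": int(fields[2]),
            "reach": int(fields[6], 8),    # octal reachability register
            "delay_ms": float(fields[7]),
            "offset_ms": float(fields[8]),
            "jitter_ms": float(fields[9]),
        })
    return peers


def selected_peer(peers):
    """Return the peer marked '*' (current sync source), or None."""
    return next((p for p in peers if p["tally"] == "*"), None)
```

If `selected_peer` returns None, or the selected peer has a `reach` of 0, the machine is not actually synchronizing even though ntpd is running.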
|
2026
|
Wed Sep 30 01:04:56 2009 |
rob | Update | Computers | grief | much grief. somehow a burt restore of c1iscepics failed to work, and so the LSC XYCOM settings were not correct. This meant that the LSC whitening filter states were not being correctly set and reported, making it difficult to lock for at least the last week or so. |
Attachment 1: C1LSC_XYCOM.jpg
|
|
2028
|
Wed Sep 30 12:21:08 2009 |
Jenne | Update | Computers | restarted the elog | I'm blaming Zach on this one, for trying to upload a pic into the ATF elog (even though he claims it was small....) Blame assigned: check. Elog restarted: check. |
2041
|
Fri Oct 2 14:52:55 2009 |
rana | Update | Computers | c1susvme2 timing problems update | The attached shows the 200 day '10-minute' trend of the CPU meters and also the room temperature.
To my eye there is no correlation between the signals. It's clear that the c1susvme2 CPU load (SRM LOAD) is going up, and there is no evidence that temperature is the cause.
|
Attachment 1: Untitled.png
|
|
2042
|
Fri Oct 2 15:11:44 2009 |
rob | Update | Computers | c1susvme2 timing problems update update | It got worse again, starting with locking last night, but it has not recovered. Attached is a 3-day trend of SRM cpu load showing the good spell. |
Attachment 1: srmcpu3.png
|
|
2067
|
Thu Oct 8 11:10:50 2009 |
josephb, jenne | Update | Computers | EPICs Computer troubles | At around 9:45 the RFM/FB network alarm went off, and I found c1asc, c1lsc, and c1iovme not responding.
I went out to hard restart them, and also c1susvme1 and c1susvme2 after Jenne suggested that.
c1lsc seemed to come back initially, but not fully. I was able to ssh in and run the start command. The green light under c1asc on the RFMNETWORK status page lit, but the reset and CPU usage information is still white, as if it's not connected. If I try to load an LSC channel, such as the PD5_DC monitor, as a testpoint in DTT it works fine, but the 16 Hz monitor version for EPICs is dead. The fact that we were able to ssh into it means the network is working at least somewhat.
I had to reboot c1asc multiple times (3 times total), waiting a full minute on the last power cycle, before being able to telnet in. Once I was able to get in, I restarted the startup.cmd, which did set the DAQ-STATUS to green for c1asc, but it's having the same lack of communication with EPICs as c1lsc.
c1iovme was rebooted, was able to telnet in, and started the startup.cmd. The status light went green, but still no epics updates.
The crate containing c1susvme1 and c1susvme2 was power cycled. We were able to ssh into c1susvme1 and restart it, and it came back fully: status light, CPU load and channels all working. However, c1susvme2 was still having problems, so I power cycled the crate again. This time c1susvme2 came back, its status light lit green, and its channels started updating.
At this point, lacking any better ideas, I'm going to do a full reboot, cycling c1dcuepics and proceeding through the restart procedures. |
2068
|
Thu Oct 8 11:37:59 2009 |
josephb | Update | Computers | Reboot of dcuepics helped, c1susvme1 having problems | Power cycling c1dcuepics seems to have fixed the EPICs channel problems, and c1lsc, c1asc, and c1iovme are talking again.
I burt restored c1iscepics and c1Iosepics from the snapshot at 6 am this morning.
However, c1susvme1 never came back after the last power cycle of its crate that it shared with c1susvme2. I connected a monitor and keyboard per the reboot instructions. I hit ctrl-x, and it proceeded to boot, however, it displays that there's a media error, PXE-E61, suggests testing the cable, and only offers an option to reboot. From a cursory inspection of the front, the cables seem to look okay. Also, this machine had eventually come back after the first power cycle and I'm pretty sure no cables were moved in between.
|
2069
|
Thu Oct 8 14:41:46 2009 |
jenne | Update | Computers | c1susvme1 is back online |
Quote: |
Power cycling c1dcuepics seems to have fixed the EPICs channel problems, and c1lsc, c1asc, and c1iovme are talking again.
I burt restored c1iscepics and c1Iosepics from the snapshot at 6 am this morning.
However, c1susvme1 never came back after the last power cycle of its crate that it shared with c1susvme2. I connected a monitor and keyboard per the reboot instructions. I hit ctrl-x, and it proceeded to boot, however, it displays that there's a media error, PXE-E61, suggests testing the cable, and only offers an option to reboot. From a cursory inspection of the front, the cables seem to look okay. Also, this machine had eventually come back after the first power cycle and I'm pretty sure no cables were moved in between.
|
I had a go at trying to bring c1susvme1 back online. The first few times I hit the physical reset button, I saw the same error that Joe mentioned, about needing to check some cables. I tried one round of rebooting c1sosvme, c1susvme2 and c1susvme1, with no success. After a few iterations of jiggle cables/reset button/ctrl-x on c1susvme1, it came back. I ran the startup.cmd script, and re-enabled the suspensions, and Mode Cleaner is now locked. So, all systems are back online, and I'm crossing my fingers and toes that they stay that way, at least for a little while. |
2080
|
Mon Oct 12 14:51:41 2009 |
rob | Update | Computers | c1susvme2 timing problems update update update |
Quote: |
It got worse again, starting with locking last night, but it has not recovered. Attached is a 3-day trend of SRM cpu load showing the good spell.
|
Last week, Alex recompiled the c1susvme2 code without the decimation filters for the OUT16 channels, so these channels are now as aliased as the rest of them. This appears to have helped with the timing issues: although it's not completely cured it is much better. Attached is a five day trend. |
Attachment 1: srmcpu.png
|
|
2106
|
Fri Oct 16 16:44:39 2009 |
Alberto, Sanjit | Update | Computers | elog restarted | This afternoon the elog crashed. We just restarted it. |
2179
|
Thu Nov 5 12:34:26 2009 |
kiwamu | Update | Computers | elog rebooted | I found that the elog had crashed. I restarted the elog daemon about 10 minutes ago. |
2182
|
Thu Nov 5 16:30:56 2009 |
pete | Update | Computers | moving megatron | Joe and I moved megatron and its associated IO chassis from 1Y3 to 1Y9, in preparation for RCG tests at ETMY. |
2183
|
Thu Nov 5 16:41:14 2009 |
josephb | Configuration | Computers | Megatron's personal network | In investigating why megatron wouldn't talk to the network, I re-discovered the fact that it had been placed on its own private network to avoid conflicts with the 40m's test point manager. So I moved the linksys router (model WRT310N V2) down to 1Y9, plugged megatron into a normal network port, and connected its internet port to the rest of the gigabit network.
Unfortunately, megatron still didn't see the rest of the network, and vice-versa. I brought out my laptop and started looking at the settings. It had been configured with the DMZ zone on for 192.168.1.2, which was Megatron's IP, so communications should flow through the router. Turns out it needs the dhcp server on the gateway router (131.215.113.2) to be on for everyone to talk to each other. However, this may not be the best practice. It'd probably be better to set the router IP to be fixed, and turn off the dhcp server on the gateway. I'll look into doing this tomorrow.
Also during this I found the DNS server running on linux1 had its IP to name and name to IP files in disagreement on what the IP of megatron should be. The IP to name claimed 131.215.113.95 while the name to IP claimed 131.215.113.178. I set it so both said 131.215.113.178. (These are in /var/named/chroot/var/ directory on linux1, the files are 113.215.131.in-addr.arpa.zone and martian.zone - I modified the 113.215.131.in-addr.arpa.zone file). This is the dhcp served IP address from the gateway, and in principle could change or be given to another machine while the dhcp server is on. |
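A disagreement like the one between the forward (name to IP) and reverse (IP to name) zone files can be caught with a simple cross-check. Below is a minimal sketch, assuming the two zone files have already been reduced to plain dictionaries; the zone-file parsing itself is left out, and the function name is hypothetical.

```python
def find_dns_mismatches(name_to_ip, ip_to_name):
    """Cross-check forward and reverse DNS tables.

    Returns (forward_problems, reverse_problems):
      forward_problems: (name, ip, reverse_name) where the reverse table
                        does not map ip back to name
      reverse_problems: (ip, name, forward_ip) where the forward table
                        does not map name back to ip
    """
    forward_problems = []
    for name, ip in sorted(name_to_ip.items()):
        back = ip_to_name.get(ip)
        if back != name:
            forward_problems.append((name, ip, back))

    reverse_problems = []
    for ip, name in sorted(ip_to_name.items()):
        fwd = name_to_ip.get(name)
        if fwd != ip:
            reverse_problems.append((ip, name, fwd))
    return forward_problems, reverse_problems


if __name__ == "__main__":
    # The megatron addresses quoted in the entry above, before the fix.
    forward = {"megatron": "131.215.113.178"}
    reverse = {"131.215.113.95": "megatron"}
    print(find_dns_mismatches(forward, reverse))
```

Both lists come back empty once the two files agree.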
2187
|
Fri Nov 6 00:23:34 2009 |
Alberto | Configuration | Computers | Elog just rebooted | The elog just crashed and I rebooted it |
2190
|
Fri Nov 6 07:55:59 2009 |
steve | Update | Computers | RFMnetwork is down | The RFMnetwork is down. MC2 sus damping restored. |
2192
|
Fri Nov 6 10:35:56 2009 |
josephb | Update | Computers | RFM reboot fest and re-enabled ITMY coil drivers | As noted by Steve, the RFM network was down this morning. I noticed that the c1susvme1 sync counter was pegged at 16384, so I decided to start with reboots in that vicinity.
After power cycling crates containing c1sosvme, c1susvme1, and c1susvme2 (since the reset buttons didn't work) only c1sosvme and c1susvme2 came back normally. I hooked up a monitor and keyboard to c1susvme1, but saw nothing. I power cycled the c1susvme crate again, and this time I watched it boot properly. I'm not sure why it failed the first time.
The RFM network is now operating normally. I have re-enabled the watchdogs again after having turned them off for the reboots. Steve and I also re-enabled the ITMY coil drivers when I noticed them not damping once the watch dogs were re-enabled. The manual switches had been set to disabled, so we re-enabled them. |
2195
|
Fri Nov 6 17:04:01 2009 |
josephb | Configuration | Computers | RFM and Megatron | I took the RFM 5565 card dropped off by Jay and installed it into megatron. It is not very secure, as it was too tall for the slot and could not be locked down. I did not connect the RFM fibers at this point, so just the card is plugged in.
Unfortunately, on power up, and immediately after the splash screen I get "NMI EVENT!" and "System halted due to fatal NMI".
The status light on the RFM light remains a steady red as well. There is a distinct possibility the card is broken in some way.
The card is a VMIPMC-5565 (which is the same as the card used by the ETMY front end machine). We should get Alex to come in and look at it on Monday, but we may need to get a replacement. |
2196
|
Fri Nov 6 18:02:22 2009 |
josephb | Update | Computers | Elog restarted | While I was writing up an elog entry, the elog died again, and I restarted it. Not sure what caused it to die since no one was uploading to it at the time. |
2197
|
Fri Nov 6 18:13:34 2009 |
josephb | Update | Computers | Megatron woes | I have removed the RFM card from Megatron and left it (along with all the other cables and electronics) on the trolley in front of the 1Y9 rack.
Megatron proceeded to boot normally up until it started loading Centos 5. During the linux boot process it checks the file systems. At this point we have an error:
/dev/VolGroup00/LogVol00 contains a file system with errors, check forced
Error reading block 28901403 (Attempt to read block from filesystem resulted short read) while doing inode scan.
/dev/VolGroup00/LogVol00 Unexpected Inconsistency; RUN fsck MANUALLY
So I ran fsck manually to see if I could get more information. fsck reported that it couldn't read block 28901403 (due to a short read) and asked whether to ignore the error: ignore(y)? I ignored it (by hitting space), but unfortunately hit the key an additional time. The next question it asked was force rewrite(y)? So I apparently forced a rewrite of that block. On further ignores (but no forced rewrites) I continued seeing short read errors at 28901404, *40, *41, *71, *512, *513, etc., so not totally contiguous. Each iteration takes about 5-10 seconds. At this point I rebooted, but the same problem happened again, although it started at 28901404 instead of 28901403. So apparently the forced rewrite fixed something, but I don't know if this is the best way of going about this. I'm just wondering if there are any other tricks I can try before I start rewriting random blocks on the hard drive. I also don't know how widespread this problem is and how long it might take to complete (if it's a large swath of the hard drive and it takes 10 seconds for each bad block, it might take a while).
So for the moment, megatron is not functional. Hopefully I can get some advice from Alex on Monday (or from anyone else who wants to chime in). It may wind up being easiest to just wipe the drive and re-install real time linux, but I'm no expert at that.
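To put a number on "it might take a while", here is a back-of-the-envelope sketch using the 5-10 seconds per unreadable block observed above. The block counts in the example loop are purely hypothetical, since the real number of bad blocks is unknown.

```python
def fsck_hours(bad_blocks, sec_per_block=(5.0, 10.0)):
    """Return (low, high) estimate in hours to step through bad_blocks blocks."""
    low, high = sec_per_block
    return bad_blocks * low / 3600.0, bad_blocks * high / 3600.0


if __name__ == "__main__":
    # Hypothetical counts, just to show how the time scales.
    for n in (100, 10_000, 1_000_000):
        lo, hi = fsck_hours(n)
        print(f"{n:>9} blocks: {lo:9.1f} to {hi:9.1f} hours")
```

At 10 seconds per block, a million bad blocks works out to more than a hundred days, which would argue for swapping the drive rather than stepping through fsck by hand.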
|
2198
|
Fri Nov 6 18:52:09 2009 |
pete | Update | Computers | RCG ETMY plan | Koji, Joe, and I are planning to try controlling the ETMY, on Monday or Tuesday. Our plan is to try to do this with megatron out of the RFM loop. The RCG system includes pos, pit, yaw, side, and oplevs. I will use matrix elements as currently assigned (i.e. not the ideal case of +1 and -1 everywhere). I will also match channel names to the old channels. We could put buttons on the medm screen to control the analog DW by hand.
This assumes we can get megatron happy again, after the unhappy RFM card test today. See Joe's elog immediately before this one.
We first plan to test that the ETMY watchdog can disable the RCG frontend, by using a pos step (before connecting to the suspension). Hopefully we can make it work without the RFM. Otherwise I think we'll have to wait for a working RFM card.
We plan to disable the other optics. We will disable ETMY, take down the ETMY frontend, switch the cables, and start up the new RCG system. If output looks reasonable we will enable the ETMY via the watchdog. Then I suppose we can put in some small steps via the RCG controller and see if it damps.
Afterwards, we plan to switch everything back. |
2212
|
Mon Nov 9 13:22:08 2009 |
josephb,alex | Update | Computers | Megatron update | Alex and I took a look at megatron this morning, and it was in the same state I left it on Friday, with file system errors. We were able to copy the advLIGO directory Peter had been working in to Linux1, so it should be simple to restore the code. We then tried just running fsck and overwriting bad sectors, but after about 5 minutes it was clear it could potentially take a long time (5-10 seconds per unreadable block, with an unknown number of blocks, possibly tens or millions). The decision was made to simply replace the hard drive.
Alex is of the opinion that the hard drive failure was a coincidence. Or rather, he can't see how the RFM card could have caused this kind of failure.
Alex went to Bridge to grab a usb to sata adapter for a new hard drive, and was going to copy a duplicate install of the OS onto it, and we'll try replacing the current hard drive with it. |
2215
|
Mon Nov 9 14:59:34 2009 |
josephb, alex | Update | Computers | The saga of Megatron continues | Apparently the random file system failure on megatron was unrelated to the RFM card (or at least unrelated to the physical card itself; it's possible I did something while installing it, however unlikely).
We installed a new hard drive, with a duplicate copy of RTL and assorted code stolen from another computer. We still need to get the host name and a variety of little details straightened out, but it boots and can talk to the internet. For the moment though, megatron thinks its name is scipe11.
You still use ssh megatron.martian to log in though.
We installed the RFM card again, and saw the exact same error as before. "NMI EVENT!" and "System halted due to fatal NMI".
Alex has hypothesized that the interface card that the actual RFM card plugs into, and which provides the PCI-X connection, might be the wrong type, so he has gone back to Wilson house to look for a new interface card. If that doesn't work out, we'll need to acquire a new RFM card at some point.
After removing the RFM card, megatron booted up fine, and had no file system errors. So the previous failure was in fact coincidence.
|
2220
|
Mon Nov 9 18:27:30 2009 |
Alberto | Frogs | Computers | OMC DCPD Interface Box Disconnected from the power Supply | This afternoon I inadvertently disconnected one of the power cables coming from the power supply on the floor next to the OMC cabinet and going to the DCPD Interface Box.
Rob reconnected the cable as it was before. |
2221
|
Mon Nov 9 18:32:38 2009 |
rob | Update | Computers | OMC FE hosed | It won't start--it just sits at Waiting for EPICS BURT, even though the EPICS is running and BURTed.
[controls@c1omc c1omc]$ sudo ./omcfe.rtl
cpu clock 2388127
Initializing PCI Modules
3 PCI cards found
***************************************************************************
1 ADC cards found
ADC 0 is a GSC_16AI64SSA module
Channels = 64
Firmware Rev = 3
***************************************************************************
1 DAC cards found
DAC 0 is a GSC_16AO16 module
Channels = 16
Filters = None
Output Type = Differential
Firmware Rev = 1
***************************************************************************
0 DIO cards found
***************************************************************************
1 RFM cards found
RFM 160 is a VMIC_5565 module with Node ID 130
***************************************************************************
Initializing space for daqLib buffers
Initializing Network
Waiting for EPICS BURT
|
2222
|
Mon Nov 9 19:04:23 2009 |
rob | Update | Computers | OMC FE hosed |
Quote: |
It won't start--it just sits at Waiting for EPICS BURT, even though the EPICS is running and BURTed.
[controls@c1omc c1omc]$ sudo ./omcfe.rtl
cpu clock 2388127
Initializing PCI Modules
3 PCI cards found
***************************************************************************
1 ADC cards found
ADC 0 is a GSC_16AI64SSA module
Channels = 64
Firmware Rev = 3
***************************************************************************
1 DAC cards found
DAC 0 is a GSC_16AO16 module
Channels = 16
Filters = None
Output Type = Differential
Firmware Rev = 1
***************************************************************************
0 DIO cards found
***************************************************************************
1 RFM cards found
RFM 160 is a VMIC_5565 module with Node ID 130
***************************************************************************
Initializing space for daqLib buffers
Initializing Network
Waiting for EPICS BURT
|
From looking at the recorded data, it looks like the c1omc started going funny on the afternoon of Nov 5th, perhaps as a side-effect of the Megatron hijinks last week.
It works when megatron is shutdown. |
2224
|
Mon Nov 9 19:44:38 2009 |
rob, rana | Update | Computers | OMC FE hosed |
We found that someone had set the name of megatron to scipe11. This is the same name as the existing c1aux in the op440m /etc/hosts file.
We did a /sbin/shutdown on megatron and the OMC now boots.
Please: check to see that things are working right after playing with megatron or else this will sabotage the DR locking and diagnostics. |
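A pre-flight check along these lines could catch a name clash before a renamed machine is put on the network. This is only a sketch: it assumes a standard /etc/hosts layout (IP address followed by one or more names, `#` for comments), the function name is hypothetical, and the addresses in the example are documentation placeholders, not the real martian IPs.

```python
def conflicting_ips(hosts_text, hostname, own_ip):
    """Return IPs in a hosts file that already use hostname but are not own_ip."""
    wanted = hostname.lower()
    conflicts = []
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        ip, *names = line.split()
        if ip != own_ip and wanted in (n.lower() for n in names):
            conflicts.append(ip)
    return conflicts


if __name__ == "__main__":
    # Placeholder addresses; only the host names come from the entry above.
    hosts = (
        "192.0.2.10  scipe11 c1aux\n"
        "192.0.2.20  megatron   # RCG test machine\n"
    )
    print(conflicting_ips(hosts, "scipe11", "192.0.2.20"))
```

A non-empty result means another machine already owns the name, so the new host should not be brought up under it.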
2225
|
Tue Nov 10 10:51:00 2009 |
josephb, alex | Update | Computers | Megatron on, powercycled c1omc, and burt restored from 3am snapshot | Last night around 5pm or so, Alex had remotely logged in and made some fixes to megatron.
First, he changed the local name from scipe11 to megatron. There were no changes to the network; this was a purely local change. The name server running on Linux1 is what provides the name to IP conversions. Scipe11 and Megatron both resolve to distinct IPs. Given c1auxex wasn't reported to have any problems (and I didn't see any problems with it yesterday), this was not a source of conflict. It's possible that Megatron could get confused while in that state, but it would not have affected anything outside its box.
Just to be extra secure, I've switched megatron's personal router over from a DMZ setup to only forwarding port 22. I have also disabled the dhcp server on the gateway router (131.215.113.2).
Second, he turned the mdp and mdc codes on. This should not have conflicted with c1omc.
This morning I came in and turned megatron back on around 9:30 and began trying to replicate the problems from last night between c1omc and megatron. I called Alex and we rebooted c1omc while megatron was on, but not running any code, and without any changes to the setup (routers, etc). We were able to burt restore. Then we turned the mdp, mdc and framebuilder codes on, and again rebooted c1omc, which appeared to burt restore as well (I restored from 3 am this morning, which looks reasonable to me).
Finally, I made the changes mentioned above to the router setups in the hope that this will prevent future problems but without being able to replicate the issue I'm not sure. |
2228
|
Tue Nov 10 17:49:20 2009 |
Alberto | Metaphysics | Computers | Test Point Number Mapping | I found this interesting entry by Rana in the old (deprecated) elog : here
I wonder if Rolf has ever written the mentioned GUI that explained the rationale behind the test point number mapping.
I'm just trying to add the StochMon calibrated channels to the frames. Now I remember why I kept forgetting to do it... |
2231
|
Tue Nov 10 21:46:31 2009 |
rana | Summary | Computers | Test Point Number Mapping |
Quote: |
I found this interesting entry by Rana in the old (deprecated) elog : here
I wonder if Rolf has ever written the mentioned GUI that explained the rationale behind the test point number mapping.
I'm just trying to add the StochMon calibrated channels to the frames. Now I remember why I kept forgetting to do it...
|
As far as I know, the EPICS channels have nothing to do with test points. |
2253
|
Thu Nov 12 12:50:35 2009 |
Alberto | Update | Computers | StochMon calibrated channels added to the data trend | I added the StochMon calibrated channels to the data trend by including the following channel names in the C0EDCU.ini file:
[C1:IOO-RFAMPD_33MHZ_CAL]
[C1:IOO-RFAMPD_133MHZ_CAL]
[C1:IOO-RFAMPD_166MHZ_CAL]
[C1:IOO-RFAMPD_199MHZ_CAL]
Before saving the changes I committed C0EDCU.ini to the svn.
Then I restarted the frame builder so now the new channels can be monitored and trended. |
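Before restarting the frame builder, it can be worth confirming that every intended channel actually made it into the file. The sketch below assumes only what is shown above, namely that each channel appears as a bracketed `[CHANNEL_NAME]` line in C0EDCU.ini; the function name is hypothetical and any other structure in the real file is ignored.

```python
import re

SECTION_RE = re.compile(r"^\[([^\]]+)\]\s*$")


def missing_channels(ini_text, wanted):
    """Return the channels from wanted that have no [NAME] line in ini_text."""
    present = set()
    for line in ini_text.splitlines():
        match = SECTION_RE.match(line.strip())
        if match:
            present.add(match.group(1))
    return [ch for ch in wanted if ch not in present]


if __name__ == "__main__":
    wanted = [
        "C1:IOO-RFAMPD_33MHZ_CAL",
        "C1:IOO-RFAMPD_133MHZ_CAL",
        "C1:IOO-RFAMPD_166MHZ_CAL",
        "C1:IOO-RFAMPD_199MHZ_CAL",
    ]
    # Example text with one channel deliberately left out.
    ini_text = "\n".join(f"[{ch}]" for ch in wanted[:3])
    print(missing_channels(ini_text, wanted))
```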
2255
|
Thu Nov 12 15:40:27 2009 |
josephb, koji, peter | Update | Computers | ETMY and Megatron test take 1 | We connected megatron to the IO chassis, which in turn was plugged into the rest of the ITMY setup. We had manually turned the watchdogs off before we touched anything, to ensure we didn't accidentally drive the optic. The connections seemed to go smoothly.
However, on reboot of megatron with the IO chassis powered up, we were unable to actually start the code. (The subsystem has been renamed from SAS to TST, short for test). While starttst claimed to start the IOC Server, we couldn't find the process running, nor did the medm screens associated with it work.
As a sanity test, we tried running mdp, Peter's plant model, but even that didn't actually run, and it gave an odd error we hadn't seen before:
"epicsThreadOnceOsd epicsMutexLock failed."
Running startmdp a second time didn't give the error message, but still no running code. The mdp medm screens remained white.
We turned the IO chassis off and rebooted megatron, but we're still having the same problem.
Things to try tomorrow:
1) Try disconnecting megatron completely from the IO chassis and get it to a state identical to that of last night, when the mdp and mdc did run.
2) Confirm the .mdl files are still valid, and try rebuilding them |
2264
|
Fri Nov 13 09:47:18 2009 |
josephb | Update | Computers | Megatron status lights lit | Megatron's top fan, rear ps, and temperature front panel lights were all lit amber this morning. I checked the service manual, found at :
http://docs.sun.com/app/docs/prod/sf.x4600m2?l=en&a=view
According to the manual, this means a front fan failed, a voltage event occurred, and we hit a high temperature threshold. However, there was no failure light on any of the individual front fans (which should have been lit, given the front panel fan light). The lights remained on after I shut down megatron. After unplugging the power cords, waiting 30 seconds, and plugging them back in, the lights went off and stayed off. Megatron seems to come up fine.
I unplugged the IO chassis from megatron, rebooted, and tried to start Peter's plant model. However, it still prints that it's starting without actually starting. One thing I forgot to mention in the previous elog on the matter is that the local monitor prints "shm_open(): No such file or directory" every time we try to start one of these programs. |
2265
|
Fri Nov 13 09:54:14 2009 |
josephb | Configuration | Computers | Megatron switched to tcsh | I've changed megatron's controls account default shell to tcsh (like it was before). It now sources cshrc.40m in /cvs/cds/caltech/ correctly at login, so all the usual aliases and programs work without doing any extra work. |
2266
|
Fri Nov 13 10:28:03 2009 |
josephb, alex | Update | Computers | Megatron is back to its old self | I called Alex this morning and explained the problems with megatron.
Turns out that when he had been setting up megatron, he thought a startup script file, rc.local, was missing in the /etc directory. So he created it. However, the rc.local file in the /etc directory is normally just a link to the /etc/rc.d/rc.local file. So on startup (basically when we rebooted the machine yesterday), it was running an incorrect startup script file. The real rc.local includes the line:
/usr/bin/setup_shmem.rtl mdp mdc&
Hence the errors we were getting with shm_open(). We changed the file into a soft link and re-sourced the rc.local script, and mdp started right up. So we're back to where we were 2 nights ago (although we do have an RFM card in hand).
Update: The tst module wouldn't start, but after talking to Alex again, it seems that I need to add the module tst to the /usr/bin/setup_shmem.rtl mdp mdc& line in order for it to have a shared memory location set up for it. I have edited the file (/etc/rc.d/rc.local), adding tst at the end of the line. On reboot and running starttst, the code actually loads, although for the moment, I'm still getting blank white blocks on the medm screens. |
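The two checks from the entry above (is /etc/rc.local really a link to /etc/rc.d/rc.local, and does the setup_shmem.rtl line name every module that needs shared memory) can be sketched as below. The function names are hypothetical, and the parsing assumes the line looks exactly like the one quoted in the entry: the tool path, a space-separated module list, and an optional trailing `&`.

```python
import os

SHMEM_TOOL = "/usr/bin/setup_shmem.rtl"  # path as quoted in the entry


def shmem_modules(rc_local_text, tool=SHMEM_TOOL):
    """Return the module names on the first uncommented setup_shmem line, or []."""
    for line in rc_local_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#") or not stripped.startswith(tool):
            continue
        # Drop the tool path and the trailing '&', keep the module names.
        return stripped[len(tool):].replace("&", " ").split()
    return []


def is_link_to(path, target):
    """True if path is a symlink that resolves to the same file as target."""
    return os.path.islink(path) and os.path.realpath(path) == os.path.realpath(target)


if __name__ == "__main__":
    rc_link, rc_real = "/etc/rc.local", "/etc/rc.d/rc.local"
    print("rc.local is a link to rc.d/rc.local:", is_link_to(rc_link, rc_real))
    if os.path.exists(rc_real):
        with open(rc_real) as handle:
            configured = shmem_modules(handle.read())
        # Modules named in the entry: mdp, mdc, and the new tst.
        for module in ("mdp", "mdc", "tst"):
            print(module, "has shared memory configured:", module in configured)
```

Running it before starting a new module would have flagged both the stray regular file and the missing tst entry.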
2267
|
Fri Nov 13 14:04:27 2009 |
josephb, koji | Update | Computers | Updated wiki with RCG instructions/tips | I've placed some notes pertaining to what Koji and I have learned today about getting the RCG code working on the 40m wiki at:
http://lhocds.ligo-wa.caltech.edu:8000/40m/Notes_on_getting_the_CDS_Realtime_Code_Generator_working
We're still trying to fix the tst system; at the moment it's reporting an invalid number of DAQ channels, and it fails during DAQ initialization. (This is from the /cvs/cds/caltech/target/c1tst/log.txt file.) Note: This problem is only on megatron and is separate from the conventional DAQ system of the 40m.
cpu clock 2800014
Warning, could open `/rtl_mem_tst' read/write (errno=0)
configured to use 2 cards
Initializing PCI Modules
2 PCI cards found
***************************************************************************
1 ADC cards found
ADC 0 is a GSC_16AI64SSA module
Channels = 64
Firmware Rev = 512
***************************************************************************
1 DAC cards found
DAC 0 is a GSC_16AO16 module
Channels = 16
Filters = None
Output Type = Differential
Firmware Rev = 3
***************************************************************************
0 DIO cards found
***************************************************************************
0 IIRO-8 Isolated DIO cards found
***************************************************************************
0 IIRO-16 Isolated DIO cards found
***************************************************************************
0 Contec 32ch PCIe DO cards found
0 DO cards found
***************************************************************************
0 RFM cards found
***************************************************************************
Initializing space for daqLib buffers
Initializing Network
Found 1 frameBuilders on network
Waiting for EPICS BURT at 0.000000 and 0 ns 0x3c40c004
BURT Restore = 1
Waiting for Network connect to FB - 10
Reconn status = 0 1
Reconn Check = 0 1
Initialized servo control parameters.
DAQ Ex Min/Max = 1 32
DAQ Tp Min/Max = 10001 10094
DAQ XTp Min/Max = 10094 10144
DAQ buffer 0 is at 0x8819a020
DAQ buffer 1 is at 0x8839a020
daqLib DCU_ID = 10
DAQ DATA INFO is at 0x3e40f0a0
Invalid num daq chans = 0
DAQ init failed -- exiting |
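A log like the one above is quick to scan by hand, but a small script can pull out the card counts and the failure lines. This is only a sketch based on the line shapes visible in this one log ("N ... cards found", "Invalid ...", "... failed ..."); the function name `summarize_log` is hypothetical, and other versions of the code may print different messages.

```python
import re

# Matches lines such as "1 ADC cards found" or "0 IIRO-8 Isolated DIO cards found".
CARD_LINE = re.compile(r"^(\d+) (.+?) cards found$")


def summarize_log(log_text):
    """Return (card counts by type, list of lines that look like errors)."""
    cards = {}
    problems = []
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        match = CARD_LINE.match(line)
        if match:
            cards[match.group(2)] = int(match.group(1))
        elif "failed" in line.lower() or line.startswith("Invalid"):
            problems.append(line)
    return cards, problems


if __name__ == "__main__":
    # Log path as given in the entry above.
    with open("/cvs/cds/caltech/target/c1tst/log.txt") as handle:
        cards, problems = summarize_log(handle.read())
    print(cards)
    for line in problems:
        print("PROBLEM:", line)
```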
2268
|
Fri Nov 13 15:01:07 2009 |
Jenne | Update | Computers | Updated wiki with RCG instructions/tips |
Quote: |
I've placed some notes pertaining to what Koji and I have learned today about getting the RCG code working on the 40m wiki at:
http://lhocds.ligo-wa.caltech.edu:8000/40m/Notes_on_getting_the_CDS_Realtime_Code_Generator_working
We're still trying to fix the tst system; at the moment it's reporting an invalid number of DAQ channels, and it fails during DAQ initialization. (This is from the /cvs/cds/caltech/target/c1tst/log.txt file.)
|
Dmass tells me that you have to record at least one channel, i.e., at least one channel in your .ini file must be set to acquire; otherwise the DAQ will flip out. It seems to be unhappy when you're not asking it to do things. |
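The rule above can be checked before starting the code. The sketch below assumes the DAQ .ini file is INI-style, with an optional `[default]` section and a per-channel `acquire` key where any value other than 0 means the channel is recorded; that layout, the function name, and the sample channel names are assumptions and should be compared against a real file.

```python
import configparser


def acquired_channels(ini_text):
    """Return the channel sections whose effective 'acquire' value is not 0."""
    parser = configparser.ConfigParser(
        delimiters=("=",), interpolation=None, strict=False
    )
    parser.read_string(ini_text)
    # Channels without their own 'acquire' key inherit the [default] value.
    default_value = parser.get("default", "acquire", fallback="0")
    names = []
    for section in parser.sections():
        if section == "default":
            continue
        value = parser.get(section, "acquire", fallback=default_value)
        if value.strip() != "0":
            names.append(section)
    return names


if __name__ == "__main__":
    # Made-up channel names for illustration only.
    sample = (
        "[default]\nacquire=0\n"
        "[C1:TST-EXAMPLE_A]\n"
        "[C1:TST-EXAMPLE_B]\nacquire=1\n"
    )
    recorded = acquired_channels(sample)
    print(recorded if recorded else "No channel is set to acquire")
```

An empty result corresponds to the "Invalid num daq chans = 0" case discussed in this thread.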
|