ID | Date | Author | Type | Category | Subject
1805 | Wed Jul 29 12:14:40 2009 | pete | Update | Computers | RCG work
Koji, Pete
Yesterday, Jay brought over the IO box for megatron and got it working. We plan to firewall megatron this afternoon, with the help of Jay and Alex, so we can set up GDS there and play without worrying about breaking things. In the meantime, we went to Wilson House to get some breakout boards so we can take transfer functions with the SR785 for an ETMX controller. We put in a sine wave, and all looks good on the auto-generated EPICS screens with an "empty" system (no filters on). Next we'll load in filters and take transfer functions.
Unfortunately we promised to return the breakout boards by 1pm today. This is because, according to denizens of Wilson House, Osamu "borrowed" all their breakout boards and these were the last two! If we can't locate Osamu's cache, they expect to have more in a day or two.
Here is the transfer function of the through filter running at 16 kHz sampling. It looks fine except that the DC gain is ~0.8. Koji is going to characterize the digital downsampling filter to compare it with the generated code and the filter coefficients.
Attachment 1: TF090729_1.png
Attachment 2: TF090729_1.png
1809 | Wed Jul 29 19:31:17 2009 | rana | Configuration | Computers | elog restarted
Just now found it dead. Restarted it. Is our elog backed up in the daily backups? |
1819 | Mon Aug 3 13:47:42 2009 | pete | Update | Computers | RCG work
Alex has firewalled megatron. We have started a framebuilder there and added testpoints. Now it is possible to take transfer functions with the shared memory MDC+MDP sandbox system. I have also copied filters into MDC (the controller) and made a really ugly medm master screen for the system, which I will show to no one. |
1826 | Tue Aug 4 13:40:17 2009 | pete | Update | Computers | RCG work - rate
Koji, Pete
Yesterday we found that the channel C1:MDP-POS_EXC looked distorted and had what appeared to be doubled frequency components in dataviewer. This was because the dcu_rate in the file /caltech/target/fb/daqdrc was set to 16K while the adl file was set to 32K. When daqdrc was corrected, the problem was fixed. I am going to recompile and run all these models at 16K. Once the 40m moves over to the new front end system, we may want to take advantage of the faster speeds, but it's probably a good idea to get everything working at 16K first. |
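For reference, a quick way to catch this kind of rate mismatch is just to grep the file named above (nothing beyond that path is assumed here):
grep -n dcu_rate /caltech/target/fb/daqdrc   # rate the framebuilder expects; should match the model's sampling rate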
1827 | Tue Aug 4 15:48:25 2009 | Jenne | Update | Computers | mini boot fest
Last night Rana noticed that the overflows on the ITM and ETM coils were a crazy huge number. Today I rebooted c1dcuepics, c1iovme, c1sosvme, c1susvme1 and c1susvme2 (in that order). Rob helped me burt restore losepics and iscepics, which needs to be done whenever you reboot the epics computer.
Unfortunately this didn't help the overflow problem at all. I don't know what to do about that. |
1828 | Tue Aug 4 16:12:27 2009 | rob | Update | Computers | mini boot fest
Quote: |
Last night Rana noticed that the overflows on the ITM and ETM coils were a crazy huge number. Today I rebooted c1dcuepics, c1iovme, c1sosvme, c1susvme1 and c1susvme2 (in that order). Rob helped me burt restore losepics and iscepics, which needs to be done whenever you reboot the epics computer.
Unfortunately this didn't help the overflow problem at all. I don't know what to do about that.
|
Just start by resetting them to zero. Then you have to figure out what's causing them to saturate by watching time series and looking at spectra. |
1829 | Tue Aug 4 17:51:25 2009 | pete | Update | Computers | RCG work
Koji, Peter
We put a simple pendulum into the MDP model, and everything communicates. We're still having some kind of testpoint (TP) or DAQ problem, so we're still in debugging mode. We went back to 32K in the .adl's; when driving MDP, the MDC-ETMX_POS_OUT signal is nasty: it follows the sine wave envelope but goes to zero 16 times per second.
The breakout boards have arrived. The plan is to fix this daq problem, then demonstrate the model MDC/MDP system. Then we'll switch to the "external" system (called SAM) and match control TF to the model. Then we'd like to hook up ETMX, and run the system isolated from the rest of the IFO. Finally we'd like to tie it into the IFO using reflective memory. |
1831 | Wed Aug 5 07:33:04 2009 | steve | DAQ | Computers | fb40m is down
1832 | Wed Aug 5 09:25:57 2009 | Alberto | DAQ | Computers | fb40m is up
FB40m up and running again after restarting the DAQ. |
1837 | Wed Aug 5 15:57:05 2009 | Alberto | Configuration | Computers | PMC MEDM screen changed
I added a clock to the PMC medm screen.
I made a backup of the original file in the same directory and named it *.bk20090805 |
1839 | Wed Aug 5 17:41:54 2009 | pete | Update | Computers | RCG work - daq fixed
The DAQ on megatron was nuts. Alex and I discovered that there was no GDS installation for site_letter=C (i.e. Caltech), so the default M (for MIT) was being used. Apparently we are the first Caltech installation. We added the appropriate line to the RCG Makefile and recompiled and reinstalled (at 16K). Now DV looks good on MDP and MDC, and I made a transfer function that replicates the bounce-roll filter. So DTT works too. |
1854 | Fri Aug 7 13:42:12 2009 | ajw | Omnistructure | Computers | backup of frames restored
Ever since July 22, the backup script that runs on fb40m has failed to ssh to ldas-cit.ligo.caltech.edu to back up our trend frames and /cvs/cds.
This was a new failure mode which the scripts didn't catch, so I only noticed it when fb40m was rebooted a couple of days ago.
Alex fixed the problem (RAID array was configured with the wrong IP address, conflicting with the outside world), and I modified the script ( /cvs/cds/caltech/scripts/backup/rsync.backup ) to handle the new directory structure Alex made.
Now the backup is current and the automated script should keep it so, at least until the next time fb40m is rebooted...
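For context, a minimal sketch of the kind of rsync-over-ssh transfer the backup script performs; the remote user, remote directory, and local frame path below are assumptions for illustration only, and the real logic lives in /cvs/cds/caltech/scripts/backup/rsync.backup:
# back up trend frames and /cvs/cds to the LDAS archive (illustrative paths)
for SRC in /frames/trend /cvs/cds ; do
    rsync -a -e ssh "$SRC" controls@ldas-cit.ligo.caltech.edu:/archive/40m/ \
        || echo "rsync of $SRC failed on $(date)"
done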
1856 | Fri Aug 7 16:00:17 2009 | pete | Update | Computers | RCG work. MDC MDP open loop transfer function
Today I was able to make a low frequency transfer function with DTT on megatron. There seems to have been a timing problem; perhaps Alex fixed it, or it is intermittent.
I have attached the open loop transfer function for the un-optimized system, which is at least stable to step impulses with the current filters and gains. The next step is to optimize, transfer this knowledge to the ADC/DAC version, and hook it up to the isolated ETMX. |
Attachment 1: tf_au_natural.pdf
1870 | Sun Aug 9 16:32:18 2009 | rana | Update | Computers | RCG work. MDC MDP open loop transfer function
This is very nice. We have, for the first time, a real time plant with which we can test our changes of the control system. From my understanding, we have a control system with the usual POS/PIT/YAW matrices and filter banks. The outputs go to a separate real-time system which is running something similar and where we have loaded the pendulum TF as a filter. Cross-couplings, AA & AI filters, and saturations to come later.
The attached plot is just the same as what Peter posted earlier, but with more resolution. I drove at the input to the SUSPOS filter bank and measured the open loop with the loop closed. The loop wants an overall gain of -0.003 or so to be stable. |
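(A note on the measurement trick, added here for readers unfamiliar with it; the sign bookkeeping is the usual convention and is assumed, not taken from the entry: with the excitation injected at the SUSPOS input, call the signal just after the injection point A and the loop signal returning just before it B. Since B = G*A, where G is the open loop transfer function, the measured B/A ratio gives G directly even while the loop stays closed.)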
Attachment 1: a.png
1879 | Mon Aug 10 17:36:32 2009 | pete | Update | Computers | RCG work. PIT, YAW, POS in MDP/MDC system
I've added the PIT and YAW dofs to the MDC and MDP systems. The pendula frequencies in MDP are 0.8, 0.5, 0.6 Hz for POS, PIT, and YAW respectively. The three dofs are linear and uncoupled, and stable, but there is no modeled noise in the system (yet) and some gains may need bumping up in the presence of noise. The MDC filters are identical for each dof (3:0.0 and Cheby). The PIT and YAW transfer functions look pretty much like the one Rana recently took of POS, but of course with the different pendulum frequencies. I've attached one for YAW. |
Attachment 1: mdcmdpyaw.jpg
1881 | Mon Aug 10 17:49:10 2009 | pete | Update | Computers | RCG work - plans
Pete, Koji
We discussed a preliminary game plan for this project. The thing I really want to see is an ETMX RCG controller hooked into the existing frontend via reflective memory, with the 40 m behaving normally under this hybrid system; my list is geared toward that goal. I suspect the list may cause controversy.
+ copy the MDC filters into SAM, and make sure everything looks good there with DTT and SR785.
+ get interface / wiring boards from Wilson House, to go between megatron and the analog ETMX system
+ test tying the ETMX pendulum and bare-bones SAM together (use existing watchdogs, and "bare-bones" needs defining)
+ work some reflective memory magic and create the hybrid frontend
In parallel with the above, the following should also happen:
+ MEDM screen design
+ add non-linear bits to the ETMX MDP/MDC model system
+ make game plan for the rest of the RCG frontend |
1887 | Tue Aug 11 23:17:21 2009 | rana | Summary | Computers | Nodus rebooted / SVN down
Looks like someone rebooted nodus at ~3 PM today but did not elog it. Also the SVN is not running. Why? |
1890 | Wed Aug 12 10:35:17 2009 | jenne | Summary | Computers | Nodus rebooted / SVN down
Quote: |
Looks like someone rebooted nodus at ~3 PM today but did not elog it. Also the SVN is not running. Why?
|
The Nodus business was me... my bad. Nodus and the elog were both having a bad day (we couldn't ssh into nodus from op440m, which doesn't depend on the names server), so I called Alex, and he fixed things, although I think all he did was reboot. I then restarted the elog per the instructions on the wiki.
1892 | Wed Aug 12 13:35:03 2009 | josephb, Alex | Configuration | Computers | Tested old Framebuilder 1.5 TB raid array on Linux1
Yesterday, Alex attached the old frame builder 1.5 TB raid array to linux1, and tested to make sure it would work on linux1.
This morning he tried to start a copy of the current /cvs/cds structure, but realized that at the rate it was going it would take roughly 5 hours, so he stopped.
The copy is currently planned for this coming Friday morning. |
1893 | Wed Aug 12 15:02:33 2009 | Alberto | Configuration | Computers | elog restarted
In the last hour or so the elog crashed. I have restarted it. |
1901 | Fri Aug 14 10:39:50 2009 | josephb | Configuration | Computers | Raid update to Framebuilder (specs)
The RAID array servicing the Frame builder was finally switched over to a JetStor SATA 16-bay RAID array. Each bay contains a 1 TB drive. The RAID is configured such that 13 TB is available, and the rest is used for fault protection.
The old Fibrenetix FX-606-U4, a 5-bay RAID array with only 1.5 TB of space, has been moved over to linux1 and will be used to store /cvs/cds/.
This upgrade increases the lookback for all channels from 3-4 days to about 30 days. Final copying of the old data occurred on August 5th, 2009, and the switchover was made on that date. |
1902 | Fri Aug 14 14:19:25 2009 | Koji | Summary | Computers | nodus rebooted
nodus was rebooted by Alex at Fri Aug 14 13:53. I launched elogd.
cd /export/elog/elog-2.7.5/
./elogd -p 8080 -c /export/elog/elog-2.7.5/elogd.cfg -D |
1903 | Fri Aug 14 14:33:51 2009 | Jenne | Summary | Computers | nodus rebooted
Quote: |
nodus was rebooted by Alex at Fri Aug 14 13:53. I launched elogd.
cd /export/elog/elog-2.7.5/
./elogd -p 8080 -c /export/elog/elog-2.7.5/elogd.cfg -D
|
It looks like Alex also rebooted all of the control room computers. Or something. The alarm handler and StripTool aren't running... after I fix susvme2 (which was down when I got in earlier today), I'll figure out how to restart those. |
1904 | Fri Aug 14 15:20:42 2009 | josephb | Summary | Computers | Linux1 now has 1.5 TB raid drive
Quote: |
Quote: |
nodus was rebooted by Alex at Fri Aug 14 13:53. I launched elogd.
cd /export/elog/elog-2.7.5/
./elogd -p 8080 -c /export/elog/elog-2.7.5/elogd.cfg -D
|
It looks like Alex also rebooted all of the control room computers. Or something. The alarm handler and strip tool aren't running.....after I fix susvme2 (which was down when I got in earlier today), I'll figure out how to restart those.
|
Alex switched the mount point for /cvs/cds on linux1 to the 1.5 TB RAID array after he finished copying the data from the old drives. This required a reboot of linux1, and as a result the /cvs/cds mount points on the other computers became stale. The easiest fix he found was to reboot all of the control room machines. In addition, a reboot fest should probably happen in the near future for all the front end machines, since they will also have stale mount points from linux1.
The 1.5 TB RAID array is now mounted at /home on linux1, which was the old mount point of the ~300 GB drive. The old drive is now at /oldhome on linux1.
1905 | Fri Aug 14 15:29:43 2009 | Jenne | Update | Computers | c1susvme2 was unmounted from /cvs/cds
When I came in earlier today, I noticed that c1susvme2 was red on the DAQ screens. Since the vme computers always seem to be happier as a set, I hit the physical reset buttons on sosvme, susvme1 and susvme2. I then did the telnet or ssh in as appropriate for each computer in turn. sosvme and susvme1 came back just fine. However, I couldn't cd to /cvs/cds/caltech/target/c1susvme2 while ssh-ed in to susvme2. I could cd to /cvs/cds, and then did an ls, and it came back totally blank. There was nothing at all in the folder.
Yoichi showed me how to do 'df' to figure out what filesystems are mounted, and it looked as though the filesystem was mounted. But then Yoichi tried to unmount the filesystem, and it claimed that it wasn't mounted at all. We then remounted the filesystem, and things were good again. I was able to continue the regular restart procedure, and the computer is back up again.
Recap: c1susvme2 mysteriously got unmounted from /cvs/cds! But it's back, and the computers are all good again. |
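For future reference, a minimal sketch of the check-and-remount sequence described above; it assumes the /cvs/cds entry already exists in /etc/fstab on the machine in question:
df -h /cvs/cds        # does it show the NFS server, or just the local disk?
umount /cvs/cds       # may report "not mounted", as happened here
mount /cvs/cds        # remount using the fstab entry
ls /cvs/cds/caltech   # sanity check that the tree is visible again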
1906 | Fri Aug 14 15:32:50 2009 | Yoichi | HowTo | Computers | nodus boot procedure
The restart procedures for the various processes running on nodus are explained here:
http://lhocds.ligo-wa.caltech.edu:8000/40m/Computer_Restart_Procedures#nodus
Please go through those steps when you reboot nodus, or when you notice it has been rebooted, and then elog it.
I did these steps this time. |
1910 | Sat Aug 15 10:36:02 2009 | Alan | HowTo | Computers | nodus boot procedure
fb40m was also rebooted. I restarted the ssh-agent for backup of minute-trend and /cvs/cds. |
1911 | Sat Aug 15 18:35:14 2009 | Clara | Frogs | Computers | How far back is complete data really saved? (or, is the cake a lie?)
I was told that, as of last weekend, we now have the capability to save full data for a month, whereas before it was something like 3 days. However, my attempts to get the data from the accidentally-shorted EW2 channel in the Guralp box have all been epic failures. My other data is okay, despite my not saving it for several days after it was recorded. So, my question is, how long can the data actually be saved, and when did the saving capability change? |
1916 | Mon Aug 17 02:12:53 2009 | Yoichi | Summary | Computers | FE bootfest
Rana, Yoichi
All the FE computers went red this evening.
We power cycled all of them.
They are all green now.
Not related to this: the CRT display of op540m has not been working since Friday night.
We are not sure whether it is a failure of the display or of the graphics card.
Rana started alarm handler on the LCD display as a temporary measure. |
1934 | Fri Aug 21 17:49:47 2009 | Yoichi | Summary | Computers | Upgrade FE conceptual plan
I started to draw a conceptual diagram of the upgraded FE system.
http://nodus.ligo.caltech.edu:30888/FE/FE-Layout.html
(It takes some time to load the page for the first time)
In some places, tool tips will pop up when you hover the cursor over an object.
You can also click on the LSC block to see inside.
This is just a start point, so we should add more details.
The sources of the diagrams are in the SVN:
https://nodus.ligo.caltech.edu:30889/svn/trunk/docs/upgrade08/FE/
I used yEd (http://www.yworks.com/en/products_yed_about.html) to make the drawings. It is written in Java, so it works on many platforms. |
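To grab the diagram sources locally, standard svn usage with the URL above works; the destination directory name here is arbitrary:
svn checkout https://nodus.ligo.caltech.edu:30889/svn/trunk/docs/upgrade08/FE/ FE-diagrams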
Attachment 1: Picture_2.png
1936 | Mon Aug 24 10:43:27 2009 | Alberto | Omnistructure | Computers | RFM Network Failure
This morning I found all of the front end computers down. A failure of the RFM network had driven them all down.
I was about to restart them all, but it wasn't necessary: after I power cycled and restarted C1SOSVME, all the other computers and the RFM network came back to their green status on the MEDM screen. After that I just had to reset and then restart C1SUSVME1/2. |
1939 | Tue Aug 25 01:27:09 2009 | rana | Configuration | Computers | Raid update to Framebuilder (not quite)
Quote: |
The RAID array servicing the Frame builder was finally switched over to JetStor Sata 16 Bay raid array. Each bay contains a 1 TB drive. The raid is configured such that 13 TB is available, and the rest is used for fault protection.
The old Fibrenetix FX-606-U4, a 5 bay raid array which only had 1.5 TB space, has been moved over to linux1 and will be used to store /cvs/cds/.
This upgrade provides an increase in look up times from 3-4 days for all channels out to about 30 days. Final copying of old data occured on August 5th, 2009, and was switched over on that date.
|
Sadly, this was only true in theory and we didn't actually check to make sure anything happened.
We are not able to get lookback of more than ~3 days using our new huge disk. Doing a 'du -h' it seems that this is because we have not yet set the framebuilder to keep more than its old amount of frames. Whoever sees Joe or Alex next should ask them to fix us up. |
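A quick way to check how far back the frames actually go; the /frames/full location is the usual framebuilder layout and is an assumption here, not something confirmed in this entry:
du -sh /frames/full           # total disk used by full frames
ls /frames/full | head -n 1   # oldest GPS-prefixed directory still on disk
ls /frames/full | tail -n 1   # newest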
1944 | Tue Aug 25 21:26:12 2009 | Alberto | Update | Computers | elog restarted
I just found the elog down and I restarted it. |
1947 | Tue Aug 25 23:16:09 2009 | Alberto, rana | Configuration | Computers | elog moved in to the cvs path
In nodus, I moved the elog from /export to /cvs/cds/caltech. So now it is in the cvs path instead of a local directory on nodus.
For a while, I'll leave a copy of the old directory containing the logbook subdirectory where it was. If everything works fine, I'll delete that.
I also updated the reboot instructions in the wiki. Some of it is also now in the SVN. |
1971 | Mon Sep 7 23:51:48 2009 | rana | Configuration | Computers | matlab installed: 64-bit linux
I have wiped out the 2008a install of 64-bit linux matlab and installed 2009a in its place. Enjoy. |
1981 | Thu Sep 10 15:55:44 2009 | Jenne | Update | Computers | c1ass rebooted
c1ass had not been rebooted since before the filesystem change, so when I was sshed into c1ass I got an error saying that the NFS was stale. Sanjit and I went out into the cleanroom and powercycled the computer. It came back just fine. We followed the instructions on the wiki, restarting the front end code, the tpman, and did a burt restore of c1assepics. |
1982 | Thu Sep 10 17:47:25 2009 | Jenne | Update | Computers | changes to the startass scripts
[Rana, Jenne]
While I was mostly able to restart the c1ass computer earlier today, the filter banks were acting totally weird. They were showing input excitations when we weren't applying any, and they were showing that the outputs were all zero, even though the inputs were non-zero and both the input and the output were enabled. The solution ended up being to use the second-to-last assfe.rtl backup file. Rana made a symbolic link from assfe.rtl to that backup, so that the startup.cmd script does not need to be changed whenever we alter the front end code.
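A minimal sketch of that symlink arrangement; the directory and the backup filename below are hypothetical, not the actual names used:
cd /cvs/cds/caltech/target/c1ass   # hypothetical location of the front end code
ln -sf assfe.rtl.bak.2 assfe.rtl   # point assfe.rtl at the second-to-last backup
ls -l assfe.rtl                    # confirm where the link points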
The startup_ass script in /caltech/target/gds/, which among other things starts the awgtpman, was changed to match the instructions on the wiki Computer Restart page. We now start up /opt/gds/awgtpman. This may or may not be a good idea, though, since we are currently not able to get channels on DTT and Dataviewer for the C1:ASS-TOP_PEM channels. When we try to run the awgtpman that the script used to start (in /caltech/target/gds/bin/), we get a "Floating Exception". We should figure this out, because /opt/gds/awgtpman does not offer 2 kHz as an option, which is the rate that the ASS_TOP stuff seems to run at.
The last fix made was to the screen snapshot buttons on the C1:ASS_TOP screen. When the screen was made, the buttons were copied from one of the other ASS screens, so the snapshots saved on the ASS_TOP screen were of the ASS_PIT screen. Not so helpful. Now the update snapshot button will actually update the ASS_TOP snapshot, and we can view past ASS_TOP shots. |
1989 | Thu Sep 17 14:17:04 2009 | rob | Update | Computers | awgtpman on c1omc failing to start
[root@c1omc controls]# /opt/gds/awgtpman -2 &
[1] 16618
[root@c1omc controls]# mmapped address is 0x55577000
32 kHz system
Spawn testpoint manager
no test point service registered
Test point manager startup failed; -1
[1]+ Exit 1 /opt/gds/awgtpman -2
1990 | Thu Sep 17 15:05:47 2009 | rob | Update | Computers | awgtpman on c1omc failing to start
Quote: |
[root@c1omc controls]# /opt/gds/awgtpman -2 &
[1] 16618
[root@c1omc controls]# mmapped address is 0x55577000
32 kHz system
Spawn testpoint manager
no test point service registered
Test point manager startup failed; -1
[1]+ Exit 1 /opt/gds/awgtpman -2
|
This turned out to be fallout from the /cvs/cds transition. Remounting and restarting fixed it. |
1994 | Wed Sep 23 17:32:37 2009 | rob | AoG | Computers | Gremlins in the RFM
A cosmic ray struck the RFM in the framebuilder this afternoon, causing hours of consternation. The whole FE system is just now coming back up, and it appears the mode cleaner is not coming back to the same place (alignment).
rob, jenne |
1996 | Wed Sep 23 20:02:11 2009 | Jenne | AoG | Computers | Gremlins in the RFM
Quote: |
A cosmic ray struck the RFM in the framebuilder this afternoon, causing hours of consternation. The whole FE system is just now coming back up, and it appears the mode cleaner is not coming back to the same place (alignment).
rob, jenne
|
Jenne, Rana, Koji
The mode cleaner has been realigned using a combination of techniques. First, we used ezcaservo to look at C1:SUS-MC(1,3)_SUS(DOF)_INMON and drive C1:SUS-MC(1,3)_(DOF)_COMM, to put the MC1 and MC3 mirrors back to their DriftMon values. Then we looked at the MC_TRANS_SUM on dataviewer and adjusted the MC alignment sliders by hand to maximize the transmission. Once the transmission was reasonably good, we saw that the spot was still a little high and the WFS QPDs weren't centered, so Koji and I went out and centered the WFS, and now the MC is back to its previous alignment. The MC_TRANS QPD looks nice and centered, so the pointing is back to where it used to be. |
2005 | Fri Sep 25 19:56:08 2009 | rana | Configuration | Computers | NTPD restarted on c1dcuepics (to fix the MEDM screen times)
restarted ntp on op440m using this syntax
>su
>/etc/init.d/xntpd start -c /etc/inet/ntp.conf
Getting the time on scipe25 (for the MEDM screen time) working was tougher. The /etc/ntp.conf file was pointing to the wrong server. Our NAT/firewall settings require some of our internal machines to go through the gateway to get NTPD to work. Curiously, some of the linux workstations don't have this issue.
The internal network machines should all have the same file as scipe25's /etc/ntp.conf:
server nodus
and here's how to check that it's working:
[root@c1dcuepics sbin]# ./ntpq -p
remote refid st t when poll reach delay offset jitter
==============================================================================
nodus.ligo.calt 0.0.0.0 16 u - 64 0 0.000 0.000 4000.00
*nodus.ligo.calt usno.pa-x.dec.c 2 u 29 64 377 1.688 -65.616 6.647
-lime7.adamantsy clock.trit.net 3 u 32 64 377 37.448 -72.104 4.641
-montpelier.ilan .USNO. 1 u 19 64 377 18.122 -74.984 8.305
+spamd-0.gac.edu nss.nts.umn.edu 3 u 28 64 377 72.086 -66.787 0.540
-mighty.poclabs. time.nist.gov 2 u 30 64 377 71.202 -61.127 4.067
+monitor.xenscal clock.sjc.he.ne 2 u 16 64 377 11.855 -67.105 6.368
2026 | Wed Sep 30 01:04:56 2009 | rob | Update | Computers | grief
Much grief. Somehow a burt restore of c1iscepics failed to work, and so the LSC XYCOM settings were not correct. This meant that the LSC whitening filter states were not being correctly set and reported, making it difficult to lock for at least the last week or so. |
Attachment 1: C1LSC_XYCOM.jpg
2028 | Wed Sep 30 12:21:08 2009 | Jenne | Update | Computers | restarted the elog
I'm blaming Zach on this one, for trying to upload a pic into the ATF elog (even though he claims it was small....) Blame assigned: check. Elog restarted: check. |
2041 | Fri Oct 2 14:52:55 2009 | rana | Update | Computers | c1susvme2 timing problems update
The attached plot shows the 200-day '10-minute' trend of the CPU meters and also the room temperature.
To my eye there is no correlation between the signals. It's clear that c1susvme2 (SRM LOAD) is going up, and there is no evidence that it's due to temperature.
Attachment 1: Untitled.png
2042 | Fri Oct 2 15:11:44 2009 | rob | Update | Computers | c1susvme2 timing problems update update
It got worse again, starting with locking last night, but it has not recovered. Attached is a 3-day trend of SRM cpu load showing the good spell. |
Attachment 1: srmcpu3.png
2067 | Thu Oct 8 11:10:50 2009 | josephb, jenne | Update | Computers | EPICs Computer troubles
At around 9:45 the RFM/FB network alarm went off, and I found c1asc, c1lsc, and c1iovme not responding.
I went out to hard restart them, and also c1susvme1 and c1susvme2 after Jenne suggested that.
c1lsc seemed to make a promising comeback initially, but not really. I was able to ssh in and run the start command. The green light under c1asc on the RFMNETWORK status page lit, but the reset and CPU usage information is still white, as if it's not connected. If I try to load an LSC channel (say, the PD5_DC monitor) as a testpoint in DTT it works fine, but the 16 Hz monitor version for EPICS is dead. The fact that we were able to ssh into it means the network is working at least somewhat.
I had to reboot c1asc multiple times (3 times total), waiting a full minute on the last power cycle, before being able to telnet in. Once I was able to get in, I restarted the startup.cmd, which did set the DAQ-STATUS to green for c1asc, but it's having the same lack of communication with EPICS as c1lsc.
c1iovme was rebooted; I was able to telnet in and start startup.cmd. The status light went green, but there are still no EPICS updates.
The crate containing c1susvme1 and c1susvme2 was power cycled. We were able to ssh into c1susvme1 and restart it, and it came back fully: status light, CPU load, and channels all working. However, c1susvme2 was still having problems, so I power cycled the crate again. This time c1susvme2 came back, its status light lit green, and its channels started updating.
At this point, lacking any better ideas, I'm going to do a full reboot, cycling c1dcuepics and proceeding through the restart procedures. |
2068 | Thu Oct 8 11:37:59 2009 | josephb | Update | Computers | Reboot of dcuepics helped, c1susvme1 having problems
Power cycling c1dcuepics seems to have fixed the EPICs channel problems, and c1lsc, c1asc, and c1iovme are talking again.
I burt restored c1iscepics and c1Iosepics from the snapshot at 6 am this morning.
However, c1susvme1 never came back after the last power cycle of its crate that it shared with c1susvme2. I connected a monitor and keyboard per the reboot instructions. I hit ctrl-x, and it proceeded to boot, however, it displays that there's a media error, PXE-E61, suggests testing the cable, and only offers an option to reboot. From a cursory inspection of the front, the cables seem to look okay. Also, this machine had eventually come back after the first power cycle and I'm pretty sure no cables were moved in between.
2069 | Thu Oct 8 14:41:46 2009 | jenne | Update | Computers | c1susvme1 is back online
Quote: |
Power cycling c1dcuepics seems to have fixed the EPICs channel problems, and c1lsc, c1asc, and c1iovme are talking again.
I burt restored c1iscepics and c1Iosepics from the snapshot at 6 am this morning.
However, c1susvme1 never came back after the last power cycle of its crate that it shared with c1susvme2. I connected a monitor and keyboard per the reboot instructions. I hit ctrl-x, and it proceeded to boot, however, it displays that there's a media error, PXE-E61, suggests testing the cable, and only offers an option to reboot. From a cursory inspection of the front, the cables seem to look okay. Also, this machine had eventually come back after the first power cycle and I'm pretty sure no cables were moved in between.
|
I had a go at trying to bring c1susvme1 back online. The first few times I hit the physical reset button, I saw the same error that Joe mentioned, about needing to check some cables. I tried one round of rebooting c1sosvme, c1susvme2 and c1susvme1, with no success. After a few iterations of jiggle cables/reset button/ctrl-x on c1susvme1, it came back. I ran the startup.cmd script, and re-enabled the suspensions, and Mode Cleaner is now locked. So, all systems are back online, and I'm crossing my fingers and toes that they stay that way, at least for a little while. |
2080 | Mon Oct 12 14:51:41 2009 | rob | Update | Computers | c1susvme2 timing problems update update update
Quote: |
It got worse again, starting with locking last night, but it has not recovered. Attached is a 3-day trend of SRM cpu load showing the good spell.
|
Last week, Alex recompiled the c1susvme2 code without the decimation filters for the OUT16 channels, so these channels are now as aliased as the rest of them. This appears to have helped with the timing issues: although it's not completely cured, it is much better. Attached is a five-day trend. |
Attachment 1: srmcpu.png