40m QIL Cryo_Lab CTN SUS_Lab CAML OMC_Lab CRIME_Lab FEA ENG_Labs OptContFac Mariner WBEEShop
  40m Log, Page 89 of 355  Not logged in ELOG logo
ID Date Author Type Categoryup Subject
  1370   Sun Mar 8 23:09:26 2009 ranaUpdateComputersNot even data retrieval working
Although getting the regular DAQ data works, we can't get any testpoints.

I tried restarting tpman several times; there's no inittab on fb40m for this so we should get Alex to set one up when he comes.
I also tried various power cycles and reboots: daqawg, daqctrl, etc. I also notice that Osamu's setup of new stuff is connected to
the same rack and power strips as all of our sensitive DAQ machines. We should find out if there was any hardware installed in the
last couple days; it would be easy to accidentally unplug or damage on of our fibers.

I moved the old tpman.log over to tpman.log.090308. It starts out with a header and then just lists when each TP is requested.

When restarting tpman it puts the following into the terminal:
fb:controls>./tpman &
[1] 1037
fb:controls>VMIC RFM 5565 (0) found, mapped at 0x2868c90
VMIC RFM 5579 (1) found, mapped at 0x2868c90
Could not open 5565 reflective memory in /dev/daqd-rfm1
16 kHz system
Spawn testpoint manager
Channel list length for node 0 is 4168
Test point manager (31001001 / 1): node 0
which is OK?; its the same startup outputs that are in the old log file. It would be nice if there was not and error message about the RFM.
Requesting new testpoints via tdsdata, dtt, or the diag command line doesn't seem to work. tpman doesn't spit anything out although 'tp show 0'
does show that the TP is selected.

Once Alex fixes the 'tpman' issue, we should make sure to put an inittab or startup script in there so that tpman writes a log
file and also archives its old log files upon a restart.
  1372   Mon Mar 9 10:59:05 2009 AlanOmnistructureComputersssh agent on fb40m restarted for backup

After the boot-fest, the nightly backup to Powell-Booth failed, and an automatic email got sent to me. I restarted the ssh agent, following the instructions in /cvs/cds/caltech/scripts/backup/000README.txt .

  1373   Mon Mar 9 11:09:33 2009 AlbertoUpdateComputersRe: Not even data retrieval working

Quote:
Although getting the regular DAQ data works, we can't get any testpoints.

I tried restarting tpman several times; there's no inittab on fb40m for this so we should get Alex to set one up when he comes.
I also tried various power cycles and reboots: daqawg, daqctrl, etc. I also notice that Osamu's setup of new stuff is connected to
the same rack and power strips as all of our sensitive DAQ machines. We should find out if there was any hardware installed in the
last couple days; it would be easy to accidentally unplug or damage on of our fibers.

I moved the old tpman.log over to tpman.log.090308. It starts out with a header and then just lists when each TP is requested.

When restarting tpman it puts the following into the terminal:
fb:controls>./tpman &
[1] 1037
fb:controls>VMIC RFM 5565 (0) found, mapped at 0x2868c90
VMIC RFM 5579 (1) found, mapped at 0x2868c90
Could not open 5565 reflective memory in /dev/daqd-rfm1
16 kHz system
Spawn testpoint manager
Channel list length for node 0 is 4168
Test point manager (31001001 / 1): node 0
which is OK?; its the same startup outputs that are in the old log file. It would be nice if there was not and error message about the RFM.
Requesting new testpoints via tdsdata, dtt, or the diag command line doesn't seem to work. tpman doesn't spit anything out although 'tp show 0'
does show that the TP is selected.

Once Alex fixes the 'tpman' issue, we should make sure to put an inittab or startup script in there so that tpman writes a log
file and also archives its old log files upon a restart.


Alex fixed the problem. It was caused by the awgtpman running on kami1.martian which conflicted with the tpman in fb0.

Killing awgtpman on kami1 allowed for the tpman on tp0 to work properly again.

If more test points are needed, Alex suggested to tune the GDS settings accordingly.
What this actually means, I still have to understand it.
  1374   Mon Mar 9 12:04:18 2009 YoichiUpdateComputersTPs and AWG are back
I had to do one more reboot of tpman and daqd to get the TPs working.
I confirmed the alignment scripts run fine.

Now the oplevs of some optics are largely mis-centered. Alberto and I will center them after lunch.
  1378   Mon Mar 9 19:27:16 2009 ranaConfigurationComputersMove of the CLIO Digital Controls test setup

Because of the network interference we've had from the CLIO system for the past 3-4 days, I asked the guys to remove

the test stand from the 40m lab area. It is now in the 40m control room. Since it needed an ethernet connection to get out

for some reason we've let them hook into GC. Also, instead of using a real timing signal slaved to the GPS, Jay suggested

just skipping it and having the Timing Slave talk to itself by looping back the fiber with the timing signal. Osamu will enter

more details, but this is just to give a status update.

  1381   Mon Mar 9 23:55:38 2009 OsamuDAQComputersbscteststand and kami1 outside martian

This morning there was a confliction of tpman running on fb40m and kami1. Alex fixed it temporary but Rana suggested it was better to move both PCs outside martian. We moved both PCs physically to the control room and connected to general network with a local router. I believe it won't conflict anymore but if you guess these PC might have trouble please feel free to shutdown.

 

Today's work summary:

 *connected expansion chassis to bscteststand

 *obtained signals on dataviewer, dtt for both realtime and past data on bscteststand with 64kHz timing signal

 

Questions:

Excitation channels are not shown, only "other" is shown.

qts.mdl should run with 16kHz but 16kHz timing causes a slow speed on dataviewer and failing data aquisition on dtt. We are using 64kHz timing but is it really correct?

  1404   Sun Mar 15 21:50:29 2009 Kakeru, Kiwamu, OsamuUpdateComputersSome computers are rebooted

We found c1lsc, c1iscex, c1iscey, c1susvme, c1asc and c1sosvme are dead.
We  turned off all watchdogs and turned off all lock of suspensions.
Then, I tried to reboot these machines from terminal, but I couldn't login to all of these machines.

So, we turned off and on key switches of these machines physically, and login to them to run startup scripts.

Then we turned on all watchdogs and restored all IFO.

Now they look like they are working fine.
 

  1457   Tue Apr 7 21:39:57 2009 YoichiConfigurationComputersLSC code recompiled with a fix for denormalization problem
This is not my work but I will put it for the record.

A few days ago, Rob recompiled the LSC code with the fix of the denormalization problem provided by Alex.
Since then, the LSC code has been working fine. I recognize that c1lsc is now less loaded.

I believe Rob only recompiled the LSC code, so there could still be the problem in the suspension controllers.
  1460   Wed Apr 8 18:18:33 2009 ranaConfigurationComputersLSC code recompiled with a fix for denormalization problem
Below is the link to the anti-denormalization technique that Rolf and Alex implemented at the sites,
that was pointed out by Chris Wipf from MIT:

http://www.musicdsp.org/files/denormal.pdf
  1467   Fri Apr 10 01:24:08 2009 ranaUpdateComputersallegra update (sort of)

I tried to play an .avi file on allegra. In a normal universe this would be easy, but because its linux I was foiled.

The default video player (Totem) doesn't play .avi or .wmv format. The patches for this work in Suse but not Fedora. Kubuntu but not CentOS, etc.I also tried installing Kplayer, Kaffeine, mplayer, xine, Aktion, Realplay, Helix, etc. They all had compatibility issues with various things but usuallylibdvdread or some gstreamer plugin.So I pressed the BIG update button. This has now started and allegra may never recover. The auto update wouldn't work in default mode becauseof the libdvdread and gstreamer-ugly plugins, so I unchecked those boxes. I think we're going to have this problem as long as we used any kind ofadvanced gstreamer stuff for the GigE cameras (which is unavoidable).

 

  1474   Sun Apr 12 01:19:30 2009 YoichiConfigurationComputersNew FE codes for suspensions not successful
Alex recompiled the suspension FE codes for c1susvme1 and c1susvme2 to fix the denormalization problem.
The new modules are in
/cvs/cds/caltech/users/alex/cds/rts/src/fe/40m/losLinux1.o
/cvs/cds/caltech/users/alex/cds/rts/src/fe/40m/losLinux2.o

I tried them today, but c1susvme1 did not work with the new code while c1susvme2 seemed to run ok.
So I reverted the modules (losLinux1.o and losLinux2.o) to the original ones.
The original modules are also backed up as losLinux1.o.11Apr09 and losLinux2.o.11Apr09 in the corresponding target directories.

I reported the problem to Alex.
  1479   Mon Apr 13 18:57:03 2009 AlbertoFrogsComputersGPIB/ETH Interface Troubles

I really don't understand why my programs that I used to use to get data from the HP Spectrum Analyzer and the Marconi frequency generator don't work anymore.

I spent hours trying to debug the code but I can't sort the problem out.

The main problem seem to be with the function recv from the socket library. Somehow it can't anymore get any data from the instruments. The thing I can't understand, though, is that if called directly from the python terminal it works fine!

In particular the problem is with the following lines in my code:

netSock.send("mkpk;mka?\n")
netSock.send("++read eoi\n")
tmp = netSock.recv(1024)

Tried a lot of tickering but it didn't work.

I attach the two scripts I've been using. One (sweepfrequencyPRC.py) calls the other (HP4395PRC.py).

They worked egregiously for weeks in the past. Don't know what happened since then.

Attachment 1: sweepfrequencyPRC.py
## sweepfrequency.py [-f filename] [-i ip_address] [-a startFreq] [-z endFreq] [-s stepFreq] [-m numAvg]
#
## This script sweeps the frequency of a Marconi local oscillator, within the range
## delimited by startFreq and endFreq, with a step set by stepFreq. An arbitary
## signal is monitored on a HP8590 spectrum analyzer and the scripts records the
## amplitude of the spectrum at the frequency injected by the Marconi at the moment.

## The GPIB address of the Marconi is assumed to be 17, that of the HP Spectrum Analyzer to be 18

## Alberto Stochino, October 2008
... 53 more lines ...
Attachment 2: HP8590PRC.py
# This function provides the measuremeent of the peak amplitude on the spectrum analyzer
# HP8590 analyzer while sweeping the excitation frequency on the function generator.
#
# Alberto Stochino 2008

import re
import sys
import math
from optparse import OptionParser
from socket import *
... 70 more lines ...
  1481   Tue Apr 14 12:10:11 2009 AlbertoFrogsComputersGPIB/ETH Interface Troubles

Quote:

I really don't understand why my programs that I used to use to get data from the HP Spectrum Analyzer and the Marconi frequency generator don't work anymore.

I spent hours trying to debug the code but I can't sort the problem out.

The main problem seem to be with the function recv from the socket library. Somehow it can't anymore get any data from the instruments. The thing I can't understand, though, is that if called directly from the python terminal it works fine!

In particular the problem is with the following lines in my code:

netSock.send("mkpk;mka?\n")
netSock.send("++read eoi\n")
tmp = netSock.recv(1024)

Tried a lot of tickering but it didn't work.

I attach the two scripts I've been using. One (sweepfrequencyPRC.py) calls the other (HP4395PRC.py).

They worked egregiously for weeks in the past. Don't know what happened since then.

This morning Joe looked at my code and made me notice that for some reason the query to the Spectrum Analyzer made by netSock.recv(1024) contained two answers. It was like the buffer contained the answer two different queries.

After some experiment I found that basically the GPIB interface wasn't switching from the "auto 1" to the "auto 0" mode as it should. I rewrote part of the code and that seemed have solved the problem.

Still don't understand why it used to work in the past and then it stopped.

  1483   Wed Apr 15 02:18:42 2009 ranaConfigurationComputersnodus vfstab changed for rigel

nodus was hanging because it was trying to mount the cit40m account from rigel and rigel was not responding.

Neither I nor Yoichi can recall what the cit40m account does for us on nodus anymore and so I commented it out of the nodus /etc/vfstab.

nodus may still need a boot to make it pay attention. I was unable to do a 'umount' to get rid of the rigel parasite. But mainly I don't want anything in

the 40m to depend on the LIGO GC system if at all possible.

  1508   Thu Apr 23 13:55:43 2009 josephb, peterUpdateComputersRCG example

We successfully compiled and installed the Real time Code Generator "Hello World" example (which is a skeleton for the ETMX suspension controller) on megatron.  In order to get it to compile, we had to add a flag indicating the computer is stand alone, and not using a myrinet card at the moment.  This was done by adding the shmem_daq = 1 flag to the cdsParameters module.  The symptom was it was unable to find gm.h (and there is no installed /opt/gm directory).

It is called "sam".  It was installed to /cvs/cds/caltech/target/sam, and produced medm screens in /cvs/cds/caltech/medm/c1/sam.  As nothing points to these, I figure it won't harm any of the current configuration, but lets us play around a bit.  If by some strange reason, these do cause problems, feel free to remove them.

  1551   Wed May 6 16:56:35 2009 rana, alex, joeConfigurationComputersdaqd log, cront, etc.
While Alex came over, we investigated the log file problems with DAQD and NDS on FB0. There was a lot of
the standard puzzling and mumbling, but eventually we saw that it doesn't create its log file and so it
doesn't write to it. The log file is /usr/controls/main_daqd.log. The other files called daqd.log.DATE
in the logs/ directory are actually not written to. Its awesome.

We also have put in a fix for the overflowing jobs/ directory. It gets a file written to it every time
you make and NDS request and our seisBLRMS has been overloading it. There's now a cron for it in the fb0
crontab which cleans out week-old files at 6:30 AM every day.

We also changed the time of the daily backup from 3:30 AM (when people are still working) to 5:50 AM
(by which time the seismic has ramped up and interferometerists should be asleep). I didn't like the
idea of a bandwidth hog nailing the framebuilder during the peak of interferometer work.

#
# Script to backup via rsync the most recent 40m minute trends and
# any changes to the /cvs/cds filesystem.
#
50 05 * * * /cvs/cds/caltech/scripts/backup/rsync.backup < /dev/null > /cvs/cds/caltech/scripts/\
backup/rsync.backup.log 2>&1

30 06 * * * find /usr/controls/jobs -mtime +7 -exec /bin/rm -f {} \;

seisBLRMS.m restarted on mafalda.
  1554   Thu May 7 12:21:36 2009 josephb, alexConfigurationComputersfb40m

Having determined that Rana (the computer) was having to many issues with testing the new Raid array due to age of the system, we proceeded to test on fb40m.

 

We brought it down and up several times between 11 and noon.  We  eventually were able to daisy chain the old raid and the new raid so that fb40m sees both.  At this time, the RAID arrays are still daisy chained, but the computer is setup to run on just the original raid, while the full 14 TB array is initialized (16 drives, 1 hot spare, RAID level 5 means 14 TB out of the 16 TB are actually available).  We expect this to take a few hours, at which point we will copy the data from the old RAID to the new RAID (which I also expect to take several hours).  In the meantime, operations should not be affected.  If it is, contact one of us.

 

 

  1555   Thu May 7 15:22:19 2009 josephb, albertoConfigurationComputersfb40m

Quote:

Having determined that Rana (the computer) was having to many issues with testing the new Raid array due to age of the system, we proceeded to test on fb40m.

 

We brought it down and up several times between 11 and noon.  We  eventually were able to daisy chain the old raid and the new raid so that fb40m sees both.  At this time, the RAID arrays are still daisy chained, but the computer is setup to run on just the original raid, while the full 14 TB array is initialized (16 drives, 1 hot spare, RAID level 5 means 14 TB out of the 16 TB are actually available).  We expect this to take a few hours, at which point we will copy the data from the old RAID to the new RAID (which I also expect to take several hours).  In the meantime, operations should not be affected.  If it is, contact one of us.

 

 

 

 

This afternoon the alignment script chrashed after returning sysntax errors. We found that the tpman wasn't running on the framebuilder becasue it had probably failed to get restarted in one of the several reboots executed in the morning by Alex and Jo.

Restarting the tpman was then sufficient for the alignment scripts to get back to work.

  1564   Fri May 8 10:05:40 2009 AlanOmnistructureComputersRestarted backup since fb40m was rebooted

Restarted backup since fb40m was rebooted.

  1574   Mon May 11 12:25:03 2009 josephb,AlexUpdateComputersfb40m down for patching

The 40m frame builder is currently being patched to be able utilize the full 14 TB of the new raid array (as opposed to being limited to 2 TB).  This process is expected to take several hours, during which the frame builder will be unavailable.

  1589   Fri May 15 14:05:14 2009 DmassHowToComputersHow To: Crash the Elog

The Elog started crashing last night. It turns out I was the culprit, and whenever I tried to upload a certain 500kb .png picture, it would die. It has happened both when choosing "upload" of a picture, and when choosing "submit" after successfully uploading a picture. Both culprits were ~500kb .png files.

  1622   Fri May 22 17:05:24 2009 rob, peteUpdateComputershard reboot of vertex suspension controllers

we did a hard reboot of c1susvme1, c1susvme2, c1sosvme, and c1susaux.  We are hoping this will fix some of the weird suspension issues we've been having (MC3 side coil, ITMX alignment).

  1623   Sun May 24 11:24:08 2009 robUpdateComputerselog restarted

I just restarted the elog.  It was crashed for unknown reasons.  The restarting instructions are in the wiki.

  1634   Sat May 30 12:36:52 2009 robUpdateComputersc1susvme2, c1iscex running late

c1susvme2 has been running just a bit late for about a week.  I rebooted it. 

The plot shows SRM_FE_SYNC, which is the number of times in the last second that c1susvme2 was late for the 16k cycle.   Similarly for ETMX.

 

Attachment 1: srmsync.jpg
srmsync.jpg
Attachment 2: etmxsync.jpg
etmxsync.jpg
  1635   Mon Jun 1 13:25:00 2009 robUpdateComputersc1susvme2, c1iscex running late

Quote:

c1susvme2 has been running just a bit late for about a week.  I rebooted it. 

The plot shows SRM_FE_SYNC, which is the number of times in the last second that c1susvme2 was late for the 16k cycle.   Similarly for ETMX.

 

 

The reboot appears to have worked.

Attachment 1: doublesync.jpg
doublesync.jpg
  1642   Tue Jun 2 23:12:08 2009 robConfigurationComputersntp on op440m

I restarted ntpd on op440m to solve a "synchronization error" that we were having in DTT.  I also edited the config file (/etc/inet/ntp.conf) to remove the lines referring to rana as an ntp server; now only nodus is listed.

To do this:

log in as root

/usr/local/bin/ntpd -c /etc/inet/ntp.conf

  1643   Tue Jun 2 23:53:12 2009 peteDAQComputersreset c1susvme1

rob, alberto, rana, pete

we reset this computer, which was out of sync (16384 in the FE_SYNC field instead of 0)

  1657   Fri Jun 5 16:45:28 2009 rob, peteHowToComputerstdsavg failure in cm_step script

Quote:

Quote:

the command

tdsavg 5 C1:LSC-PD4_DC_IN1

was causing grievous woe in the cm_step script.  It turned out to fail intermittently at the command line, as did other LSC channels.  (But non-LSC channels seem to be OK.)  So we power cycled c1lsc (we couldn't ssh).

Then we noticed that computers were out of sync again (several timing fields said 16383 in the C0DAQ_RFMNETWORK screen).  We restarted c1iscey, c1iscex, c1lsc, c1susvme1, and c1susvme2.  The timing fields went back to 0.  But the tdsavg command still  intermittently said "ERROR: LDAQ - SendRequest - bad NDS status: 13".

The channel C1:LSC-SRM_OUT16 seems to work with tdsavg every time.

Let us know if you know how to fix this. 

 

 Did you try restarting the framebuilder?

 

What you type is in bold:

op440m> telnet fb40m 8087

daqd> shutdown

 

Restarting the framebuilder didn't work, but the problem now appears to be fixed.

Upon reflection, we also decided to try killing all open DTT and Dataviewer windows.  This also involved liberal use of ps -ef to seek out and destroy all diag's, dc3's, framer4's, etc.

 

That may have worked, but it happened simultaneously to killing the tpman process on fb40m, so we can't be sure which is the actual solution.

 

To restart the testpoint manager:

what you type is in bold:

rosalba> ssh fb40m

fb40m~> pkill tpman

The tpman is actually immortal, like Voldemort or the Kurgan or the Cylons in the new BG.  Truly slaying it requires special magic, so the pkill tpman command has the effect of restarting it.

 

In the future, we should make it a matter of policy to close DTTs and Dataviewers when we're done using them, and killing any unattended ones that we encounter.

 

  1668   Thu Jun 11 14:54:18 2009 josephb, albertoUpdateComputersWireless network

After poking around for a few minutes several facts became clear:

1) At least one GPIB interface has a hard ethernet connection (and does not currently go through the wireless).

2) The wireless on the laptop works fine, since it can connect to the router.

3) The rest of the martian network cannot talk to the router.

This led to me replugging the ethernet cord back into the wireless router, which at some point in the past had been unplugged.  The computers now seem to be happy and can talk to each other.

 

  1682   Wed Jun 17 01:07:50 2009 robConfigurationComputersmatapps on /cvs/cds

I checked out a copy of matapps into /cvs/cds/caltech/apps/lscsoft so that I could find the matlab function strassign.m, which is necessary for some old mDV commands to run.  I don't know why it became necessary or why it disappeared if it did.

  1683   Wed Jun 17 01:09:47 2009 robUpdateComputers/cvs/cds 91% full

In /cvs/cds/caltech

 


1.6M    2008-8-15.pdf
2.9M    40mUpgradeOpticalLayoutPlan01.pdf
2.4M    alh
19M     apache
18G     apps
11M     archive
4.0K    authorized_keys2
8.0K    backup.notes
8.0K    backup.notes~
1.9G    build
62G     burt
47M     cds
13M     cds40m
37M     chans
70G     conlog
52K     crontab
12K     cshrc.40m
12K     cshrc.40m~
36M     diag
1.4G    dmt
8.2M    framecpp-0.2.0
1.7M    free_080730.pdf
57M     gds
9.8G    home
60K     hooks
8.0K    hosts.40m
4.0K    id_rsa
10M     iscmodeling
110M    ldg-4.7
648M    libs
4.0K    log2.txt
224K    logs
0       log.txt
238M    medm
344M    NB
148M    NB_080304
211M    NB_080307
401M    NB40
1.2G    noisebudget.071109
837M    noisebudget.bak.20060623
3.5M    oldtarget
123M    root
5.7M    savesets
208K    schematics
655M    scripts
13G     scripts_archive
1.1M    state
3.7G    svn
4.0K    svn-commit.tmp
7.3G    target
295M    target_archive
6.7M    test
72K     test.png
4.0K    tmp
8.0K    typescript
35G     users
205M    wind

  1686   Fri Jun 19 13:38:42 2009 AlbertoConfigurationComputerselog rebooted

Today I found the elog down, so I rebooted it following the instructions in the wiki.

  1688   Fri Jun 19 14:30:47 2009 AlbertoConfigurationComputerselog rebooted

Quote:

Today I found the elog down, so I rebooted it following the instructions in the wiki.

 I have the impression that Nodus has been rebooted since last night, hasn't it?

  1689   Sun Jun 21 00:08:26 2009 ranaConfigurationComputerselog rebooted

 

nodus:log>dmesg

Sun Jun 21 00:06:43 PDT 2009
Mar  6 15:46:32 nodus sshd[26490]: [ID 800047 auth.crit] fatal: Timeout before authentication for 131.215.114.93
Mar 10 11:11:32 nodus sshd[22775]: [ID 800047 auth.crit] fatal: Timeout before authentication for 131.215.114.93
Mar 11 13:27:37 nodus sshd[7768]: [ID 800047 auth.error] error: connect_to 131.215.115.52 port 7000: Connection refused
Mar 11 13:27:37 nodus sshd[7768]: [ID 800047 auth.error] error: connect_to nodus port 7000: failed.
Mar 11 13:31:40 nodus sshd[7768]: [ID 800047 auth.error] error: connect_to 131.215.115.52 port 7000: Connection refused
Mar 11 13:31:40 nodus sshd[7768]: [ID 800047 auth.error] error: connect_to nodus port 7000: failed.
Mar 11 13:31:45 nodus sshd[7768]: [ID 800047 auth.error] error: connect_to 131.215.115.52 port 7000: Connection refused
Mar 11 13:31:45 nodus sshd[7768]: [ID 800047 auth.error] error: connect_to nodus port 7000: failed.
Mar 11 13:34:58 nodus sshd[7768]: [ID 800047 auth.error] error: connect_to 131.215.115.52 port 7000: Connection refused
Mar 11 13:34:58 nodus sshd[7768]: [ID 800047 auth.error] error: connect_to nodus port 7000: failed.
Mar 12 16:09:23 nodus sshd[22785]: [ID 800047 auth.crit] fatal: Timeout before authentication for 131.215.114.93
Mar 14 20:14:42 nodus sshd[13563]: [ID 800047 auth.crit] fatal: Timeout before authentication for 131.215.114.93
Mar 25 19:47:19 nodus sudo: [ID 702911 local2.alert] controls : 3 incorrect password attempts ; TTY=pts/2 ; PWD=/cvs/cds ; USER=root ; COMMAND=/usr/bin/rm -rf kamioka/
Mar 25 19:48:46 nodus su: [ID 810491 auth.crit] 'su root' failed for controls on /dev/pts/2
Mar 25 19:49:17 nodus last message repeated 2 times
Mar 25 19:51:14 nodus sudo: [ID 702911 local2.alert] controls : 1 incorrect password attempt ; TTY=pts/2 ; PWD=/cvs/cds ; USER=root ; COMMAND=/usr/bin/rm -rf kamioka/
Mar 25 19:51:22 nodus su: [ID 810491 auth.crit] 'su root' failed for controls on /dev/pts/2
Jun  8 16:12:17 nodus su: [ID 810491 auth.crit] 'su root' failed for controls on /dev/pts/4

nodus:log>uptime
 12:06am  up 150 day(s), 11:52,  1 user,  load average: 0.05, 0.07, 0.07

  1699   Wed Jun 24 14:43:19 2009 robUpdateComputerstdsresp on linux => pzresp

 

tdsresp is broken on our linux control room machines.  I made a little perl replacement which uses the DiagTools.pm perl module, called pzresp.  It's in the $SCRIPTS/general directory, and so in the path of all the machines.  I also edited the cshrc.40m file so that on linux machines tdsresp points to this perl replacement.

I've patched DiagTools.pm to circumnavigate the tdsdmd bug described here.  I also added a function to DiagTools.pm called diagRespNoLog, which is just like diagResp but without that pesky log file.

 


Here's the output from the tdsresp binary on CentOS:
allegra:~>tdsresp 941.54 10000 100 10 C1:LSC-ITMX_EXC C1:LSC-PD1_Q C1:LSC-PD1_I
nan nan nan nan nan nan nan
nan nan nan nan nan nan nan
nan nan nan nan nan nan nan
*** glibc detected *** tdsresp: free(): invalid next size (fast): 0x089483e8 ***
======= Backtrace: =========
/lib/libc.so.6[0xa800f1]
/lib/libc.so.6(cfree+0x90)[0xa83bc0]
/usr/lib/libstdc++.so.6(_ZdlPv+0x21)[0xf7f36571]
tdsresp[0x8057fbb]
tdsresp[0x805b394]
/lib/libc.so.6(__libc_start_main+0xdc)[0xa2ce8c]
tdsresp(__gxx_personality_v0+0x169)[0x804ddd1]
======= Memory map: ========
00242000-00249000 r-xp 00000000 fd:00 15400987                           /lib/librt-2.5.so
00249000-0024a000 r--p 00006000 fd:00 15400987                           /lib/librt-2.5.so
0024a000-0024b000 rw-p 00007000 fd:00 15400987                           /lib/librt-2.5.so
009f9000-00a13000 r-xp 00000000 fd:00 15400963                           /lib/ld-2.5.so
00a13000-00a14000 r--p 00019000 fd:00 15400963                           /lib/ld-2.5.so
00a14000-00a15000 rw-p 0001a000 fd:00 15400963                           /lib/ld-2.5.so
00a17000-00b55000 r-xp 00000000 fd:00 15400974                           /lib/libc-2.5.so
00b55000-00b57000 r--p 0013e000 fd:00 15400974                           /lib/libc-2.5.so
00b57000-00b58000 rw-p 00140000 fd:00 15400974                           /lib/libc-2.5.so
00b58000-00b5b000 rw-p 00b58000 00:00 0 
00b5d000-00b70000 r-xp 00000000 fd:00 15400984                           /lib/libpthread-2.5.so
00b70000-00b71000 r--p 00012000 fd:00 15400984                           /lib/libpthread-2.5.so
00b71000-00b72000 rw-p 00013000 fd:00 15400984                           /lib/libpthread-2.5.so
00b72000-00b74000 rw-p 00b72000 00:00 0 
00b76000-00b78000 r-xp 00000000 fd:00 15400981                           /lib/libdl-2.5.so
00b78000-00b79000 r--p 00001000 fd:00 15400981                           /lib/libdl-2.5.so
00b79000-00b7a000 rw-p 00002000 fd:00 15400981                           /lib/libdl-2.5.so
00b7c000-00ba1000 r-xp 00000000 fd:00 15400975                           /lib/libm-2.5.so
00ba1000-00ba2000 r--p 00024000 fd:00 15400975                           /lib/libm-2.5.so
00ba2000-00ba3000 rw-p 00025000 fd:00 15400975                           /lib/libm-2.5.so
00bca000-00bdd000 r-xp 00000000 fd:00 15401011                           /lib/libnsl-2.5.so
00bdd000-00bde000 r--p 00012000 fd:00 15401011                           /lib/libnsl-2.5.so
00bde000-00bdf000 rw-p 00013000 fd:00 15401011                           /lib/libnsl-2.5.so
00bdf000-00be1000 rw-p 00bdf000 00:00 0 
00dca000-00dd5000 r-xp 00000000 fd:00 15400986                           /lib/libgcc_s-4.1.2-20080825.so.1
00dd5000-00dd6000 rw-p 0000a000 fd:00 15400986                           /lib/libgcc_s-4.1.2-20080825.so.1
08048000-080b7000 r-xp 00000000 00:17 6455328                            /cvs/cds/caltech/apps/linux/tds/bin/tdsresp
080b7000-080ba000 rw-p 0006e000 00:17 6455328                            /cvs/cds/caltech/apps/linux/tds/bin/tdsresp
080ba000-080bb000 rw-p 080ba000 00:00 0 
0893d000-0896b000 rw-p 0893d000 00:00 0                                  [heap]
f5e73000-f5e74000 ---p f5e73000 00:00 0 
f5e74000-f6874000 rw-p f5e74000 00:00 0 
f692d000-f6931000 r-xp 00000000 fd:00 15400995                           /lib/libnss_dns-2.5.so
f6931000-f6932000 r--p 00003000 fd:00 15400995                           /lib/libnss_dns-2.5.so
f6932000-f6933000 rw-p 00004000 fd:00 15400995                           /lib/libnss_dns-2.5.so
f6956000-f6a12000 rw-p f6a31000 00:00 0 
f6a74000-f6a7d000 r-xp 00000000 fd:00 15400997                           /lib/libnss_files-2.5.so
f6a7d000-f6a7e000 r--p 00008000 fd:00 15400997                           /lib/libnss_files-2.5.so
f6a7e000-f6a7f000 rw-p 00009000 fd:00 15400997                           /lib/libnss_files-2.5.so
f6a7f000-f6a80000 ---p f6a7f000 00:00 0 
f6a80000-f7480000 rw-p f6a80000 00:00 0 
f7480000-f7481000 ---p f7480000 00:00 0 
f7481000-f7e83000 rw-p f7481000 00:00 0 
f7e83000-f7f63000 r-xp 00000000 fd:00 6236924                            /usr/lib/libstdc++.so.6.0.8
f7f63000-f7f67000 r--p 000df000 fd:00 6236924                            /usr/libAbort

 
  1702   Thu Jun 25 17:27:42 2009 ranaUpdateComputerstdsresp on linux
I downloaded the tdsresp.pl from LLO and put it into the various TDS/bin paths. Also updated the LLO specific path stuff. It runs.
  1722   Wed Jul 8 11:13:36 2009 AlbertoOmnistructureComputerswireless router disconnected

Once again, this morning I found the wireless router disconnected from the LAN cable. No martian WiFi was available.

I wonder who is been doing that and for what reason.

  1733   Sun Jul 12 20:06:44 2009 JenneDAQComputersAll computers down

I popped by the 40m, and was dismayed to find that all of the front end computers are red (only framebuilder, DAQcontroler, PEMdcu, and c1susvmw1 are green....all the rest are RED).

 

I keyed the crates, and did the telnet.....startup.cmd business on them, and on c1asc I also pushed the little reset button on the physical computer and tried the telnet....startup.cmd stuff again.  Utter failure. 

 

I have to pick someone up from the airport, but I'll be back in an hour or two to see what more I can do.

  1735   Mon Jul 13 00:34:37 2009 AlbertoDAQComputersAll computers down

Quote:

I popped by the 40m, and was dismayed to find that all of the front end computers are red (only framebuilder, DAQcontroler, PEMdcu, and c1susvmw1 are green....all the rest are RED).

 

I keyed the crates, and did the telnet.....startup.cmd business on them, and on c1asc I also pushed the little reset button on the physical computer and tried the telnet....startup.cmd stuff again.  Utter failure. 

 

I have to pick someone up from the airport, but I'll be back in an hour or two to see what more I can do.

 I think the problem was caused by a failure of the RFM network: the RFM MEDM screen showed frozen values even when I was power recycling any of the FE computers. So I tried the following things:

- resetting the RFM switch
- power cycling the FE computers
- rebooting the framebuilder
 
but none of them worked.  The FEs didn't come back. Then I reset C1DCU1 and power cycled C1DAQCTRL.
 
After that, I could restart the FEs by power recycling them again. They all came up again except for C1DAQADW. Neither the remote reboot or the power cycling could bring it up.
 
After every attempt of restarting it its lights on the DAQ MEDM  screen turned green only for a fraction of a second and then became red again.
 
So far every attempt to reanimate it failed.
  1736   Mon Jul 13 00:53:50 2009 AlbertoDAQComputersAll computers down

Quote:

Quote:

I popped by the 40m, and was dismayed to find that all of the front end computers are red (only framebuilder, DAQcontroler, PEMdcu, and c1susvmw1 are green....all the rest are RED).

 

I keyed the crates, and did the telnet.....startup.cmd business on them, and on c1asc I also pushed the little reset button on the physical computer and tried the telnet....startup.cmd stuff again.  Utter failure. 

 

I have to pick someone up from the airport, but I'll be back in an hour or two to see what more I can do.

 I think the problem was caused by a failure of the RFM network: the RFM MEDM screen showed frozen values even when I was power recycling any of the FE computers. So I tried the following things:

- resetting the RFM switch
- power cycling the FE computers
- rebooting the framebuilder
 
but none of them worked.  The FEs didn't come back. Then I reset C1DCU1 and power cycled C1DAQCTRL.
 
After that, I could restart the FEs by power recycling them again. They all came up again except for C1DAQADW. Neither the remote reboot or the power cycling could bring it up.
 
After every attempt of restarting it its lights on the DAQ MEDM  screen turned green only for a fraction of a second and then became red again.
 
So far every attempt to reanimate it failed.

 

After Alberto's bootfest which was more successful than mine, I tried powercycling the AWG crate one more time.  No success.  Just as Alberto had gotten, I got the DAQ screen's AWG lights to flash green, then go back to red.  At Alberto's suggestion, I also gave the physical reset button another try.  Another round of flash-green-back-red ensued.

When I was in a few hours ago while everything was hosed, all the other computer's 'lights' on the DAQ screen were solid red, but the two AWG lights were flashing between green and red, even though I was power cycling the other computers, not touching the AWG at the time.  Those are the lights which are now solid red, except for a quick flash of green right after a reboot.

I poked around in the history of the curren and old elogs, and haven't found anything referring to this crazy blinking between good and bad-ness for the AWG computers.  I don't know if this happens when the tpman goes funky (which is referred to a lot in the annals of the elog in the same entries as the AWG needing rebooting) and no one mentions it, or if this is a new problem.  Alberto and I have decided to get Alex/someone involved in this, because we've exhausted our ideas. 

  1737   Mon Jul 13 15:14:57 2009 AlbertoUpdateComputersDAQAWG

Today Alex came over, performed his magic rituals on the DAQAWG computer and fixed it. Now it's up and running again.

I asked him what he did, but he's not sure of what fixed it. He couldn't remember exactly but he said that he poked around, did something somewhere somehow, maybe he tinkered with tpman and eventually the computer went up again.

Now everything is fine.

  1743   Tue Jul 14 14:54:19 2009 steveConfigurationComputersfb40m2 in 1Y6

Alex and Steve,

SunFire x4600 ( not  MEGATRON 2 , it is fb40m2 ) and JetStor ( 16 x 1 TB drives ) were installed on side rails at the bottom of 1Y6

We cleaned up the fibres and cabling in 1Y7 also

  1746   Wed Jul 15 08:59:30 2009 steveUpdateComputersfb40m

The fb40m just went out of order with status indicator number 8

It recovered on its own five minutes later.

  1752   Wed Jul 15 17:18:24 2009 JenneDAQComputersDAQAWG gone, now back

Yet again, the DAQAWG flipped out for an unknowable reason.  In order of restart activities listed on the Wiki, I keyed the crate and nothing really happened, then I hit the physical reset button and nothing happened, and then I did the 'telnet....vmeBusReset', and a couple minutes later, it was all good again.

  1756   Thu Jul 16 09:49:52 2009 AlanUpdateComputersfb40m

Quote:

The fb40m just went out of order with status indicator number 8

It recovered on its own five minutes later.

 Backup script restarted, backup of trend frames and /cvs/cds is up-to-date.

 

  1766   Tue Jul 21 01:11:30 2009 DmassAoGComputersAlarms going off

I came into the 40m to sign things out briefly then swiftly return them, and the alarms were going off on op540m at 1am.

 

The cat and donkey? were making much noise.

  1771   Tue Jul 21 18:46:47 2009 steveConfigurationComputerscomputers are down

All suspentions are kicked up. Sus dampings and  oplev servos turned off.

c1iscey and c1lsc are down. c1asc and c1iovme are up-down

  1772   Wed Jul 22 01:57:19 2009 AlbertoConfigurationComputerscomputers are down

Quote:

All suspentions are kicked up. Sus dampings and  oplev servos turned off.

c1iscey and c1lsc are down. c1asc and c1iovme are up-down

 The computers and RFM network are up working again. A boot fest was necessary.Then I restored all the parameters with burtgooey.

The mode cleaner alignment is in a bad state. The autolocker can't get it locked. I don't know what caused it to move so far from the good state that it was till this afternoon.  I went tuning the periscope but the cavity alignment is so bad that it's taking more time than expected. I'll continue working on that tomorrow morning.

  1773   Wed Jul 22 09:04:10 2009 AlbertoConfigurationComputerscomputers are down

Quote:

Quote:

All suspentions are kicked up. Sus dampings and  oplev servos turned off.

c1iscey and c1lsc are down. c1asc and c1iovme are up-down

 The computers and RFM network are up working again. A boot fest was necessary.Then I restored all the parameters with burtgooey.

The mode cleaner alignment is in a bad state. The autolocker can't get it locked. I don't know what caused it to move so far from the good state that it was till this afternoon.  I went tuning the periscope but the cavity alignment is so bad that it's taking more time than expected. I'll continue working on that tomorrow morning.

 I now suspect that after the reboot the MC mirrors didn't really go back to their original place even if the MC sliders were on the same positions as before.

  1776   Wed Jul 22 11:14:11 2009 AlbertoConfigurationComputerscomputers are down

Quote:

Quote:

Quote:

All suspentions are kicked up. Sus dampings and  oplev servos turned off.

c1iscey and c1lsc are down. c1asc and c1iovme are up-down

 The computers and RFM network are up working again. A boot fest was necessary.Then I restored all the parameters with burtgooey.

The mode cleaner alignment is in a bad state. The autolocker can't get it locked. I don't know what caused it to move so far from the good state that it was till this afternoon.  I went tuning the periscope but the cavity alignment is so bad that it's taking more time than expected. I'll continue working on that tomorrow morning.

 I now suspect that after the reboot the MC mirrors didn't really go back to their original place even if the MC sliders were on the same positions as before.

 Alberto, Rob,

we diagnosed the problem. It was related with sticky sliders. After a reboot of C1:IOO the actual output of the DAC does not correspond anymore to the values read on the sliders. In order to update the actual output it is necessary to do a change of the values of the sliders, i.e. jiggling a bit with them.

ELOG V3.1.3-