  40m Log, Page 47 of 344
ID   Date   Author   Type   Category   Subject
  9387   Thu Nov 14 22:23:22 2013   Jenne   Update   CDS   Can't talk to AUXEY?

Quote:

The restore scripts from the IFO config screen half-failed, with this error:

retrying (1/5)...
retrying (2/5)...
CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "c1auxey.martian:5064"
    Source File: ../cac.cpp line 1214
    Current Time: Wed Nov 13 2013 17:24:00.389261330
..................................................................

Jamie, do you know what this might be?  When requested, ETMY was not misaligned or restored, but we got these errors.  So, somehow we're not talking properly to EY, but other things seem fine (the models are running okay, the suspension is damped, etc, etc.)

 This problem is now worse - the sliders on IFO_ALIGN for ETMY are white.  I can't telnet to the machine either, although auxex works okay.  Rather, it looks like maybe I'm getting to auxey, but then I'm immediately booted.  I can ping both c1auxex and c1auxey with no problem.
 

Heeeeelllp please.  Is this just a "shut off, then turn back on" problem?  I'm wary of hard rebooting things, with all the warnings and threats in the elog lately.  I've sent an email to Jamie to ping him.

There are some vague instructions in the wiki, but they begin at doing the burt restores, not actually restarting the computers: wiki.  Back in July, elog 8858 was written, on which the wiki instructions seem to be based.  But in the elog it says "...went to the /cvs/cds/caltech/target/ area and started to (one by one) inspect all of the targets to see if they were alive.", and I don't know what "inspect" means in this case.  I probably should, since I've been here for something like a millennium, but I don't.


controls@rossa:~ 0$ telnet c1auxey
Trying 192.168.113.60...
Connected to c1auxey.martian.
Escape character is '^]'.
Connection closed by foreign host.
controls@rossa:~ 1$ telnet c1auxex
Trying 192.168.113.59...
Connected to c1auxex.martian.
Escape character is '^]'.

c1auxex >
telnet> ^]
?Invalid command
telnet> exit
?Invalid command
telnet> quit
Connection closed.
controls@rossa:~ 0$ telnet c1auxey
Trying 192.168.113.60...
Connected to c1auxey.martian.
Escape character is '^]'.
Connection closed by foreign host.
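
The pattern in the transcripts above (c1auxey accepts the TCP connection and then closes it at once, while c1auxex stays connected and shows a prompt) can be told apart from a host that cannot be reached at all with a small probe. This is only a minimal sketch, not an existing 40m script: the function name `classify_telnet`, the three labels it returns, and the one-second wait are assumptions made for illustration.

```python
import socket


def classify_telnet(host, port=23, wait=1.0):
    """Return 'unreachable', 'dropped', or 'open' for a TCP service.

    'dropped' means the connection was accepted and then closed by the
    remote end before it sent anything, as c1auxey did above.
    """
    try:
        sock = socket.create_connection((host, port), timeout=wait)
    except OSError:
        # Covers refused connections, unreachable hosts, and connect timeouts.
        return "unreachable"
    with sock:
        sock.settimeout(wait)
        try:
            data = sock.recv(64)
        except socket.timeout:
            # Still connected; the server simply has not sent anything yet.
            return "open"
        except ConnectionResetError:
            return "dropped"
        # An empty read means the peer closed the connection.
        return "open" if data else "dropped"
```

Run from a control room machine, `classify_telnet("c1auxey")` would be expected to give `"dropped"` for the behaviour shown in this entry, but that is an inference from the transcript, not something that was tested.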
  9391   Fri Nov 15 10:19:26 2013   manasa   Update   CDS   Can't talk to AUXEY?

Quote:

 

 This problem is now worse - the sliders on IFO_ALIGN for ETMY are white.  I can't telnet to the machine either, although auxex works okay.  Rather, it looks like maybe I'm getting to auxey, but then I'm immediately booted.  I can ping both c1auxex and c1auxey with no problem.
 

Heeeeelllp please.  Is this just a "shut off, then turn back on" problem?  I'm wary of hard rebooting things, with all the warnings and threats in the elog lately.  I've sent an email to Jamie to ping him.

There are some vague instructions in the wiki, but they begin at doing the burt restores, not actually restarting the computers: wiki.  Back in July, elog 8858 was written, on which the wiki instructions seem to be based.  But in the elog it says "...went to the /cvs/cds/caltech/target/ area and started to (one by one) inspect all of the targets to see if they were alive.", and I don't know what "inspect" means in this case.  I probably should, since I've been here for something like a millennium, but I don't.

 

This is what was done (as I recollect) when we said "inspected": telnet into the computers, ping them, and look at their status. Since c1auxey is not responding, here is how c1auxex responds.

controls@rossa:/cvs/cds/caltech/target 0$ telnet c1auxex
Trying 192.168.113.59...
Connected to c1auxex.martian.
Escape character is '^]'.

c1auxex > h
  1  i
  2  -help
  3  --help
  4  h
  5  2
  6  h
  7  -help
  8  i
  9  h
value = 0 = 0x0
c1auxex > i

  NAME        ENTRY       TID    PRI   STATUS      PC       SP     ERRNO  DELAY
---------- ------------ -------- --- ---------- -------- -------- ------- -----
tExcTask   _excTask       fde244   0 PEND          87094   fde1ac   3006b     0
tLogTask   _logTask       fdb944   0 PEND          87094   fdb8a8       0     0
tShell     _shell         ddad00   1 READY         6d974   dda9c8  3d0001     0
tRlogind   _rlogind       fbc11c   2 PEND          2b604   fbbdf4       0     0
tTelnetd   _telnetd       fba278   2 PEND          2b604   fba1a8       0     0
tTelnetOutT_telnetOutTa   db7578   2 READY         2b604   db72e0       0     0
tTelnetInTa_telnetInTas   db6060   2 READY         2b5dc   db5d68       0     0
callback   _callbackTas   f7941c  40 PEND          2b604   f793d4       0     0
scanEvent  ee7ca8         ecacb4  41 PEND          2b604   ecac6c       0     0
tNetTask   _netTask       fd75b8  50 READY         6be6c   fd7550       0     0
scanPeriod ee78f8         ecd554  53 READY         6d192   ecd508       0     0
scanPeriod ee78f8         f23e48  54 DELAY         6d192   f23dfc       0     6
tFtpdTask  _ftpdTask      fb7848  55 PEND          2b604   fb778c       0     0
scanPeriod ee78f8         f266e8  55 READY         6d192   f2669c       0     0
scanPeriod ee78f8         f38678  56 READY         6d192   f3862c       0     0
callback   _callbackTas   f7bcbc  57 PEND          2b604   f7bc74       0     0
scanPeriod ee78f8         f906d8  57 DELAY         6d192   f9068c       0    59
scanPeriod ee78f8         f995ac  58 DELAY         6d192   f99560       0   238
scanPeriod ee78f8         f9c908  59 DELAY         6d192   f9c8bc       0   538
callback   _callbackTas   fa4c1c  65 PEND          2b604   fa4bd4       0     0
scanOnce   ee7764         f9f96c  65 PEND          2b604   f9f92c       0     0
epicsPrint f0501c         e88fa0  70 PEND          2b604   e88f64   c0002     0
ts_Casync  ee5bae         f76b7c  70 DELAY         6d192   f76880  3d0004   178
tPortmapd  _portmapd      fb8d60 100 PEND          2b604   fb8c2c      16     0
EgRam      ea00e4         fa14ac 100 PEND          2b604   fa1458       0     0
CA client  _camsgtask     d85878 180 PEND          2b604   d85774  3d0004     0
CA client  _camsgtask     df91e8 180 PEND          2b604   df90e4       0     0
CA client  _camsgtask     d98bf4 180 PEND          2b604   d98af0       0     0
CA client  _camsgtask     e03cd0 180 PEND          2b604   e03bcc       0     0
CA client  _camsgtask     ddf2b8 180 PEND          2b604   ddf1b4       0     0
CA client  _camsgtask     faaec8 180 PEND          2b604   faadc4       0     0
CA client  _camsgtask     d79f3c 180 PEND          2b604   d79e38       0     0
CA TCP     _req_server    f305dc 181 PEND          2b604   f30540       0     0
CA repeaterf109e2         f215a8 181 PEND          2b604   f21474       0     0
CA event   _event_task    d7fe58 181 PEND          2b604   d7fe10       0     0
CA event   _event_task    d6ce5c 181 PEND          2b604   d6ce14       0     0
CA event   _event_task    dab7e0 181 PEND          2b604   dab798       0     0
CA event   _event_task    d76efc 181 PEND          2b604   d76eb4       0     0
CA event   _event_task    d9bddc 181 PEND          2b604   d9bd94       0     0
CA event   _event_task    d9a864 181 PEND          2b604   d9a81c       0     0
CA event   _event_task    da8d8c 181 PEND          2b604   da8d44       0     0
CA UDP     _cast_server   f2f064 182 READY        efcabe   f2efe4       0     0
CA online  _rsrv_online   f2d84c 183 DELAY         6d192   f2d7bc       0   265
EV save_res_event_task    de88dc 189 PEND          2b604   de8894   3006b     0
save_restor_save_restor   df61cc 190 PEND          2b604   df5c44  3d0002     0
RD save_res_cac_recv_ta   fb47d8 191 READY         2b604   fb46a4  3d0004     0
logRestart f05d42         e861c0 200 PEND+T        2b604   e86174      33  1714
taskwd     ef4d46         e85030 200 DELAY         6d192   e84f7c       0   224
value = 0 = 0x0
c1auxex >
telnet> quit
Connection closed.
controls@rossa:/cvs/cds/caltech/target 0$
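
To compare a healthy target with a misbehaving one, the task listing printed by `i` can be reduced to a count of tasks per STATUS. The sketch below is an illustration only; the helper names and the sample rows (copied from the listing above) are assumptions, and the column layout is taken from this one transcript. Because NAME and ENTRY sometimes run together (for example `tTelnetOutT_telnetOutTa`), each row is split from the right, where the columns are unambiguous.

```python
from collections import Counter

# A few rows copied from the c1auxex listing above.
TASK_SAMPLE = """\
  NAME        ENTRY       TID    PRI   STATUS      PC       SP     ERRNO  DELAY
---------- ------------ -------- --- ---------- -------- -------- ------- -----
tExcTask   _excTask       fde244   0 PEND          87094   fde1ac   3006b     0
tTelnetOutT_telnetOutTa   db7578   2 READY         2b604   db72e0       0     0
CA repeaterf109e2         f215a8 181 PEND          2b604   f21474       0     0
logRestart f05d42         e861c0 200 PEND+T        2b604   e86174      33  1714
taskwd     ef4d46         e85030 200 DELAY         6d192   e84f7c       0   224
value = 0 = 0x0
"""


def parse_task_line(line):
    """Split one row of the task listing into a dict, or return None."""
    parts = line.rsplit(None, 7)  # at most 8 pieces, counted from the right
    if len(parts) != 8:
        return None
    name_entry, tid, pri, status, pc, sp, errno, delay = parts
    if not (pri.isdigit() and delay.isdigit()):
        return None  # header, separator, or prompt line
    return {
        "name_entry": name_entry.strip(),
        "tid": tid,
        "pri": int(pri),
        "status": status,
        "errno": errno,
        "delay": int(delay),
    }


def status_counts(text):
    """Count tasks per STATUS value in a pasted listing."""
    rows = (parse_task_line(line) for line in text.splitlines())
    return Counter(row["status"] for row in rows if row)
```

Calling `status_counts` on a saved listing from each end station gives a quick side-by-side view of which tasks are pending, ready, or delayed.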

  9393   Fri Nov 15 10:49:55 2013   jamie   Update   CDS   Can't talk to AUXEY?

Please just try rebooting the vxworks machine.  I think there is a key on the card or crate that will reset the device.  These machines are "embedded" so they're designed to be hard reset, so don't worry, just restart the damn thing and see if that fixes the problem.

  9394   Fri Nov 15 12:00:28 2013   Koji   Update   CDS   Can't talk to AUXEY?

Quote:

Please just try rebooting the vxworks machine.  I think there is a key on the card or crate that will reset the device.  These machines are "embedded" so they're designed to be hard reset, so don't worry, just restart the damn thing and see if that fixes the problem.

 Don't forget to run burtrestore for the target.

  9395   Fri Nov 15 12:38:50 2013   Jenne   Update   CDS   Can't talk to AUXEY?

Quote:

Please just try rebooting the vxworks machine.  I think there is a key on the card or crate that will reset the device.  These machines are "embedded" so they're designed to be hard reset, so don't worry, just restart the damn thing and see if that fixes the problem.

 This is what I remember doing all the time when Rob was around, but with all the new computers, I forgot whether or not this was allowed for the slow computers.

Anyhow, I went down there and keyed the crate, but auxey isn't coming back.  I'll give it a few more minutes and check again, but then I might go and power cycle it again.  If that doesn't work, we may have a much bigger problem.

  9396   Fri Nov 15 13:26:00 2013   Jenne   Update   CDS   AUXEY is back

Quote:

Quote:

Please just try rebooting the vxworks machine.  I think there is a key on the card or crate that will reset the device.  These machines are "embedded" so they're designed to be hard reset, so don't worry, just restart the damn thing and see if that fixes the problem.

 This is what I remember doing all the time when Rob was around, but with all the new computers, I forgot whether or not this was allowed for the slow computers.

Anyhow, I went down there and keyed the crate, but auxey isn't coming back.  I'll give it a few more minutes and check again, but then I might go and power cycle it again.  If that doesn't work, we may have a much bigger problem.

 I went and keyed the crate again, and this time the computer came back.  I burt restored to Nov 10th.  ETMY is damping again.

  9402   Mon Nov 18 21:20:54 2013   Jenne   Update   CDS   Can't talk to AUXEY?

Quote:

The restore scripts from the IFO config screen half-failed, with this error:

retrying (1/5)...
retrying (2/5)...
CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "c1auxey.martian:5064"
    Source File: ../cac.cpp line 1214
    Current Time: Wed Nov 13 2013 17:24:00.389261330
..................................................................

Jamie, do you know what this might be?  When requested, ETMY was not misaligned or restored, but we got these errors.  So, somehow we're not talking properly to EY, but other things seem fine (the models are running okay, the suspension is damped, etc, etc.)

 The auxey machine is back, in that I can interact with the IFO_ALIGN sliders, and they actually make the optic move, but I still can't read from or write to the EPICS channels:

controls@rossa:/opt/rtcds/caltech/c1/medm/MISC/ifoalign/burt 0$ cdsutils read C1:SUS-ETMY_PIT_COMM
CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "c1auxey.martian:5064"
    Source File: ../cac.cpp line 1214
    Current Time: Mon Nov 18 2013 21:13:52.044973819
..................................................................
Could not connect to channel (timeout=2s): C1:SUS-ETMY_PIT_COMM
controls@rossa:/opt/rtcds/caltech/c1/medm/MISC/ifoalign/burt 1$ cdsutils read C1:SUS-ETMY_YAW_COMM
CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "c1auxey.martian:5064"
    Source File: ../cac.cpp line 1214
    Current Time: Mon Nov 18 2013 21:14:07.040168660
..................................................................
Could not connect to channel (timeout=2s): C1:SUS-ETMY_YAW_COMM
controls@rossa:/opt/rtcds/caltech/c1/medm/MISC/ifoalign/burt 1$
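
When many channels fail at once, it helps to reduce output like the above to which IOC the client was talking to and which channels timed out. The following is a rough sketch under the assumption that the output always has the shape shown in this entry; the function name `summarize` and the regular expressions are illustrative, not part of cdsutils or EPICS.

```python
import re

# Output copied from the first cdsutils read attempt above.
CA_SAMPLE = '''\
CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "c1auxey.martian:5064"
    Source File: ../cac.cpp line 1214
    Current Time: Mon Nov 18 2013 21:13:52.044973819
..................................................................
Could not connect to channel (timeout=2s): C1:SUS-ETMY_PIT_COMM
'''

FIELD = re.compile(r'^\s+(Warning|Context):\s+"(.*)"\s*$')
FAILED = re.compile(r'^Could not connect to channel \(timeout=\d+s\): (\S+)$')


def summarize(text):
    """Collect warning texts, IOC host:port contexts, and failed channels."""
    out = {"warnings": set(), "contexts": set(), "failed_channels": []}
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            key = "warnings" if match.group(1) == "Warning" else "contexts"
            out[key].add(match.group(2))
            continue
        match = FAILED.match(line)
        if match:
            out["failed_channels"].append(match.group(1))
    return out
```

If every failure carries the same context, as here with `c1auxey.martian:5064`, the problem is most likely on that one IOC and not on the client side.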

This is also causing trouble for the BURT save and BURT restore scripts that are called from the IFO_ALIGN screen.  If I look at the log that is written from an attempted 'save' of the slider values, I see:

**** READ BURT LOGFILE

--- Start processing files
file >/opt/rtcds/caltech/c1/medm/MISC/ifoalign/burt/ETMY.req<
preprocessing ... done
pv >C1:SUS-ETMY_PIT_COMM< nreq=-1
pv >C1:SUS-ETMY_YAW_COMM< nreq=-1
--- End processing files

--- Start searches
C1:SUS-ETMY_PIT_COMM ... ca_search_and_connect() ... OK
C1:SUS-ETMY_YAW_COMM ... ca_search_and_connect() ... OK
--- End searches
Waiting for 2 outstanding search(es) ...
Waiting for 2 outstanding search(es) ...
did not find 2

--- Start reads
C1:SUS-ETMY_PIT_COMM ... not connected so no ca_array_get_callback()
C1:SUS-ETMY_YAW_COMM ... not connected so no ca_array_get_callback()
--- End reads

--- Start wait for pending reads

-- End wait for pending reads 0 outstanding read(s)

**** END BURT LOGFILE

The burt save file has no values in it.  Even if I copy over the ETMX save file and put in the correct channel names and values, a burt restore is unsuccessful. 
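
Note that in the log above the `ca_search_and_connect() ... OK` lines are followed by `did not find 2`, so the `OK` evidently does not mean that a channel connected; the lines that matter are the `not connected` ones in the reads section. A minimal sketch for pulling those channel names out of a BURT log is given below. The function name and the sample text (trimmed from the log above) are assumptions for illustration, and the pattern is based only on this one log.

```python
import re

# Trimmed copy of the BURT log above.
BURT_LOG_SAMPLE = """\
--- Start searches
C1:SUS-ETMY_PIT_COMM ... ca_search_and_connect() ... OK
C1:SUS-ETMY_YAW_COMM ... ca_search_and_connect() ... OK
--- End searches
Waiting for 2 outstanding search(es) ...
did not find 2

--- Start reads
C1:SUS-ETMY_PIT_COMM ... not connected so no ca_array_get_callback()
C1:SUS-ETMY_YAW_COMM ... not connected so no ca_array_get_callback()
--- End reads
"""

NOT_CONNECTED = re.compile(r"^(\S+) \.\.\. not connected\b")


def unconnected_channels(log_text):
    """Return the channels that BURT reported as not connected."""
    channels = []
    for line in log_text.splitlines():
        match = NOT_CONNECTED.match(line)
        if match:
            channels.append(match.group(1))
    return channels
```

An empty list from `unconnected_channels` would suggest that a save actually read its channels; a non-empty list explains an empty save file.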

So, I can do locking tonight by restoring and misaligning by hand, but this sucks, and needs to be fixed. Other optics (at least PRM, SRM, ETMX) seem to be working just fine.  It's just ETMY that has a problem.

 

  9412   Tue Nov 19 15:04:14 2013   Jenne   Update   CDS   Can talk to AUXEY again

The ETMY sliders on IFO_ALIGN were white again this morning, so I went down to the Yend and pushed the RESET button on auxey.  I then did a burt restore to 00:07am this morning for both auxey and auxex (since the stickers on the machines are still the old naming convention, I wonder if the autoburt is also backwards, so I did both).  Now the 'save' and 'restore' scripts for ETMY are working again.

Hopefully it's all better now, but I'll keep an eye on it.

  9422   Fri Nov 22 09:54:22 2013   Steve   Update   CDS   DAQ?

Jamie, I think the computers know that you are away. c1lsc keeps going down.

The short time plots are correct.

Attachment 1: comp8d.png
  9425   Mon Nov 25 10:57:14 2013   Koji   Update   CDS   woes on the X-end hosts

This morning I came in to the 40m and found:
1) c1auxex was throwing out the same errors as recently seen.
2) c1iscex processes had errors which persisted even after the mx stream reset.

1) c1auxex - fixed

Tried telnet c1auxex => rejected by the host

Went down to the south end. Power cycled the target. Came back to the control room.
=> Confirmed the epics read/write is back.
Burtrestored the epics vars for the target to the snapshot on 31st Oct at 5:07.

2) c1iscex - still not fixed

ssh c1iscex
rtcds restart all
=> c1x01 is still in red. 
Followed the procedure on the elog entry 9007. => Still the same error.

At least c1x01 is stalled. Here is the status.

Sync Source is TDS.
C1:DAQ-DC0_C1X01_STATUS is 0x2bad.
C1:DAQ-DC0_C1X01_CRC_SUM stays 0.
The screen shot is attached.

dmesg related to c1x01

controls@c1iscex ~ 0$ dmesg |grep c1x01
[   32.152010] c1x01: startup time is 1069440223
[   32.152012] c1x01: cpu clock 3000325
[   32.152014] c1x01: Epics shmem set at 0xffffc9001489c000
[   32.152208] c1x01: IPC at 0xffffc90018947000
[   32.152209] c1x01: Allocated daq shmem; set at 0xffffc9000480c000
[   32.152210] c1x01: configured to use 4 cards
[   32.152211] c1x01: Initializing PCI Modules
[   32.152226] c1x01: ADC card on bus b; device 4 prim b
[   32.152227] c1x01: adc card on bus b; device 4 prim b
[   32.154801] c1x01: pci0 = 0xdc300400
[   32.154837] c1x01: pci2 = 0xdc300000
[   32.154842] c1x01: ADC I/O address=0xdc300000  0xffffc90003f62000
[   32.154845] c1x01: BCR = 0x84060
[   32.154858] c1x01: RAG = 0x117d8
[   32.154861] c1x01: BCR = 0x84260
[   32.583220] c1x01: SSC = 0x16
[   32.583223] c1x01: IDBC = 0x1f
[   32.583236] c1x01: DAC card on bus 14; device 4 prim 14
[   32.583237] c1x01: dac card on bus 14; device 4
[   32.584527] c1x01: pci0 = 0xdc400400
[   32.584546] c1x01: dac pci2 = 0xdc400000
[   32.584551] c1x01: DAC I/O address=0xdc400000  0xffffc90003f6a000
[   32.584555] c1x01: DAC BCR = 0x810
[   32.584678] c1x01: DAC BCR after init = 0x30080
[   32.584681] c1x01: DAC CSR = 0xffff
[   32.584687] c1x01: DAC BOR = 0x3415
[   32.584693] c1x01: set_8111_prefetch: subsys=0x8114; vendor=0x10e3
[   32.584722] c1x01: Contec 1616 DIO card on bus 23; device 0
[   32.593429] c1x01: contec 1616 dio pci2 = 0x4001
[   32.593430] c1x01: contec 1616 diospace = 0x4000
[   32.593434] c1x01: contec dio pci2 card number= 0x0
[   32.593439] c1x01: Contec BO card on bus 18; device 0
[   32.593447] c1x01: contec dio pci2 = 0x3001
[   32.593448] c1x01: contec32L diospace = 0x3000
[   32.593451] c1x01: contec dio pci2 card number= 0x0
[   32.593456] c1x01: 5565 RFM card on bus 7; device 4
[   32.597218] Modules linked in: c1x01(+) open_mx mbuf
[   32.599939]  [<ffffffffa002e430>] mapRfm+0x71/0x392 [c1x01]
[   32.600199]  [<ffffffffa002ec91>] mapPciModules+0x540/0x8cf [c1x01]
[   32.600458]  [<ffffffffa002f2c1>] init_module+0x2a1/0x9d6 [c1x01]
[   32.600717]  [<ffffffffa002f020>] ? init_module+0x0/0x9d6 [c1x01]
[   32.616194] c1x01: RFM address is 0xd8000000
[   32.616196] c1x01: CSR address is 0xdc000000
[   32.616206] c1x01: Board id = 0x65
[   32.616209] c1x01: DMA address is 0xdc000400
[   32.616213] c1x01: 5565DMA at 0xffffc90003f72400
[   32.616215] c1x01: 5565 INTCR = 0xf010100
[   32.616217] c1x01: 5565 INTCR = 0xf000000
[   32.616218] c1x01: 5565 MODE = 0x43
[   32.616220] c1x01: 5565 DESC = 0x0
[   32.616232] c1x01: 5 PCI cards found
[   32.616233] c1x01: ***************************************************************************
[   32.616234] c1x01: 1 ADC cards found
[   32.616235] c1x01:     ADC 0 is a GSC_16AI64SSA module
[   32.616236] c1x01:         Channels = 64
[   32.616236] c1x01:         Firmware Rev = 34
[   32.616238] c1x01: ***************************************************************************
[   32.616239] c1x01: 1 DAC cards found
[   32.616239] c1x01:     DAC 0 is a GSC_16AO16 module
[   32.616240] c1x01:         Channels = 16
[   32.616241] c1x01:         Filters = None
[   32.616242] c1x01:         Output Type = Differential
[   32.616242] c1x01:         Firmware Rev = 6
[   32.616244] c1x01: MASTER DAC SLOT 0 1
[   32.616244] c1x01: ***************************************************************************
[   32.616246] c1x01: 0 DIO cards found
[   32.616246] c1x01: ***************************************************************************
[   32.616248] c1x01: 0 IIRO-8 Isolated DIO cards found
[   32.616248] c1x01: ***************************************************************************
[   32.616250] c1x01: 0 IIRO-16 Isolated DIO cards found
[   32.616250] c1x01: ***************************************************************************
[   32.616252] c1x01: 1 Contec 32ch PCIe DO cards found
[   32.616252] c1x01: 1 Contec PCIe DIO1616 cards found
[   32.616253] c1x01: 0 Contec PCIe DIO6464 cards found
[   32.616254] c1x01: 2 DO cards found
[   32.616255] c1x01: TDS controller 0 is at 0
[   32.616256] c1x01: Total of 4 I/O modules found and mapped
[   32.616257] c1x01: ***************************************************************************
[   32.616259] c1x01: 1 RFM cards found
[   32.616260] c1x01:     RFM 0 is a VMIC_5565 module with Node ID 41
[   32.616261] c1x01: address is 0x18d80000
[   32.616261] c1x01: ***************************************************************************
[   32.616262] c1x01: Initializing space for daqLib buffers
[   32.616263] c1x01: Initializing Network
[   32.616264] c1x01: Found 1 frameBuilders on network
[   32.616265] c1x01: Epics burt restore is 0
[   33.616012] c1x01: Epics burt restore is 0
[   34.617018] c1x01: Epics burt restore is 0
[   35.618017] c1x01: Epics burt restore is 0
[   36.619011] c1x01: Epics burt restore is 0
[   37.621007] c1x01: Epics burt restore is 0
[   38.622008] c1x01: Epics burt restore is 0
[   39.733257] c1x01: Sync source = 4
[   39.733257] c1x01: Waiting for EPICS BURT Restore = 1
[   39.793001] c1x01: Waiting for EPICS BURT 0
[   39.793001] c1x01: BURT Restore Complete
[   39.793001] c1x01: Found a BQF filter 0
[   39.793001] c1x01: Found a BQF filter 1
[   39.793001] c1x01: Initialized servo control parameters.
[   39.794002] c1x01: DAQ Ex Min/Max = 1 3
[   39.794002] c1x01: DAQ XEx Min/Max = 3 53
[   39.794002] c1x01: DAQ Tp Min/Max = 10001 10007
[   39.794002] c1x01: DAQ XTp Min/Max = 10007 10507
[   39.794002] c1x01: DIRECT MEMORY MODE of size 64
[   39.794002] c1x01: daqLib DCU_ID = 19
[   39.794002] c1x01: Calling feCode() to initialize
[   39.794002] c1x01: entering the loop
[   39.794002] c1x01: ADC setup complete
[   39.794002] c1x01: DAC setup complete
[   39.794002] c1x01: writing BIO 0
[   39.814002] c1x01: writing DAC 0
[   39.814002] c1x01: Triggered the ADC
[   40.874003] c1x01: timeout 0 1000000
[   40.874003] c1x01: exiting from fe_code()
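
The important part of the dmesg output above is at the very end: about a second after "Triggered the ADC" the model logs a timeout and then "exiting from fe_code()". A short sketch for pulling that out of a long dmesg dump follows. It is an illustration only; the function name `model_summary`, the returned dictionary layout, and the trimmed sample are assumptions, and the message strings are matched exactly as they appear in this entry.

```python
import re

# Trimmed copy of the c1x01 dmesg output above.
DMESG_SAMPLE = """\
[   32.597218] Modules linked in: c1x01(+) open_mx mbuf
[   32.616232] c1x01: 5 PCI cards found
[   32.616234] c1x01: 1 ADC cards found
[   32.616239] c1x01: 1 DAC cards found
[   39.794002] c1x01: entering the loop
[   39.814002] c1x01: Triggered the ADC
[   40.874003] c1x01: timeout 0 1000000
[   40.874003] c1x01: exiting from fe_code()
"""

LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(\w+):\s+(.*)$")
CARDS = re.compile(r"^(\d+) (.+?) cards found$")


def model_summary(dmesg_text, model):
    """Report card counts and whether the model's fe_code() loop exited."""
    summary = {"cards": {}, "timeout": None, "exited": False}
    for raw in dmesg_text.splitlines():
        match = LINE.match(raw)
        if not match or match.group(2) != model:
            continue  # other models, stack traces, unrelated kernel lines
        stamp, msg = float(match.group(1)), match.group(3)
        found = CARDS.match(msg)
        if found:
            summary["cards"][found.group(2)] = int(found.group(1))
        elif msg.startswith("timeout"):
            summary["timeout"] = (stamp, msg)
        elif msg == "exiting from fe_code()":
            summary["exited"] = True
    return summary
```

A summary with `exited` set and a `timeout` entry right after the ADC trigger matches what is seen here: the cards are found and mapped, but the model stops as soon as it waits on the ADC.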

 

Attachment 1: Screenshot.png
  9426   Mon Nov 25 12:57:54 2013   Jamie   Update   CDS   timing problem at c1iscex IO chassis

There is definitely a timing distribution malfunction at the c1iscex IO chassis.  There is no timing link between the "Master Timer Sequencer D050239" at the 1X6 and the c1iscex IO chassis.  Link lights at both ends are dead.  No timing, no running models.

It does not appear to be a problem with the Master Timer Sequencer.  I moved the c1iscey link to the J15 port on the sequencer and it worked fine.  This means it's either a problem with the fiber or the timing card in the IO chassis.  The IO timing card is powered and does have what appear to be normal status lights on (except for the fiber link lights).  It's getting what I think is the nominal 4V power.  The connection to the IO chassis backplane board looks ok.  So maybe it's just a dead fiber issue?

I do not know what could have been the problem with c1auxex, or if it's related to the fast timing issue.

  9427   Mon Nov 25 17:28:33 2013   Jenne   Update   CDS   timing problem at c1iscex IO chassis

Quote:

There is definitely a timing distribution malfunction at the c1iscex IO chassis.  There is no timing link between the "Master Timer Sequencer D050239" at the 1X6 and the c1iscex IO chassis.  Link lights at both ends are dead.  No timing, no running models.

It does not appear to be a problem with the Master Timer Sequencer.  I moved the c1iscey link to the J15 port on the sequencer and it worked fine.  This means it's either a problem with the fiber or the timing card in the IO chassis.  The IO timing card is powered and does have what appear to be normal status lights on (except for the fiber link lights).  It's getting what I think is the nominal 4V power.  The connection to the IO chassis backplane board looks ok.  So maybe it's just a dead fiber issue?

I do not know what could have been the problem with c1auxex, or if it's related to the fast timing issue.

 Steve and Koji looked around, and called around, and there seem to be no spare fibers that are long enough to reach the end, so Steve has ordered

"Tripp Lite N520-30M 100' Multimode Duplex 50/125 Fiber Optic Patch Cable LC/LC"

 and it should be here tomorrow.

  9428   Wed Nov 27 14:45:49 2013   Jenne   Update   CDS   timing problem at c1iscex IO chassis

 [Koji, Jenne]

The new fiber arrived today, and we tried it out.  No luck.  We think it is the timing card, so we'll need to get one, since we can't find a spare.

Order of operations:

* Lay new fiber on floor, plugged it in at both ends, saw no fiber link lights.

* From control room, killed all models running on c1iscex, shutdown computer.  Still no link lights.

* Power cycled computer and IO chassis.

* Tried plugging new fiber into different port on Master Timing Sequencer, with other end still plugged in to c1iscex.  Still no link lights.

* Looked around with flashlight at Xend IO chassis.  The board that the fiber is connected to does not have a power light, although the board next to it has 2.  We compared with the SUS IO chassis, and the board there with the fiber has one power light, plus the fiber link lights, as well as 2 on the board next to the fiber.  So, perhaps there's a problem with power distribution on the timing board at the Xend? 

* Unplugged and replugged the power connector to the timing board, inside the IO chassis.  The board next to the fiber's board got its lights back, but the fiber's board did not.  However, power must be going through the board with the fiber attached to reach the next board, so there's power on at least some part of the timing board, just not the whole thing.

From this, we conclude that the blue fiber that was in place is probably fine (or is not found guilty), and that we need a replacement timing board.  Koji didn't find one in the "CDS stuff" boxes underneath the Jenne Laser, and I feel like I recall Jamie saying that we would have to get a spare from somewhere else.  We rolled up the new spare fiber, and put it in the box with other "CDS Stuff" under the Jenne Laser table.

  9429   Wed Nov 27 16:29:21 2013   Jenne   Update   CDS   Accidentally turned off SUS IO chassis

[Jenne, Koji]

I was trying to lock the Yarm, and saw that I was not getting signals to go between the LSC and SCY models.  I had digital zeros for TRY, and when I overrode the trigger and tried to force signal to ETMY, I had digital zeros at the SUS-ETMY_LSC input. The corresponding filter bank in the rfm model was receiving signals, so the Dolphin connection between LSC and SUS was okay, it was just the RFM connection going to the end station that wasn't succeeding. 

Koji restarted the c1scy model, and then went inside the IFO room, and found that the SUS IO chassis power was off.  We must have accidentally turned it off while we were in there earlier.  Koji turned on the power, and also restarted the rfm model, and we now have real signals going back and forth. 

Yarm is locked, ASS worked nicely, etc, etc, so things seem normal again (with the Yarm....ETMX stuff is still out of order).

  9432   Mon Dec 2 14:24:10 2013   Steve   Update   CDS   computer problems

Rack 1x6 is very noisy.

 SunFire X4600 computer: FB (directly below Megatron) has its yellow warning light on. It must be losing one of its fan bearings.

 

Jetstore's error message: IDE channel #2 reading error

Attachment 1: c1iscex.png
Attachment 2: 1X6.JPG
  9433   Mon Dec 2 16:04:47 2013   Jamie   Update   CDS   c1iscex timing problem mysteriously disappears??? (thanksgiving miracle???)

Quote:

There is definitely a timing distribution malfunction at the c1iscex IO chassis.  There is no timing link between the "Master Timer Sequencer D050239" at the 1X6 and the c1iscex IO chassis.  Link lights at both ends are dead.  No timing, no running models.

It does not appear to be a problem with the Master Timer Sequencer.  I moved the c1iscey link to the J15 port on the sequencer and it worked fine.  This means it's either a problem with the fiber or the timing card in the IO chassis.  The IO timing card is powered and does have what appear to be normal status lights on (except for the fiber link lights).  It's getting what I think is the nominal 4V power.  The connection to the IO chassis backplane board looks ok.  So maybe it's just a dead fiber issue?

I do not know what could have been the problem with c1auxex, or if it's related to the fast timing issue.

I just got over here from Downs, where I managed to convince Todd to let me borrow one of their three remaining timing slave boards for c1iscex.  I walked down to the X end to replace the board only to discover that the link light on the existing timing board was back!  c1iscex was not responding, so I hard rebooted the machine, and everything came up rosy (all green!):

festatus.png

To repeat, I DID NOTHING.  The thing was working when I got here.  I have no idea when it came back, or how, but it's at least working for the moment.  I re-enabled the watchdog for ETMX SUS and it's now damped normally.

I'm going to hold on to the timing card for a couple of days, in case the failure comes back, but we'll need to return it to Downs soon, and probably think about getting some spare backups from Columbia.

  9434   Mon Dec 2 17:05:13 2013   Jenne   Update   CDS   c1iscex timing problem mysteriously disappears??? (thanksgiving miracle???)

Steve was trying to do something to it this morning, but I'm not exactly clear on what it was.  Maybe that helped?  Steve, can you tell us what you were trying to do this morning?

  9435   Tue Dec 3 07:42:23 2013   Steve   Update   CDS   c1iscex timing problem mysteriously disappears??? (thanksgiving miracle???)

Quote:

Steve was trying to do something to it this morning, but I'm not exactly clear on what it was.  Maybe that helped?  Steve, can you tell us what you were trying to do this morning?

 I was trying to repeat elog 9007.  I only got to line 2 of the Solution by Koji when Ottavia, where I was working, shut down.  This was all that I did.

  9436   Tue Dec 3 17:08:06 2013   Koji   Update   CDS   computer problems

It seems that the front fan unit was running at full speed. The fan itself seems still OK.

I talked with Jamie and power cycled the machine (i.e. shut down gracefully, unplugged the power supply cables (x4), plugged them in again, and pushed the power button).

The warning signal went off and the fan is quiet. FOR NOW.

Now, daqd and nds are down.

FB cannot mount /opt/rtcds and /cvs/cds during its boot.

After mounting these manually, I tried to run /opt/rtcds/caltech/c1/target/fb/start_daqd.inittab and /opt/rtcds/caltech/c1/target/fb/start_nds.inittab
but they don't keep running.

I'll be back to this issue tomorrow with Jamie's help.

  9437   Wed Dec 4 12:02:39 2013   Koji   Update   CDS   FB restored

Now FB is fixed: daqd and nds are running


When I rebooted FB, I noticed that none of the nfs file systems were mounted.
I started tracking down the issues from here.

I googled the common issues of the nfs mounting during the boot sequence.
- It is good to give the "_netdev" option in fstab so that the file system is mounted only after the network connection is established.

- The "auto" option specifies that the file system is mounted when mount -a is run.

Resulting /etc/fstab is this:

/dev/sdb1                            /            ext3    noatime                    0 1
/swapfile                            none         swap    sw                         0 0
shm                                  /dev/shm     tmpfs   nodev,nosuid,noexec        0 0
/dev/sda1                            /frames      ext3    noatime                    0 0
linux1:/home/cds/                    /cvs/cds     nfs     _netdev,auto,rw,bg,soft    0 0
linux1:/home/cds/rtcds               /opt/rtcds   nfs     _netdev,auto,rw,bg,soft    0 0
linux1:/home/cds/rtapps              /opt/rtapps  nfs     _netdev,auto,rw,bg,soft    0 0
linux1:/home/cds/caltech/apps/linux  /opt/apps    nfs     _netdev,auto,rw,bg,soft    0 0
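
A quick way to confirm that every NFS line in an fstab like the one above carries a given option is to parse the file and list the mount points that lack it. This is a sketch only; the function name `nfs_entries_missing` and the sample (a subset of the table above) are assumptions for illustration, and it checks only that the option is present, not that the mount works.

```python
# Subset of the fstab shown above.
FSTAB_SAMPLE = """\
/dev/sdb1                            /            ext3    noatime                    0 1
/dev/sda1                            /frames      ext3    noatime                    0 0
linux1:/home/cds/                    /cvs/cds     nfs     _netdev,auto,rw,bg,soft    0 0
linux1:/home/cds/rtcds               /opt/rtcds   nfs     _netdev,auto,rw,bg,soft    0 0
"""


def nfs_entries_missing(fstab_text, option="_netdev"):
    """Return mount points of NFS entries that lack the given option."""
    missing = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete fstab entry
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type.startswith("nfs") and option not in options.split(","):
            missing.append(mount_point)
    return missing
```

On the real machine the text would come from reading `/etc/fstab`; an empty result means all NFS entries have the option.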

But this alone did not get the nfs file systems mounted at boot. I dug into google again and found the command "/sbin/rc-update".
"/sbin/rc-update show" shows which services are activated at boot. The list did not include "nfsmount", so the following command
was executed:

 

> sudo /sbin/rc-update add nfsmount boot

> /sbin/rc-update show

* Broken runlevel entry: /etc/runlevels/boot/portmap
            bootmisc | boot                         
             checkfs | boot                         
           checkroot | boot                         
               clock | boot                         
         consolefont | boot                         
               dcron |      default                 
               dhcpd |      default                 
            hostname | boot                         
            in.tftpd | boot                         
             keymaps | boot                         
               local |      default nonetwork       
          localmount | boot                         
             modules | boot                         
               monit |      default                 
                  mx |      default                 
            net.eth0 |      default                 
              net.lo | boot                         
            netmount |      default                 
                 nfs | boot                         
            nfsmount | boot                         
          ntp-client | boot default                 
           rmnologin | boot                         
           rpc.statd | boot                         
                sshd | boot                         
           syslog-ng | boot                         
      udev-postmount |      default                 
             urandom | boot                         
              xinetd |      default

After rebooting, I confirmed that the nfs file systems are correctly mounted
and daqd and nds are automatically started.

This means that FB had never been configured to run correctly at boot. Shame on you!
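
A quick audit of an fstab like the one above can be sketched in Python. This is only an illustrative sketch, not something that was run on fb: the function name nfs_entries_missing_option is hypothetical, and the third sample line is deliberately altered (its "_netdev" and "auto" options are removed) so that the check has something to report.

```python
def nfs_entries_missing_option(fstab_text, option="_netdev"):
    """Return mount points of nfs entries whose option list lacks `option`."""
    missing = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete fstab entry
        device, mount_point, fs_type, options = fields[:4]
        if fs_type == "nfs" and option not in options.split(","):
            missing.append(mount_point)
    return missing


# Sample text; the last line is intentionally missing _netdev for illustration.
FSTAB = """\
/dev/sda1                /frames     ext3  noatime                  0 0
linux1:/home/cds/        /cvs/cds    nfs   _netdev,auto,rw,bg,soft  0 0
linux1:/home/cds/rtcds   /opt/rtcds  nfs   rw,bg,soft               0 0
"""

print(nfs_entries_missing_option(FSTAB))
```

Note that such a check only covers the fstab side. As this entry shows, the mounts also depend on the nfsmount service being enabled in the boot runlevel.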

  9438   Wed Dec 4 13:37:34 2013 JenneUpdateCDSc1auxex down again

Quote:

1) c1auxex - fixed

Tried telnet c1auxex => rejected by the host

Went down to the south end. Power cycled the target. Came back to the control room.
=> Confirmed the epics read/write is back.
Burtrestored the epics vars for the target to the snapshot on 31st Oct at 5:07.

 When I came in this morning, in addition to the fb being unhappy [elog 9436] (which Koji later fixed [elog 9437] ), c1auxex was down / not talking to the world nicely. 

I tried telnet-ing, but was rejected, so EricQ and I went down to the Xend and pushed the reset button on the computer.  The computer came back up just fine, and I did a burt restore to 03:07 on Nov 30th.

  9441   Wed Dec 4 21:33:24 2013 KojiUpdateCDSc1scy time-over issue mitigated

c1scy had frequent time-over. This caused the glitches of the OSEM damping servos.

Today Eric Q was annoyed by the glitches while he worked on the green PDH inspection at the Y-end.

In order to mitigate this issue, low priority RFM channels are moved from c1scy to c1tst.
The moved channels (see Attachment 1) are supposed to be less susceptible to the additional delay.

This modification required the following models to be modified, recompiled, reinstalled, and restarted
in the listed order:
c1als, c1sus, c1rfn, c1tst, c1scy

Now the models are running. CDS status is all green.
The time consumption of c1scy is now ~30us (previously ~60us)
(see Attachment 2)

I am looking at the cavity lock of TEM00 and I have witnessed no glitch any more.
In fact, the OSEM signals have no glitch. (see Attachment 3)

c1mcs still has regular time-overs. Can I remove the WFS->OAF connections temporarily?

Attachment 1: TST.png
Attachment 2: CDS.png
Attachment 3: no_glitch.png
  9444   Thu Dec 5 12:20:09 2013 JenneUpdateCDSfixing c1mcs timing - go for it

Quote:

c1mcs still has regular time-overs. Can I remove the WFS->OAF connections temporarily?

 Yes.  I think that's a good idea, since we're not using them at this time, and we want c1mcs to behave.  Maybe make an elog note of which version is the first without them, so that we can conveniently find the model(s) in the svn?

  9445   Thu Dec 5 16:20:26 2013 SteveUpdateCDSglitches are gone

Quote:

c1scy had frequent time-over. This caused the glitches of the OSEM damping servos.

Today Eric Q was annoyed by the glitches while he worked on the green PDH inspection at the Y-end.

In order to mitigate this issue, low priority RFM channels are moved from c1scy to c1tst.
The moved channels (see Attachment 1) are supposed to be less susceptible to the additional delay.

This modification required the following models to be modified, recompiled, reinstalled, and restarted
in the listed order:
c1als, c1sus, c1rfn, c1tst, c1scy

Now the models are running. CDS status is all green.
The time consumption of c1scy is now ~30us (previously ~60us)
(see Attachment 2)

I am looking at the cavity lock of TEM00 and I have witnessed no glitch any more.
In fact, the OSEM signals have no glitch. (see Attachment 3)

c1mcs still has regular time-overs. Can I remove the WFS->OAF connections temporarily?

 Koji cleaned up very nicely.

Attachment 1: glitchesGONE.png
Attachment 2: NOglitches.png
  9450   Sat Dec 7 19:29:14 2013 KojiUpdateCDSMEDM/ADL: replace rectangle with specified objects

In order to unify the theme for MEDM screens, I'll have to make many combinations of rectangles, polygons, and invisible related-screen buttons.
Every time I change the size of a block, I have to modify the size of this combination. It is impossible for me.

So, I made a script to replace a certain type of rectangles with a combination of the objects with the same (or related) size.

The script is here (so far)

/opt/rtcds/caltech/c1/medm/c1lsc/master/generateLSCscreen/rect_replace.py

Usage:

cat C1LSC_OVERVIEW_ADC.adl | ./rect_replace.py > tmp.adl

Description:

The script takes stdin and spits the result to stdout. It parses a given ADL file. When it finds a "rectangle" object
whose Channel A contains a string "REPLACE_XXXX", it replaces it with the objects predefined as "XXXX".

So far, there is only "TYPE1" for the predefinition. It actually takes another argument to specify the
path of the related screen to open when the block is clicked. The path should be filled in "Channel B"
slot of the original rectangle. (See Attachment 1)

The "TYPE1" style has the function calls as indicated below. place_rect is to place a rectangle object. You can
specify the filling method and color. place_rel_disp is to place an invisible button with the link to the specified
path by strOpt. place_polygon places a polygon. The coordinate array for the polygon is described with
the relative positions from the specified position.

        place_rect(rect_x-4,         rect_y-4,        rect_w+7, rect_h+7, "outline",  0) # outline white box
        place_rel_disp(rect_x, rect_y, rect_w, rect_h, strOpt, 0, 14)                    # invisible button
        place_rect(rect_x,           rect_y,          rect_w,   rect_h,   "fill",     3) # main gray box
        place_rect(rect_x+3,         rect_y,          rect_w-6, 3,        "fill",     0) # top white rim
        place_rect(rect_x,           rect_y,          3,        rect_h-3, "fill",     0) # left white rim
        place_rect(rect_x+rect_w-3 , rect_y,          3,        rect_h,   "fill",    10) # right gray rim
        place_rect(rect_x,           rect_y+rect_h-3, rect_w-3, 3,        "fill",    10) # bottom gray rim
        place_polygon(rect_x+rect_w-3,rect_y,3,3, "fill",  0, [[0,0],[2,0],[0,2],[0,0]]) # top-right white triangle
        place_polygon(rect_x,rect_y+rect_h-3,3,3, "fill",  0, [[0,0],[2,0],[0,2],[0,0]]) # bottom-left white triangle
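
The point of the calls above is that every decoration is derived from the original rectangle's position and size, so resizing a block needs no manual rework. A minimal Python sketch of that geometry, with a hypothetical helper type1_rects that only returns the rectangle coordinates (it does not write ADL and is not part of rect_replace.py):

```python
def type1_rects(x, y, w, h):
    """Return (label, x, y, width, height) for the rectangles of the TYPE1 style."""
    return [
        ("outline white box", x - 4, y - 4, w + 7, h + 7),
        ("main gray box", x, y, w, h),
        ("top white rim", x + 3, y, w - 6, 3),
        ("left white rim", x, y, 3, h - 3),
        ("right gray rim", x + w - 3, y, 3, h),
        ("bottom gray rim", x, y + h - 3, w - 3, 3),
    ]


# Example: an 80x40 block at (100, 50)
for row in type1_rects(100, 50, 80, 40):
    print(row)
```

The offsets are copied from the place_rect arguments listed above. The invisible button and the two corner triangles are left out to keep the sketch short.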

Attachment 1: rectangle_config.png
Attachment 2: rect_replace_result.png
  9466   Fri Dec 13 13:45:50 2013 KojiUpdateCDSMEDM/ADL: replace rectangle with specified objects

rect_replace.py script was updated.
This sounds crazy, but it was actually quite easy as I could use free font data.


/opt/rtcds/caltech/c1/medm/c1lsc/master/generateLSCscreen/rect_replace.py

Usage:

cat C1LSC_OVERVIEW_ADC.adl | ./rect_replace.py > tmp.adl

Description:

The script takes stdin and spits the result to stdout. It parses a given ADL file. When it finds a "rectangle" object
whose Channel A contains a string "REPLACE_XXXX", it replaces it with the objects predefined as "XXXX".

Now a new type "CHAR" (i.e. REPLACE_CHAR) has been added. This replaces the rectangle with a 5x7 dot matrix
representation of the string in the Channel B slot, drawn in the specified color. The dot size is derived from the
height of the original rectangular object.
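
How a 5x7 glyph could be turned into dot positions can be sketched as follows. Everything here is an assumption made for illustration: the bitmap for "T" is hand-made (the actual font data used by the script is not reproduced in this entry), and taking the dot size as the rectangle height divided by 7 is only a guess at how "derived from the height" works.

```python
# Hypothetical 5x7 bitmap for the letter "T" ('#' = dot on, '.' = dot off).
GLYPH_T = [
    "#####",
    "..#..",
    "..#..",
    "..#..",
    "..#..",
    "..#..",
    "..#..",
]


def dot_positions(glyph, x, y, height, rows=7):
    """Return (dot_size, [(dot_x, dot_y), ...]) for one character."""
    dot = height // rows  # assumption: integer dot size from the rectangle height
    dots = [
        (x + col * dot, y + row * dot)
        for row, bits in enumerate(glyph)
        for col, bit in enumerate(bits)
        if bit == "#"
    ]
    return dot, dots


size, dots = dot_positions(GLYPH_T, x=10, y=20, height=21)
print(size, len(dots))
```

Each returned position would then become one small filled rectangle of the given dot size.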

 

Attachment 1: screen_shot.png
  9494   Thu Dec 19 14:40:42 2013 KojiUpdateCDSRFM Time over mitigation for c1mcs

I worked on the mitigation of c1mcs time-over issue this afternoon.

The timing for the c1mcs is successfully reduced from >60us to 45us.


The previous models are svned in redoubt as follows:

MCS rev. 6696
RFM rev. 6697
IOO rev. 6698

What I changed was:

- Remove connection from ALS (on c1ioo) to MCS (on c1sus). This should be all done in LSC. (# of RFM IPC in MCS -1)

- MC2 trans QPD filters are moved from IOO to MCS to reduce the RFM channels in MCS.
  Previously the signals for the 4 segments were sent. Now the processed signals (pit/yaw/sum) are sent. (# of RFM IPC in IOO -1, MCS -1)

- WFS MC3 feedback channels are moved from MCS to RFM to distribute the RFM channels (# of RFM IPC in MCS -2, in RFM +2)

model    prev. timing[us] current timing[us]  diff in time[us]  diff in ch#
c1mcs         >60                45                -15              -4
c1rfm         47                 53                + 6              +2       
c1ioo         47                 36                -11              -1

Revisions of the new models:
MCS rev. 6702
RFM rev. 6701
IOO rev. 6700

  9498   Fri Dec 20 00:16:39 2013 KojiSummaryCDSRCG parsing bug?

A while ago, I noticed that the most significant bits of the LSC whitening DOs are not toggling.
I tracked this issue down and found what is happening. I need experts' help.


To illuminate the issue, terminators are connected to Bit15 of the Bit2Word blocks in the LSC model (attached screen shots).

The corresponding source file is found in c1lsc.c at the following location.
The last channels of the Bit2Word are connected to lsc_cm_slow (the filter module).
This is the source of the issue. This wrong assignment of the connections
can't be changed by connecting Go-From tags to the channels.

/opt/rtcds/caltech/c1/rtbuild/src/fe/c1lsc/c1lsc.c

3881// Bit2Word:  LSC_cdsBit2Word1                                                                                                              
3882{
3883double ins[16] = {
3884        lsc_as110_logicaloperator4,
3885        lsc_as110_logicaloperator1,
3886        lsc_refl11_logicaloperator4,
3887        lsc_refl11_logicaloperator1,
3888        lsc_pox11_logicaloperator4,
3889        lsc_pox11_logicaloperator1,
3890        lsc_poy11_logicaloperator4,
3891        lsc_poy11_logicaloperator1,
3892        lsc_refl33_logicaloperator4,
3893        lsc_refl33_logicaloperator1,
3894        lsc_pop22_logicaloperator4,
3895        lsc_pop22_logicaloperator1,
3896        lsc_pop110_logicaloperator4,
3897        lsc_pop110_logicaloperator1,
3898        lsc_as165_logicaloperator4,
3899        lsc_cm_slow
3900};
3901lsc_cdsbit2word1 = 0;
3902for (ii = 0; ii < 16; ii++)
3903{
3904if (ins[ii]) {
3905lsc_cdsbit2word1 += powers_of_2[ii];
3906}
3907}
3908}

3946// Bit2Word:  LSC_cdsBit2Word2                                                                                                              
3947{                                                                                                                                           
3948double ins[16] = {                                                                                                                          
3949        lsc_as55_logicaloperator4,                                                                                                          
3950        lsc_as55_logicaloperator1,                                                                                                          
3951        lsc_refl55_logicaloperator4,                                                                                                        
3952        lsc_refl55_logicaloperator1,                                                                                                        
3953        lsc_pop55_logicaloperator4,                                                                                                         
3954        lsc_pop55_logicaloperator1,                                                                                                         
3955        lsc_refl165_logicaloperator4,                                                                                                       
3956        lsc_refl165_logicaloperator1,                                                                                                       
3957        lsc_logicaloperator_cm_ctrl,                                                                                                        
3958        ground,                                                                                                                             
3959        ground,                                                                                                                             
3960        lsc_logicaloperator_popdc,                                                                                                          
3961        lsc_logicaloperator_poydc,                                                                                                          
3962        lsc_logicaloperator_poxdc,                                                                                                          
3963        lsc_logicaloperator_refldc,                                                                                                         
3964        lsc_cm_slow                                                                                                                         
3965};                                                                                                                                          
3966lsc_cdsbit2word2 = 0;                                                                                                                       
3967for (ii = 0; ii < 16; ii++)                                                                                                                 
3968{                                                                                                                                           
3969if (ins[ii]) {                                                                                                                              
3970lsc_cdsbit2word2 += powers_of_2[ii];
3971}
3972}
3973}
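
To make the effect of the wrong last entry concrete, here is a small Python sketch that mirrors the packing loop of the generated code above. The input values are made up for illustration. It only shows that whatever signal ends up in slot 15 alone decides the most significant bit.

```python
def bit2word(ins):
    """Pack up to 16 truthy/falsy inputs into an integer, input 0 being the LSB."""
    word = 0
    for ii, value in enumerate(ins[:16]):
        if value:
            word += 2 ** ii  # same role as powers_of_2[ii] in the C code
    return word


# Inputs 0..14 are off; only the value wired to slot 15 differs.
intended = [0] * 15 + [1]  # the signal that should drive Bit15 is on
actual = [0] * 15 + [0]    # the wrongly assigned signal happens to be zero

print(hex(bit2word(intended)))  # 0x8000
print(hex(bit2word(actual)))    # 0x0
```

This is consistent with the symptom that only the most significant bits of the whitening DOs fail to toggle while the lower 15 bits behave.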

 

Attachment 1: Bit2Word1.png
Attachment 2: Bit2Word2.png
  9503   Fri Dec 20 11:40:13 2013 JamieSummaryCDSRCG parsing bug?

I submitted a bug report for this:

https://bugzilla.ligo-wa.caltech.edu/bugzilla3/show_bug.cgi?id=553

However, given how old our RCG version is (2.5 vs. 2.8 current deployed at the sites) I don't think we're going to see any traction on this.  Even if this is still a bug in 2.8, they'll only fix it in 2.8.  There's no way they're going to make a bug fix release for 2.5.

We need to upgrade.

  9505   Fri Dec 20 18:00:02 2013 KojiSummaryCDSRCG parsing bug?

The bug is still there but the incorrect bits are now overridden.

Attachment 1: Screenshot-c1lsc-LSC.png
  9530   Tue Jan 7 22:44:45 2014 JenneUpdateCDSdaqd on fb is segfaulting every ~30 seconds

The daqd process is segfaulting and restarting itself every 30 seconds or so.  It's pretty frustrating. 

Just for kicks, I tried an mxstream restart, clearing the testpoints, and restarting the daqd process, but none of these things changed anything.  

Manasa found an elog from a year ago (elog 7105 and preceding), but I'm not sure that it's a similar / related problem.  Jamie, please help us!

Here is a screen dump from the "dtail":

Every 1.0s: dmesg | tail -50                                                                                                                         Tue Jan  7 22:43:23 2014

[   33.498691]  [<ffffffff8104a063>] kthread+0x7a/0x82
[   33.498695]  [<ffffffff81003654>] kernel_thread_helper+0x4/0x10
[   33.498698]  [<ffffffff81049fe9>] ? kthread+0x0/0x82
[   33.498701]  [<ffffffff81003650>] ? kernel_thread_helper+0x0/0x10
[   33.498703] ---[ end trace 6236defa99b3e091 ]---
[   33.498705] mx INFO: Board 0: allocated MSI IRQ 67
[   33.498713] mx INFO: CPU0: PAT = 0x7010600070106
[   33.498715] mx INFO: CPU0: new PAT = 0x1010600070106
[   33.498718] mx INFO: Board 0: Using PAT index 6
[   33.499101] eth0: no IPv6 routers present
[   33.531013] mx INFO: Board 0: device 8, rev 0, 1 ports and 2096896 bytes of SRAM available
[   33.531017] mx INFO: Board 0: Bridge is 10de:005d
[   33.531228] mx INFO: Board 0: MAC address = 00:60:dd:46:ea:ec
[   33.535971] mx INFO: Loaded mcp of len 235448
[   34.489244] mx INFO: Starting usermode mapper at /opt/mx/sbin/mx_start_mapper
[   39.148855] mx INFO: mx0: Link0 is UP
[   39.588511] mx INFO: myri0: Will use skbuf frags (4096 bytes, order=0)
[   39.589299] mx INFO: 1 Myrinet board found and initialized
[  287.706367] daqd used greatest stack depth: 3368 bytes left
[86605.907520] daqd[18407]: segfault at 38b08e4c0 ip 00007f11b3942a6c sp 00007f10b1917d50 error 4
[86605.907530] daqd[18424]: segfault at 38b544f90 ip 00007f11b3942a6c sp 00007f10b12c6d30 error 4 in libc-2.10.1.so[7f11b390e000+14c000] in libc-2.10.1.so[7f11b390e000+14c000]
[86605.907544]
[86605.919454] daqd[21319] general protection ip:7f11b3942a6c sp:7f10b1814d30 error:0
[86605.919462] daqd[18442] general protection ip:7f11b3942a6c sp:7f10b0bf4d30 error:0
[86605.919615] daqd[18443]: segfault at 38aee3db0 ip 00007f11b3942a6c sp 00007f10b0b73d50 error 4 in libc-2.10.1.so[7f11b390e000+14c000]
[86605.919694] daqd[18412]: segfault at 38aff35d0 ip 00007f11b3942a6c sp 00007f10b1752d30 error 4
[86605.919701] daqd[18417]: segfault at 38b544f70 ip 00007f11b3942a6c sp 00007f10b154dd50 error 4 in libc-2.10.1.so[7f11b390e000+14c000]
[86605.919708] daqd[18445]: segfault at 38aff35b0 ip 00007f11b3942a6c sp 00007f10b0ab1d50 error 4
[86605.919733] daqd[18429]: segfault at 38b42ae90 ip 00007f11b3942a6c sp 00007f10b10c1d50 error 4 in libc-2.10.1.so[7f11b390e000+14c000]
[86605.919741] daqd[18440]: segfault at 38b08e480 ip 00007f11b3942a6c sp 00007f10b0cb6d30 error 4 in libc-2.10.1.so[7f11b390e000+14c000]
[86605.958551]  in libc-2.10.1.so[7f11b390e000+14c000] in libc-2.10.1.so[7f11b390e000+14c000]
[86605.958557]
[86605.958577]  in libc-2.10.1.so[7f11b390e000+14c000]
[86605.958586]  in libc-2.10.1.so[7f11b390e000+14c000]
[86605.959639] daqd used greatest stack depth: 3160 bytes left
[98139.100888] show_signal_msg: 13 callbacks suppressed
[98139.100895] daqd[23753]: segfault at 39c7363b0 ip 00007f5bf253ba6c sp 00007f5b69b48d30 error 4 in libc-2.10.1.so[7f5bf2507000+14c000]
[98687.815120] daqd used greatest stack depth: 2984 bytes left
[208995.594227] daqd[10386] general protection ip:7f3b7c930a6c sp:7f3a79f09d50 error:0 in libc-2.10.1.so[7f3b7c8fc000+14c000]
[353015.067479] daqd used greatest stack depth: 2880 bytes left
[367406.863618] daqd[13078]: segfault at 41 ip 0000000000000041 sp 00007fb1f0ba2cf8 error 14 in daqd[400000+7c000]
[367406.863833] daqd[13104] general protection ip:7fb2f3018a6c sp:7fb1f01c8d30 error:0
[367406.863877] daqd[13086] general protection ip:7fb2f3018a6c sp:7fb1f089ad30 error:0
[367406.877408] daqd[13080]: segfault at 41 ip 0000000000000041 sp 00007fb1f0ae0ca8 error 14 in daqd[400000+7c000]
[367406.877435]  in libc-2.10.1.so[7fb2f2fe4000+14c000]
[367406.877442] daqd[13100]: segfault at 39ba287b0 ip 00007fb2f3018a6c sp 00007fb1f034cd30 error 4 in libc-2.10.1.so[7fb2f2fe4000+14c000]
[367406.878372]  in libc-2.10.1.so[7fb2f2fe4000+14c000]
[399802.887523] daqd[18295] general protection ip:7fb056a71a6c sp:7faf96125f10 error:0 in libc-2.10.1.so[7fb056a3d000+14c000]
[410595.969327] daqd[22057]: segfault at 3a91f27b0 ip 00007f48e96eea6c sp 00007f47e6c26d50 error 4 in libc-2.10.1.so[7f48e96ba000+14c000]
[410595.988926] daqd[22068]: segfault at 3a91f2790 ip 00007f48e96eea6c sp 00007f47e681bd30 error 4 in libc-2.10.1.so[7f48e96ba000+14c000]
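
A dump like this is easier to read once the daqd faults are grouped. The following Python sketch is one way to do that, assuming the dmesg line format shown above. The function name summarize_faults is hypothetical and the sample lines are copied from the dump. The bracketed numbers are the kernel timestamps in seconds since boot.

```python
import re
from collections import Counter

# Matches lines such as "[86605.907520] daqd[18407]: segfault at ..."
# and "[86605.919454] daqd[21319] general protection ip:...".
FAULT = re.compile(
    r"^\[\s*(?P<t>\d+\.\d+)\]\s+daqd\[(?P<pid>\d+)\]:?\s+"
    r"(?P<kind>segfault|general protection)"
)


def summarize_faults(lines):
    """Count daqd faults per kind and list the whole seconds in which they occurred."""
    kinds = Counter()
    times = set()
    for line in lines:
        m = FAULT.match(line)
        if m:
            kinds[m.group("kind")] += 1
            times.add(int(float(m.group("t"))))
    return kinds, sorted(times)


SAMPLE = [
    "[86605.907520] daqd[18407]: segfault at 38b08e4c0 ip 00007f11b3942a6c sp 00007f10b1917d50 error 4",
    "[86605.919454] daqd[21319] general protection ip:7f11b3942a6c sp:7f10b1814d30 error:0",
    "[98139.100895] daqd[23753]: segfault at 39c7363b0 ip 00007f5bf253ba6c sp 00007f5b69b48d30 error 4 in libc-2.10.1.so[7f5bf2507000+14c000]",
    "[98687.815120] daqd used greatest stack depth: 2984 bytes left",
]

kinds, times = summarize_faults(SAMPLE)
print(dict(kinds), times)
```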

  9531   Tue Jan 7 23:08:01 2014 jamieUpdateCDS/frames is full, causing daqd to die

Quote:

The daqd process is segfaulting and restarting itself every 30 seconds or so.  It's pretty frustrating. 

Just for kicks, I tried an mxstream restart, clearing the testpoints, and restarting the daqd process, but none of these things changed anything.  

Manasa found an elog from a year ago (elog 7105 and preceding), but I'm not sure that it's a similar / related problem.  Jamie, please help us

The problem is not exactly the same as what's described in 7105, but the symptoms are so similar I assumed they must have a similar source.

And sure enough, /frames is completely full:

controls@fb /opt/rtcds/caltech/c1/target/fb 0$ df -h /frames/
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              13T   13T     0 100% /frames
controls@fb /opt/rtcds/caltech/c1/target/fb 0$

So the problem in both cases was that it couldn't write out the frames.  Unfortunately daqd is apparently too stupid to give us a reasonable error message about what's going on.

So why is /frames full?  Apparently the wiper script is either not running, or is failing to do its job.  My guess is that this is a side effect of the linux1 raid failure we had over xmas.

  9533   Tue Jan 7 23:13:47 2014 jamieUpdateCDS/frames is full, causing daqd to die

Quote:

So why is /frames full?  Apparently the wiper script is either not running, or is failing to do its job.  My guess is that this is a side effect of the linux1 raid failure we had over xmas.

It actually looks like the wiper script has been running fine.  There is a log from Tuesday morning:

controls@fb ~ 0$ cat /opt/rtcds/caltech/c1/target/fb/wiper.log

Tue Jan  7 06:00:02 PST 2014

Directory disk usage:
/frames/trend/minute_raw 385289132k
/frames/trend/second 100891124k
/frames/full 12269554048k
/frames/trend/minute 1906772k
Combined 12757641076k or 12458633m or 12166Gb

/frames size 13460088620k at 94.78%
/frames is below keep value of 95.00%
Will not delete any files
df reported usage 97.72%
controls@fb ~ 0$

So now I'm wondering if something else has been filling up the frames today.  Has anything changed today that might cause more data than usual to be written to frames?

I'm manually running the wiper script now to clear up some /frames.  Hopefully that will solve the problem temporarily.
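
The percentages in the log can be reproduced from the sizes it prints. The Python sketch below redoes that arithmetic. It is not the wiper script's own logic, and the reading that the keep threshold is compared against the summed directory sizes rather than the df figure is an inference from the log output only.

```python
def usage_percent(used_k, size_k):
    """Disk usage as a percentage, rounded to two decimals like the log."""
    return round(100.0 * used_k / size_k, 2)


FRAMES_SIZE_K = 13460088620  # "/frames size" from the log
COMBINED_K = 12757641076     # "Combined" directory usage from the log
KEEP_PERCENT = 95.00         # keep value from the log
DF_PERCENT = 97.72           # "df reported usage" from the log

morning = usage_percent(COMBINED_K, FRAMES_SIZE_K)
print(morning, morning > KEEP_PERCENT)  # 94.78 False -> "Will not delete any files"
print(round(DF_PERCENT - morning, 2))   # percentage points not covered by the four directories
```

The roughly 3 percentage point gap between the two figures means that df can approach 100% while the summed directory usage is still under the keep value.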

  9535   Tue Jan 7 23:50:27 2014 jamieUpdateCDS/frames space cleared up, daqd stabilized

The wiper script is done and deleted a whole bunch of stuff to clean up some space:

controls@fb ~ 0$ /opt/rtcds/caltech/c1/target/fb/wiper.pl --delete

Tue Jan  7 23:09:21 PST 2014

Directory disk usage:
/frames/trend/minute_raw 385927520k
/frames/trend/second 125729084k
/frames/full 12552144324k
/frames/trend/minute 2311404k
Combined 13066112332k or 12759875m or 12460Gb

/frames size 13460088620k at 97.07%
/frames above keep value of 95.00%
Frame area size is 12401156668k
/frames/full size 12552144324k keep 11781098835k
/frames/trend/second size 125729084k keep 24802313k
/frames/trend/minute size 2311404k keep 620057k
Deleting some full frames to free 771045488k
- /frames/full/10685/C-R-1068567600-16.gwf
- /frames/full/10685/C-R-1068567616-16.gwf
...
controls@fb ~ 0$ df -h /frames
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              13T   12T  826G  94% /frames
controls@fb ~ 0$
So it cleaned up 826G of space.  It looks like the fb is stabilized for the moment.  On site folks should confirm...

 

  9536   Tue Jan 7 23:53:35 2014 JamieUpdateCDSdaqd can't connect to c1vac1, c1vac2

daqd is logging the following error messages to its log, related to the fact that it can't connect to c1vac1 and c1vac2:

CAC: Unable to connect because "Connection timed out"
CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "c1vac2.martian:5064"
    Source File: ../cac.cpp line 1127
    Current Time: Tue Jan 07 2014 23:50:53.355609430
..................................................................
CAC: Unable to connect because "Connection timed out"
CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "c1vac1.martian:5064"
    Source File: ../cac.cpp line 1127
    Current Time: Tue Jan 07 2014 23:50:53.356568469
..................................................................

 Not sure if this is related to the full /frames issue that we've been seeing.

  9567   Wed Jan 22 18:17:46 2014 JenneUpdateCDSfb timing was off

Since this morning, the fb's timing has been off.  Steve pointed it out to me earlier today, but I didn't have a chance to look at it until now. 

This was different from the more common problem of the mx stream needing to be restarted - that causes 3 red blocks per core, on all cores on a computer, but it doesn't have to be every computer.  This was only one red block per core in the CDS FE status screen, but it was on every core on every computer. 

The error message, when you click into the details of a single core, was 0x4000.  I elog searched for that, and found elog 6920, which says that this is a timing issue with the frame builder.  Since Jamie had already set things on nodus' config correctly, all I did was reconnect the fb to the ntp: 

fb$ sudo /etc/init.d/ntp-client restart

As in elog 6920, the daqd stopped, then restarted itself, and cleared the error message. It looks like everything is good again.

I suspect (without proof) that this may have to do with the campus network being down this morning, so the computers couldn't sync up with the outside world.
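
The two symptom patterns described above can be written down as a small triage helper. This is only a sketch of the reasoning in this entry. The function name is hypothetical, and treating 0x4000 as a bit that can be masked out of a status word is an assumption (the entry only says the error message was 0x4000).

```python
TIMING_BIT = 0x4000  # status value quoted in this entry for the fb timing error


def suggest_fix(red_blocks_per_core, all_cores_on_all_computers, status_word):
    """Map the symptoms described in this entry to the restart that cleared them."""
    if red_blocks_per_core == 3:
        # mx stream problem: 3 red blocks per core, all cores on one computer
        return "restart mx stream on the affected computer"
    if (red_blocks_per_core == 1 and all_cores_on_all_computers
            and (status_word & TIMING_BIT)):
        # fb timing problem: 1 red block per core, everywhere
        return "restart ntp-client on fb"
    return "unknown: investigate further"


print(suggest_fix(1, True, 0x4000))
```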

  9587   Thu Jan 30 11:59:03 2014 manasaUpdateCDSfb timing was off

Quote:

Since this morning, the fb's timing has been off.  Steve pointed it out to me earlier today, but I didn't have a chance to look at it until now. 

This was different from the more common problem of the mx stream needing to be restarted - that causes 3 red blocks per core, on all cores on a computer, but it doesn't have to be every computer.  This was only one red block per core in the CDS FE status screen, but it was on every core on every computer. 

The error message, when you click into the details of a single core, was 0x4000.  I elog searched for that, and found elog 6920, which says that this is a timing issue with the frame builder.  Since Jamie had already set things on nodus' config correctly, all I did was reconnect the fb to the ntp: 

fb$ sudo /etc/init.d/ntp-client restart

As in elog 6920, the daqd stopped, then restarted itself, and cleared the error message. It looks like everything is good again.

I suspect (without proof) that this may have to do with the campus network being down this morning, so the computers couldn't sync up with the outside world.

The above timing problem has been repeating (a couple of times this week so far). It does not seem to be related to the campus network.

The same solution was applied.

  9618   Mon Feb 10 18:03:41 2014 jamieUpdateCDS12 core c1sus replacement

I have configured one of the spare Supermicro X8DTU-F chassis as a dual-CPU, 12-core CDS front end machine.  This is meant to be a replacement for c1sus.  The extra cores are so we can split up c1rfm and reduce the over-cycle problems we've been seeing related to RFM IPC delays.

I pulled the machine fresh out of the box, and installed the second CPU and additional memory that Steve purchased.  The machine seems to be working fine.  After assigning it a temporary IP address, I can boot it from the front-end boot server on the martian network.  It comes up cleanly with both CPUs recognized, and /proc/cpuinfo showing all 12 cores, and free showing 12 GB memory.

The plan is:

  1. pull the old c1sus machine from the rack
  2. pull OneStop, Dolphin, RFM cards from c1sus chassis
  3. install OneStop, Dolphin, RFM cards into new c1sus
  4. install new c1sus back in rack
  5. power everything on and have it start back up with no problems

Obviously all of this needs to be done at a time when it won't interfere with locking work.  fwiw, I am around tomorrow (Tuesday, 2/11), but will likely be leaving for LHO on Wednesday.

  9662   Mon Feb 24 13:40:13 2014 JenneUpdateCDSComputer weirdness with c1lsc machine

I noticed that the fb lights on all of the models on the c1lsc machine are red, and that even though the MC was locked, there was no light flashing in the IFO. Also, all of the EPICS values on the LSC screen were frozen.

Screenshot-Untitled_Window-1.png

I tried restarting the ntp server on the frame builder, as in elog 9567, but that didn't fix things.  (I realized later that the symptom there was a red light on every machine, while I'm just seeing problems with c1lsc.)

I did an mxstream restart, as a harmless thing that had some small hope of helping (it didn't). 

I logged on to c1lsc, and restarted all of the models (rtcds restart all), which stops all of the models (IOP last), and then restarts them (IOP first).  This did not change the status of the lights on the status screen, but it did change the positioning of some optics (I suspect the tip tilts) significantly, and I was again seeing flashes in the arms.  The LSC master enable switch was off, so I don't think that it was trying to send any signals out to the suspensions.  The ASS model, which sends signals out to the input pointing tip tilts, runs on c1lsc, and it was about when the ass model was restarted that the beam came back.  Also, there are no jumps in any of the SOS OSEM sensors in the last few hours, except me misaligning and restoring the optics.  We don't have sensors on the tip tilts, so I can't show a jump in their positioning, but I suspect them.

I called Jamie, and he suggested restarting the machine, which I did.  (Once again, the beam went somewhere, and I saw it scattering big-time off of something in the BS chamber, as viewed on the PRM-face camera).  This made the oaf and cal models run (I think they were running before I did the restart all, but they didn't come back after that.  Now, they're running again).  Anyhow, that did not fix the problem.  For kicks, I re-ran mxstream restart, and diag reset, to no avail.  I also tried running the sudo /etc/init.d/ntp-client restart command on just the lsc machine, but it doesn't know the command 'ntp-client'. 

Jamie suggested looking at the timing card in the chassis, to ensure all of the link lights are on, etc.  I will do this next.

  9663   Mon Feb 24 15:25:29 2014 JenneUpdateCDSComputer weirdness with c1lsc machine

The LSC machine isn't any better, and now c1sus is showing the same symptoms.  Lame.

The link lights on the c1lsc I/O chassis and on the fiber timing system are the same as all other systems.  On the timing card in the chassis, the light above the fibers was solid-on, and the light below blinks at 1pps. 

Koji and I power-cycled both the lsc I/O chassis, and the computer, including removing the power cables (after softly shutting down) so there was seriously no power.  Upon plugging back in and turning everything on, no change to the timing status.  It was after this reboot that the c1sus machine also started exhibiting symptoms. 

  9664   Mon Feb 24 16:26:14 2014 JenneUpdateCDSNTP fell out of sync on front end machines - fixed

[Koji, Jenne]

Koji noticed that the time on the front-end detail screens was not correct, and that the GPS time was not matching up between different models.  Koji ran the following on all front-end machines, and on nodus:

sudo ntpdate -b -s -u pool.ntp.org

Now, everything is fine, and every status light on the cds overview screen is green.

  9679   Wed Feb 26 23:14:07 2014 JenneUpdateCDSfb timing was off

....fb timing issue happened again.

I thought that it was the thing that Koji and I saw the other day, where it was individual front end computers that had lost ntp sync, since it wasn't every core on every computer that was red, but reconnecting to the ntp server on c1lsc didn't do anything.  I then tried reconnecting to the ntp server on fb, and that fixed things right up.  Annoying.

  9683   Mon Mar 3 10:42:53 2014   Jenne   Update   CDS   fb timing was off

...yet again.

lsc and sus needed mxstream restarts after I restarted the ntp on fb.

  9684   Mon Mar 3 11:55:39 2014   Koji   Update   CDS   fb timing was off

We need to correctly set up crontab or rc.local for the frontend machines.

  9706   Mon Mar 10 11:42:36 2014   Jenne   Update   CDS   fb timing was off

fb timing was off again.

  9732   Mon Mar 17 12:31:58 2014   manasa   Update   CDS   fb timing was off

Off again. Restarted ntp on fb.

  9786   Mon Apr 7 15:26:32 2014   jamie, ericq   Update   CDS   aborted attempt to update c1sus machine with second CPU

This morning we attempted to replace the c1sus front end machine with a spare that had been given a second CPU, and therefore 6 additional cores (for a total of 12).  The idea was to give c1sus more cores so that we could split up c1rfm into two separate models that would not be running on the hairy edge of their cycle time allocation.  Well, after struggling to get it working we eventually aborted and put the old machine back in.

The problem was that the c1sus model was running erratically, frequently jumping up to 100 usec of a 60 usec clock allocation.  We eventually tracked the problem down to the fact that the CPUs in the new machine are an inferior and slower model than what's in the old c1sus machine.  The CPUs were running about 30% slower, which was enough to bump c1sus, which nominally runs at ~51 usec, over its limit.
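As a rough sanity check of the numbers above, here is a minimal sketch of the budget arithmetic.  It assumes, simplistically, that the cycle time grows in direct proportion to the CPU slowdown; the function and variable names are made up for illustration.  (If "30% slower" instead means 30% less throughput, the estimate would be 51/0.7, which is also over the limit.)

```python
# Back-of-the-envelope check of the cycle-time budget quoted in the entry.
CYCLE_BUDGET_USEC = 60.0  # clock allocation per cycle
NOMINAL_USEC = 51.0       # nominal c1sus cycle time on the original CPU
SLOWDOWN = 0.30           # "about 30% slower"


def scaled_cycle_time(nominal_usec, slowdown):
    """Assumption: cycle time scales linearly with the fractional slowdown."""
    return nominal_usec * (1.0 + slowdown)


estimate = scaled_cycle_time(NOMINAL_USEC, SLOWDOWN)
over_budget = estimate > CYCLE_BUDGET_USEC
print(f"estimated cycle time: {estimate:.1f} usec, over budget: {over_budget}")
```

Under that assumption the model lands a few usec over the 60 usec allocation, which is consistent with it tripping its limit even before counting the occasional excursions to 100 usec.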

This is of course stupid, and I take the blame.  I skimped on the CPUs when I bought the spare machines in an attempt to keep the cost down, and forgot that I had done that when we started discussing using one of the spares as a c1sus replacement.

I think we can salvage things, though, by just purchasing a better CPU, one that matches what's currently in c1sus.  I'll get Steve on it:

c1sus CPU: Intel(R) Xeon(R) CPU X5680 3.33GHz

In any event, the old c1sus is back in place, and everything is back as it was.

Attachment 1: Screenshot-Untitled_Window.png
  9822   Thu Apr 17 11:00:54 2014   jamie   Update   CDS   failed attempt to get Dolphin working on c1ioo

I've been trying to get c1ioo on the Dolphin network, but have not yet been successful.

Background: if we can put the c1ioo machine on the fast Dolphin IPC network, we can essentially eliminate latencies between the c1als model and the c1lsc model, which are currently connected via a Rube Goldberg-esque c1lsc->dolphin->c1sus->rfm->c1ioo configuration.

Rolf gave us a Dolphin host adapter card, and we purchased a Dolphin fiber cable to run from the 1X2 rack to the 1X4 rack where the Dolphin switch is.

Yesterday I installed the dolphin card into c1ioo.  Unfortunately, c1ioo, which is a Sun Fire X4600 and therefore different from the rest of the front end machines, doesn't seem to be recognizing the card.  The /etc/dolphin_present.sh script, which is supposed to detect the presence of the card by grep'ing for the string 'Stargen' in the lspci output, returns null.
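For reference, the kind of check described above can be sketched as follows.  This is not the contents of /etc/dolphin_present.sh, just a minimal illustration of the same idea; the function names and the sample lspci lines in the usage example are invented.

```python
import subprocess


def dolphin_card_present(lspci_output, marker="Stargen"):
    """Return True if any line of the given lspci output contains the marker."""
    return any(marker in line for line in lspci_output.splitlines())


def read_lspci():
    """Run lspci on the local machine and return its text output."""
    result = subprocess.run(["lspci"], capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Invented sample lines, only to exercise the matching logic.
    sample_with_card = "00:00.0 Host bridge: Example\n05:00.0 Bridge: Stargen example device"
    sample_without_card = "00:00.0 Host bridge: Example"
    print(dolphin_card_present(sample_with_card))     # True
    print(dolphin_card_present(sample_without_card))  # False
```

On a real front end one would pass read_lspci() to dolphin_card_present(); an empty match is what "returns null" corresponds to here.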

I've tried moving the card to different PCIe slots, as well as swapping it out with another Dolphin host adapter that we have.  Neither worked.

I looked at the Dolphin host adapter installed in c1lsc and it's quite different, presumably a newer or older model.  Not sure if that has anything to do with anything.

I'm contacting Rolf to see if he has any other ideas.

  9824   Thu Apr 17 16:59:45 2014   jamie   Update   CDS   slightly more successful attempt to get Dolphin working on c1ioo

So it turns out that the card that Rolf had given me was not a Dolphin host adapter after all.  He did have an actual host adapter board on hand, though, and kindly let us take it.  And this one works!

I installed the new board in c1ioo, and it recognized it.  Upon boot, the dolphin configuration scripts managed to automatically recognize the card, load the necessary kernel modules, and configure it.  I'll describe below how I got everything working.

However, at some point mx_stream stopped working on c1ioo.  I have no idea why, and it shouldn't be related to any of this dolphin stuff at all.  But given that mx_stream stopped working at the same time the dolphin stuff started working, I didn't take any chances and completely backed out all the dolphin stuff on c1ioo, including removing the dolphin host adapter from the chassis altogether.  Unfortunately that didn't fix any of the mx_stream issues, so mx_stream continues to not work on c1ioo.  I'll follow up in a separate post about that.  In the meantime, here's what I did to get dolphin working on c1ioo:

c1ioo Dolphin configuration

To get the new host recognized on the Dolphin network, I had to make a couple of changes to the dolphin manager setup on fb.  I referenced the following page:

https://cdswiki.ligo-la.caltech.edu/foswiki/bin/view/CDS/DolphinHowTo

Below are the two patches I made to the dolphin ("dis") config files on fb:

--- /etc/dis/dishosts.conf.bak    2014-04-17 09:31:08.000000000 -0700
+++ /etc/dis/dishosts.conf    2014-04-17 09:28:27.000000000 -0700
@@ -26,6 +26,8 @@
 ADAPTER:  c1sus_a0 8 0 4
 HOSTNAME: c1lsc
 ADAPTER:  c1lsc_a0 12 0 4
+HOSTNAME: c1ioo
+ADAPTER:  c1ioo_a0 16 0 4
 
 # Here we define a socket adapter in single mode.
 #SOCKETADAPTER: sockad_0 SINGLE 0

--- /etc/dis/networkmanager.conf.bak    2014-04-17 09:30:40.000000000 -0700
+++ /etc/dis/networkmanager.conf    2014-04-17 09:30:48.000000000 -0700
@@ -39,7 +39,7 @@
 # Number of nodes in X Dimension. If you are using a single ring, please
 # specify number of nodes in ring.
 
--dimensionX 2;
+-dimensionX 3;
 
 # Number of nodes in Y Dimension.

I then had to restart the DIS network manager to see these changes take effect:

$ sudo /etc/init.d/dis_networkmgr restart

I then rebooted c1ioo one more time, after which c1ioo showed up in the dxadmin GUI.

At this point I tried adding a dolphin IPC connection between c1als and c1lsc to see if it worked.  Unfortunately everything crashed every time I tried to run the models (including models on other machines!).  The problem was that I had forgotten to tell the c1ioo IOP (c1x03) to use PCIe RFM (i.e. Dolphin).  This is done by adding the following flag to the cdsParameters block in the IOP:

pciRfm=1

Once this was added and the IOP was rebuilt/installed/restarted, it came back up fine.  The c1als model with the dolphin output also came up fine.

However, at this point I ran into the c1ioo mx_stream problem and started backing everything out.

 

  9825   Thu Apr 17 17:15:54 2014   jamie   Update   CDS   mx_stream not starting on c1ioo

While trying to get dolphin working on c1ioo, the c1ioo mx_stream processes mysteriously stopped working.  The mx_stream process itself just won't start now.  I have no idea why, or what could have happened to cause this change.  I was working on PCIe dolphin stuff, but have since backed out everything that I had done, and still the c1ioo mx_stream process will not start.

mx_stream relies on the open-mx kernel module, but that appears to be fine:

controls@c1ioo ~ 0$ /opt/open-mx/bin/omx_info  
Open-MX version 1.3.901
 build: root@fb:/root/open-mx-1.3.901 Wed Feb 23 11:13:17 PST 2011

Found 1 boards (32 max) supporting 32 endpoints each:
 c1ioo:0 (board #0 name eth1 addr 00:14:4f:40:64:25)
   managed by driver 'e1000'
   attached to numa node 0

Peer table is ready, mapper is 00:30:48:d6:11:17
================================================
  0) 00:14:4f:40:64:25 c1ioo:0
  1) 00:30:48:d6:11:17 c1iscey:0
  2) 00:25:90:0d:75:bb c1sus:0
  3) 00:30:48:be:11:5d c1iscex:0
  4) 00:30:48:bf:69:4f c1lsc:0
controls@c1ioo ~ 0$ 

However, trying to start mx_stream now fails:

controls@c1ioo ~ 0$ /opt/rtcds/caltech/c1/target/fb/mx_stream -s c1x03 c1ioo c1als -d fb:0
c1x03
mmapped address is 0x7f885f576000
mapped at 0x7f885f576000
send len = 263596
OMX: Failed to find peer index of board 00:00:00:00:00:00 (Peer Not Found in the Table)
mx_connect failed
controls@c1ioo ~ 1$ 

I'm not quite sure how to interpret this error message.  The "00:00:00:00:00:00" has the form of a 48-bit MAC address that would be used for a hardware identifier, like the second column of the OMX "peer table" above, although of course all zeros is not an actual address.  So there's some disconnect between mx_stream and the actual omx configuration stuff that's running underneath.
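To make the comparison concrete, here is a minimal sketch that parses the peer table printed by omx_info above and looks up an address in it.  The table text is copied from the output above; the regular expression and function name are my own, and I'm assuming the "index) MAC name" layout holds for every row.

```python
import re

# Peer table rows copied from the omx_info output above.
PEER_TABLE = """\
  0) 00:14:4f:40:64:25 c1ioo:0
  1) 00:30:48:d6:11:17 c1iscey:0
  2) 00:25:90:0d:75:bb c1sus:0
  3) 00:30:48:be:11:5d c1iscex:0
  4) 00:30:48:bf:69:4f c1lsc:0
"""

# Assumed row layout: "<index>) <48-bit MAC> <board name>"
PEER_LINE = re.compile(
    r"^\s*(\d+)\)\s+((?:[0-9a-f]{2}:){5}[0-9a-f]{2})\s+(\S+)\s*$",
    re.IGNORECASE,
)


def parse_peer_table(text):
    """Return a dict mapping lower-case MAC address to board name."""
    peers = {}
    for line in text.splitlines():
        match = PEER_LINE.match(line)
        if match:
            peers[match.group(2).lower()] = match.group(3)
    return peers


peers = parse_peer_table(PEER_TABLE)
for mac in ("00:30:48:d6:11:17", "00:00:00:00:00:00"):
    print(mac, "->", peers.get(mac, "not in peer table"))
print("fb:0 listed as a peer:", "fb:0" in peers.values())
```

The all-zeros address is, as expected, not in the table.  The same lookup also shows that the destination given to mx_stream with -d (fb:0) is not among the five peers listed; whether that has anything to do with the failure is not established here.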

Again, I have no idea what happened.  I spoke to Rolf and he's going to try to help sort this out tomorrow.

Attachment 1: c1ioo_no_mx_stream.png