40m Log
ID   Date   Author   Type   Category   Subject
  9020   Fri Aug 16 21:15:04 2013   rana   Update   CDS   New/old CDS laptop for X-End

I took the "aso-laptop" and converted it to Ubuntu a couple of months ago. Today I added it to the Martian network and then moved it to the X End.

I followed the instructions in (https://wiki-40m.ligo.caltech.edu/Network) and added it to the files in /var/named/chroot/var/named on linux1 and did the "service named restart".

The router already had its MAC address in its list (because Yoichi was illegally using his personal laptop on the Martian). The new laptop's name is 'asia'. This is a legal name according to our computer naming conventions and this Wiktionary page (http://en.wiktionary.org/wiki/Category:Italian_female_given_names). It has been added to the Name Pool on the wiki.

The terminal on the laptop still calls itself 'aso-laptop' so I need some help in fixing that. It successfully connects to 40MARS and displays a MEDM sitemap after sshing in to pianosa.

I use 'ssh -X -C' since I find that compression actually helps when the laptops are so far from the router.

  9021   Sun Aug 18 16:04:07 2013   rana   Summary   CDS   FB lights all RED: mxstream restart

Sun Aug 18 15:52:50 2013

Found the FB lights (C1:FEC-NN_FB_NET_STATUS and C1:DAQ-DC0_C1XXX_STATUS) RED for everything on the CDS_FE_STATUS screen.

I used the (! mxstream restart) button to restart the mxstreams. Everything is green now.

PMC was out of lock. I relocked it, and the IMC then locked itself, as did the X & Y arms on IR. X was already green locked.

Attachment 1: IFO-Trend.png
  9022   Sun Aug 18 17:56:16 2013   rana   Summary   CDS   MEDM Screen CPU Usages

I noticed at LLO (?) that the LSC screen there uses up ~25-30% of the CPU time on a single core for the control room iMac workstations - this seems excessive.

Here is an accounting of CPU usage percentages for some of our screens:

 

Screen Name       CPU (%)
LSC_OVERVIEW      7
ALS_OVERVIEW      0
ALS               1
SUS_SUMMARY       0
IOO_WFS_MASTER    0.3
OPLEV_MASTER      0.5

These were measured using the program 'glances' on rosalba. MEDM running with only the sitemap used up 0.9% of a CPU. With the screens running, the fluctuation from sample to sample could be ~ +/- 0.5%. While the LSC screen seems to be the biggest pig, it is only big in comparison to small pigs. Certainly this pig has gotten bigger after getting sent to Louisiana.

Attachment 1: obama1404_666531c.jpg
  9066   Mon Aug 26 19:54:15 2013   manasa   Update   CDS   c1als model modified

I made changes to the c1als model a couple of weeks ago. I removed all the beat_coarse channels that were left over from the pre-phase-tracker era.

Also, I forgot to elog about it then. This dawned on me only when I found that c1als isn't working the way it should right now.

mistake-cartoon.gif

  9070   Tue Aug 27 15:44:08 2013   manasa   Update   CDS   Issues with ALS fixed

I figured out the problem with ALS from yesterday. While the model was just fine, the MEDM screens had not been checked to confirm that they were reading the correct channel names.

The channel names of the ALS input matrix elements had changed when the coarse channels were deleted from the c1als model. So the error signals were not reaching the servo modules as expected. This is why I was not able to make sense of what the ALS was doing.

All is fixed now and should be back to normal.

  9074   Tue Aug 27 19:34:36 2013   Jamie   Configuration   CDS   front end IPC configuration

So the IPC situation on the front end network is not so great right now.  For various no-longer-valid reasons, c1lsc had no RFM card, all the IPC connections were routed through the c1rfm model on c1sus, and routed to c1lsc via dolphin PCIe as needed.  As things grew, c1rfm became overloaded.  Koji tried to fix the situation by breaking things out of c1rfm to make direct connections where we could.  This cleared up c1rfm a bit, but now c1mcs is overloaded.

Reminder: PCIe (dolphin) is faster and higher bandwidth than RFM.  The more things we can put on PCIe the better.

Attached is a graph of my rough accounting of the intended direct IPC connections between the front ends.  By "intended direct" I mean what should be direct connections if we had all the appropriate hardware.  Right now the actual connection graph is more convoluted than this since things are passing through c1rfm.  I note this graph was NOT particularly easy to make, which is very unfortunate.  I had to manually look through every model and determine the ultimate source of every incoming IPC.  Kind of a pain in the butt.  It would be nice if there was a simple way to represent this.

Here are some various solutions to the problem as I see it:

a) put c1lsc on the RFM network

This would allow c1lsc to talk to c1ioo, c1iscex, and c1iscey without having to go through c1sus, thereby eliminating c1rfm altogether.  I'm not sure why we didn't just do this originally.

Requires:

  • One RFM card for c1lsc

b) put c1ioo on the PCIe network (and move c1sus's RFM card to c1lsc)

This is probably the most robust solution.

b1) There are roughly 8 IPCs going from c1ioo to c1sus, 4 going the other way, and 3 IPCs from c1ioo to c1lsc.  If we put c1ioo on PCIe, all of these connections, which currently go over RFM, would become direct PCIe connections, which would be a big win.

At this point only the end station front ends would be on RFM, and most of the connections to those come from c1lsc, so it would make sense to give c1lsc the RFM card, thereby eliminating a lot of stuff from c1rfm.

Requires:

  • dolphin card for c1ioo (do the old sun machines support these?  if they don't we could swap the old sun machine for one of the new aLIGO-approved supermicro machines, which we have spares of)
  • dolphin fibre to go to dolphin switch in 1X3 rack

b2) OR, we could move c1ioo to 1X4 with c1lsc and c1sus, and get a OneStop fibre cable to connect to its IO chassis.  We would still need a dolphin card, but we could use copper instead of fibre.  This is my preferred solution, since it moves c1ioo out of 1X1, where it's really in the way and making a lot of noise.  It would also be easier to manage all the machines if they're together in one rack.

Requires:

  • dolphin card for c1ioo
  • dolphin copper cable for c1ioo
  • OneStop fibre for c1ioo

c) put another cpu in c1sus

c1sus is (I believe) able to support another 6-core cpu.  If we added more cores to c1sus, we could break up c1rfm into c1rfm0, c1rfm1, etc.  This is a less elegant solution imho, but it would probably do the job.

Requires:

  • one new CPU for c1sus
Attachment 1: hosts.png
  9076   Tue Aug 27 20:43:34 2013   Koji   Configuration   CDS   front end IPC configuration

The reason we had the PCIe/RFM system was to test this mixed configuration prior to the actual implementation at the sites.
Has this configuration been intensively tested at the site with a practical configuration?

Quote:

Attached is a graph of my rough accounting of the intended direct IPC connections between the front ends. 

It's hard to believe that c1lsc -> c1sus only has 4 channels. We actuate ITMX/Y/BS/PRM/SRM for the length control.
In addition to these, we control the angles of ITMX/Y/BS/PRM (and SRM in future) via c1ass model on c1lsc.
So there should be at least 12 connections (and more as I ignored MCL).

I personally prefer to give the PCIe card to c1ioo and move the RFM card to c1lsc.
But in either case, we want to quantitatively compare what the current configuration is (not omitting the bridging by c1rfm),
and what the future configuration will be, including the additional channels we want to add in the near future,

because RFM connections are really costly, and moving the RFM card to c1lsc may just cause c1lsc to time out
instead of c1sus.
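
A rough way to start the quantitative comparison is to tally how many channels would travel over RFM versus PCIe under each proposal. The sketch below is only illustrative: the channel counts are the approximate figures quoted in this thread, `tally_by_transport` is a hypothetical helper, and it assumes a link is PCIe only when both hosts are on the dolphin network. It also ignores the extra hop through c1rfm, which a real accounting would have to include.

```python
# Host-level IPC counts: (from_host, to_host, number_of_channels).
# The numbers are the rough figures quoted in this thread, not a full audit.
links = [
    ('c1ioo', 'c1sus', 8),
    ('c1sus', 'c1ioo', 4),
    ('c1ioo', 'c1lsc', 3),
    ('c1lsc', 'c1sus', 12),   # "at least 12"
]


def tally_by_transport(links, pcie_hosts):
    """Split channel counts into PCIe and RFM.

    ASSUMPTION: a link counts as PCIe only if both endpoints are on the
    dolphin network; everything else is taken to go over RFM.
    """
    tally = {'pcie': 0, 'rfm': 0}
    for src, dst, n in links:
        kind = 'pcie' if src in pcie_hosts and dst in pcie_hosts else 'rfm'
        tally[kind] += n
    return tally


# Current layout: only c1sus and c1lsc are on dolphin.
current = tally_by_transport(links, pcie_hosts={'c1sus', 'c1lsc'})
# Option (b): c1ioo also gets a dolphin card.
option_b = tally_by_transport(links, pcie_hosts={'c1sus', 'c1lsc', 'c1ioo'})

print('current :', current)
print('option b:', option_b)
```

Adding the end-station hosts and the planned extra channels to `links` would give the fuller picture asked for above.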

  9077   Wed Aug 28 00:41:23 2013   Jenne   Update   CDS   CDS svn commits not happening

svn status update. asx, als and ioo were found not committed. Not sure about who modified ioo last after Jenne.

//edit Manasa - edited the elog instead of replying //

  9079   Wed Aug 28 05:21:58 2013   manasa   Update   CDS   CDS svn commits not happening

I am responsible for missed svn commits with als and asx. I have committed them.

But I have not modified anything with ioo in the last few weeks.

 

  9086   Wed Aug 28 19:47:28 2013   jamie   Configuration   CDS   front end IPC configuration

Quote:

It's hard to believe that c1lsc -> c1sus only has 4 channels. We actuate ITMX/Y/BS/PRM/SRM for the length control.
In addition to these, we control the angles of ITMX/Y/BS/PRM (and SRM in future) via c1ass model on c1lsc.
So there should be at least 12 connections (and more as I ignored MCL).

Koji was correct that I missed some connections from c1lsc to c1sus.  I corrected the graph in the original post.

Also, I should have noted, that that graph doesn't actually include everything that we now have.  I left out all the simplant stuff, which adds extra connections between c1lsc and c1sus, mostly because the sus simplant is being run on c1lsc only because there was no space on c1sus.  That should be corrected, either by moving c1rfm to c1lsc, or by adding a new core to c1sus.

I also spoke to Rolf today about the possibility of getting a OneStop fiber and dolphin card for c1ioo.  The dolphin card and cable we should be able to order no problem.  As for the OneStop, we might have to borrow a new fiber-supporting card from India, then send our current card to OneStop for fiber-supporting modifications.  It sounds kind of tricky.  I'll post more as I figure things out.

Rolf also said that in newer versions of the RCG, the performance of the RFM direct memory access (DMA) has improved considerably, which significantly reduces the model run-time delay involved in using the RFM.  In other words, the long-awaited RCG upgrade might alleviate some of our IPC woes.

We need to upgrade the RCG to the latest release (2.7).

  9087   Wed Aug 28 23:09:55 2013   jamie   Configuration   CDS   code to generate host IPC graph
Attachment 1: hosts.png
Attachment 2: 40m-ipcs-graph.py
#!/usr/bin/env python

# ipc connections: (from, to, number)
ipcs = [
    ('c1scx', 'c1lsc', 1),
    ('c1scy', 'c1lsc', 1),
    ('c1oaf', 'c1lsc', 8),

    ('c1scx', 'c1ass', 1),
    ('c1scy', 'c1ass', 1),
... 96 more lines ...
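
The rest of the attachment is not shown here, so the following is not a reconstruction of it. It is a separate minimal sketch of one way to turn the model-level `(from, to, number)` tuples into a host-level graph. The `model_host` mapping and the `host_graph` and `to_dot` helpers are hypothetical, and the assignment of c1scx and c1scy to c1iscex and c1iscey is an assumption.

```python
from collections import Counter

# Model-level IPC connections: (from_model, to_model, number_of_channels).
# Only the five entries visible in the attachment above are used.
ipcs = [
    ('c1scx', 'c1lsc', 1),
    ('c1scy', 'c1lsc', 1),
    ('c1oaf', 'c1lsc', 8),
    ('c1scx', 'c1ass', 1),
    ('c1scy', 'c1ass', 1),
]

# ASSUMPTION: which front-end host runs each model. Adjust to match reality.
model_host = {
    'c1scx': 'c1iscex',
    'c1scy': 'c1iscey',
    'c1oaf': 'c1lsc',
    'c1lsc': 'c1lsc',
    'c1ass': 'c1lsc',
}


def host_graph(ipcs, model_host):
    """Sum model-level IPC counts into host-to-host counts.

    Connections between two models on the same host are dropped, since
    they never leave the machine.
    """
    totals = Counter()
    for src, dst, n in ipcs:
        src_host, dst_host = model_host[src], model_host[dst]
        if src_host != dst_host:
            totals[(src_host, dst_host)] += n
    return totals


def to_dot(totals):
    """Render the host-level totals as a Graphviz DOT digraph string."""
    lines = ['digraph ipc {']
    for (src, dst), n in sorted(totals.items()):
        lines.append('    "%s" -> "%s" [label="%d"];' % (src, dst, n))
    lines.append('}')
    return '\n'.join(lines)


if __name__ == '__main__':
    print(to_dot(host_graph(ipcs, model_host)))
```

The DOT text can be fed to Graphviz to draw a picture; keeping the model-to-host map separate from the connection list means moving a model between machines is a one-line change.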
  9137   Wed Sep 18 11:29:43 2013   manasa   Update   CDS   Dataviewer cannot connect to fb

Masayuki pointed out that dataviewer wasn't connecting to the fb this morning.

When I started dataviewer from the terminal I obtained the following error:

controls@pianosa:~ 0$ dataviewer
Can't find hostname `fb:8088'
Can't find hostname `fb:8088'; gethostbyname(); error=1
Warning: Not all children have same parent in XtManageChildren
Warning: Not all children have same parent in XtManageChildren
Warning: Not all children have same parent in XtManageChildren
Warning: Not all children have same parent in XtManageChildren
Warning: Not all children have same parent in XtManageChildren
Error in obtaining chan info.
Can't find hostname `fb:8088'
Can't find hostname `fb:8088'; gethostbyname(); error=1

I checked the CDS FE status screen and it looks normal. I could ping the fb and ssh to it as well.

I restarted fb to see if it made any difference. telnet fb 8088

It hasn't helped. Anything else that can be done??

CDS_FE.png

  9138   Wed Sep 18 11:52:53 2013   Jamie   Update   CDS   Dataviewer cannot connect to fb

Quote:

Masayuki pointed out that dataviewer wasn't connecting to the fb this morning.

When I started dataviewer from the terminal I obtained the following error:

controls@pianosa:~ 0$ dataviewer
Can't find hostname `fb:8088'
Can't find hostname `fb:8088'; gethostbyname(); error=1
Warning: Not all children have same parent in XtManageChildren
Warning: Not all children have same parent in XtManageChildren
Warning: Not all children have same parent in XtManageChildren
Warning: Not all children have same parent in XtManageChildren
Warning: Not all children have same parent in XtManageChildren
Error in obtaining chan info.
Can't find hostname `fb:8088'
Can't find hostname `fb:8088'; gethostbyname(); error=1

I checked the CDS FE status screen and it looks normal. I could ping the fb and ssh to it as well.

I restarted fb to see if it made any difference. telnet fb 8088

It hasn't helped. Anything else that can be done??

I've fixed the problem.  This was due to a change I made in the NDSSERVER environment variable so that it would work with cdsutils.  I didn't realize there was an incompatibility with how dataviewer parses NDSSERVER.  Joe and I will have to figure it out.

In the meantime I've changed things back so that dataviewer should now work as expected.  You might have to log out and back in for it to work (or at least open a new terminal).

  9182   Tue Oct 1 14:12:22 2013   rana   Summary   CDS   svndumpfilter on linux1 makes NFS slow

 Yesterday and this morning's slow NFS disk access was caused by 'svndumpfilter' being run at linux1 to carve out the Noise Budget directory. It is being moved to another server; I think the disk access is back to normal speed now.

  9184   Tue Oct 1 19:42:19 2013   rana   Summary   CDS   megatron upgrade

Max and I started upgrading megatron to Ubuntu 12.NN today. We were having some troubles with getting latest python code to run to support the Summary pages stuff.

It's also a nice test to see what CDS tools fail on there, before we upgrade the workstations to Ubuntu 12.

Since it's Linux, none of the usual upgrading commands worked, but after an hour or so of reading forums we were able to delete some packages and all the 3rd party packages and get the upgrade to go ahead. We'll have to re-install the LSC, GDS, LAL repos to get it back into shape and get NDS2 working. The upgrade is running in a 'screen' command on there.


Wed Oct 02 14:50:16 2013 

Update #1: The upgrade asks a couple dozen questions so it doesn't proceed by itself. I've been checking in to the 'screen' every couple hours to type in 'Yes' to let it keep going.


Update #2: It finished a few hours ago:

controls@megatron:~ 0$ uname -a
Linux megatron 3.2.0-54-generic #82-Ubuntu SMP Tue Sep 10 20:08:42 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
controls@megatron:~ 0$ date
Wed Oct  2 18:33:41 PDT 2013

  9259   Tue Oct 22 18:55:55 2013   rana   Update   CDS   Workstation swap: Rosalba to ???

We got a new computer from Xi computer corp. I am currently installing Ubuntu 10.04 LTS on to it to start with and then will move on to 12 if we can figure out a way to test it besides "I guess it should work?"

Rosalba has been removed and put onto the old Jamie desk. Old Jamie desk also has a Mac Mini running on there.

At the meeting tomorrow we need to decide on a new Italian baby girl name for this new machine.

  9271   Wed Oct 23 22:11:20 2013   rana   Update   CDS   Workstation swap: Rosalba to ???

  I've finished setting up the fstab on Chiara and the upgrade to Ubuntu 12 seems to have gone well enough. She's fast:

24.png

but I forgot to make sure to order a dual head graphics card for it. So we'll order some dual DVI gaming card that the company recommends. Until then, it's only one monitor.

Still, its ready for testing control room tools on. If everything works OK for a couple weeks, we can go to 12 on all the other ones.

  9278   Thu Oct 24 12:00:11 2013   jamie   Update   CDS   fb acquisition of slow channels

Quote:

 

 While that would be good - it doesn't address the EDCU problem at hand. After some verbal emailing, Jamie and I find that the master file in target/fb/ actually doesn't point to any of the EDCU files created by any of the FE machines. It is only using the C0EDCU.ini as well as the *_SLOW.ini files that were last edited in 2011 !!!

So....we have not been adding SLOW channels via the RCG build process for a couple years. Tomorrow morning, Jamie will edit the master file and fix this unless I get to it tonight. There are a bunch of old .ini files in the daq/ dir that can be deleted too.

I took a look at the situation here so I think I have a better idea of what's going on (it's a mess, as usual):

The framebuilder looks at the "master" file

    /opt/rtcds/caltech/c1/target/fb/master

which lists a bunch of other files that contain lists of channels to acquire.  It looks like there might have been some notion to just use 

    /opt/rtcds/caltech/c1/chans/daq/C0EDCU.ini

as the master slow channels file.  Slow channels from all over the place have been added to this file, presumably by hand.  Maybe the idea was to just add slow channels manually as needed, instead of recording them all by default.  The full slow channels lists are in the

    /opt/rtcds/caltech/c1/chans/daq/C1EDCU_<model>.ini

files, none of which are listed in the fb master file.

There are also these old slow channel files, like

    /opt/rtcds/caltech/c1/chans/daq/SUS_SLOW.ini

There's a perplexing breakdown of channels spread out between these files and C0EDCU.ini:

controls@fb ~ 0$ grep MC3_URS /opt/rtcds/caltech/c1/chans/daq/C0EDCU.ini
[C1:SUS-MC3_URSEN_OVERFLOW]
[C1:SUS-MC3_URSEN_OUTPUT]
controls@fb ~ 0$ grep MC3_URS /opt/rtcds/caltech/c1/chans/daq/MCS_SLOW.ini
[C1:SUS-MC3_URSEN_INMON]
[C1:SUS-MC3_URSEN_OUT16]
[C1:SUS-MC3_URSEN_EXCMON]
controls@fb ~ 0$

Why some of these channels are in one file and some in the other, I have no idea.  If the fb finds the same channel more than once, it will fail to start, so at least we've been diligent about keeping disparate lists in the different files.
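
Since a duplicated channel keeps the fb from starting, it may be worth checking for duplicates before editing the master file. Here is a minimal sketch, assuming each channel is declared by a `[NAME]` line as in the grep output above and that a `[default]` section, if present, is not a channel. `channels_in` and `find_duplicates` are hypothetical helpers, not part of any CDS tool.

```python
import re
from collections import defaultdict

# A channel entry is a line of the form "[C1:SUS-MC3_URSEN_OUTPUT]".
SECTION = re.compile(r'^\[([^\]]+)\]\s*$')


def channels_in(text):
    """Return the channel names declared in one .ini file's text."""
    names = []
    for line in text.splitlines():
        match = SECTION.match(line.strip())
        # ASSUMPTION: "[default]" holds defaults and is not a channel.
        if match and match.group(1) != 'default':
            names.append(match.group(1))
    return names


def find_duplicates(files):
    """Map each channel declared more than once to the files declaring it.

    `files` is a dict of {filename: file contents}.
    """
    seen = defaultdict(list)
    for filename, text in files.items():
        for name in channels_in(text):
            seen[name].append(filename)
    return {name: where for name, where in seen.items() if len(where) > 1}
```

In practice `files` would be built by reading every file listed in the master file; an empty result means there are no name clashes between them.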

So I guess the question is whether we want to automatically record all slow channels by default, in which case we add in the C1EDCU_<model>.ini files, or whether we want to keep adding them in by hand, in which case we keep the status quo.  In either case we should probably get rid of the *_SLOW.ini files (maybe by integrating their channels into C0EDCU.ini), since they're old and just confusing things.

In the meantime, I added C1:FEC-45_CPU_METER to C0EDCU.ini, so that we can keep track of the load there.

 

  9282   Thu Oct 24 17:26:35 2013   jamie   Update   CDS   new dataviewer installed; 'cdsutils avg' now working.

I installed a new version of dataviewer (2.3.2), and at the same time fixed the NDSSERVER issue we were having with cdsutils.  They should both be working now.

The problem turned out to be that I had set up our dataviewer to use the NDSSERVER environment variable, whereas by default it uses the LIGONDSIP variable.  Why we have two different environment variables that mean basically exactly the same thing, who knows.
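
For scripts that need to work no matter which of the two variables is set, something like the following could paper over the difference. This is only a sketch under stated assumptions: NDSSERVER is taken to be a comma-separated list of `host[:port]` entries, LIGONDSIP a bare host name or IP address, and 8088 is used as the default port only because that is the port dataviewer reports for fb in this log. `nds_servers` is a hypothetical helper, not a cdsutils or dataviewer function.

```python
import os

DEFAULT_PORT = 8088   # the port dataviewer reports for fb in this log


def nds_servers(env=None):
    """Return a list of (host, port) tuples for the NDS server.

    Looks at NDSSERVER first (ASSUMED to be a comma-separated list of
    "host[:port]" entries), then falls back to LIGONDSIP (ASSUMED to hold
    a bare host name or IP address).
    """
    env = os.environ if env is None else env
    servers = []
    for entry in env.get('NDSSERVER', '').split(','):
        entry = entry.strip()
        if not entry:
            continue
        host, _, port = entry.partition(':')
        servers.append((host, int(port) if port else DEFAULT_PORT))
    if not servers and env.get('LIGONDSIP'):
        servers.append((env['LIGONDSIP'].strip(), DEFAULT_PORT))
    return servers
```

Splitting the host from the port up front would avoid handing a string like `fb:8088` to a hostname lookup, which is what the "Can't find hostname" messages in the earlier entries look like.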

  9285   Thu Oct 24 23:12:21 2013   jamie   Update   CDS   new dataviewer installed; no longer works on Ubuntu 10 workstations

Quote:

I installed a new version of dataviewer (2.3.2), and at the same time fixed the NDSSERVER issue we were having with cdsutils.  They should both be working now.

The problem turned out to be that I had setup our dataviewer to use the NDSSERVER environment, whereas by default it uses the LIGONDSIP variable.  Why we have two different environment variables that mean basically exactly the same thing, who knows.

 Dataviewer seems to run fine on Chiara (Ubuntu 12), but not on Rossa or Pianosa (Ubuntu 10), or Megatron, which I assume is also something medium-old.

We get the error:

controls@megatron:~ 0$ dataviewer
Can't find hostname `fb:8088'
Can't find hostname `fb:8088'; gethostbyname(); error=1
Warning: Not all children have same parent in XtManageChildren
Warning: Not all children have same parent in XtManageChildren
Warning: Not all children have same parent in XtManageChildren
Warning: Not all children have same parent in XtManageChildren
Warning: Not all children have same parent in XtManageChildren
Error in obtaining chan info.
Can't find hostname `fb:8088'
Can't find hostname `fb:8088'; gethostbyname(); error=1

Sadface :(   We also get the popup saying "Couldn't connect to fb:8088"

  9287   Thu Oct 24 23:30:57 2013   jamie   Update   CDS   new dataviewer installed; no longer works on Ubuntu 10 workstations

Quote:

Quote:

I installed a new version of dataviewer (2.3.2), and at the same time fixed the NDSSERVER issue we were having with cdsutils.  They should both be working now.

The problem turned out to be that I had setup our dataviewer to use the NDSSERVER environment, whereas by default it uses the LIGONDSIP variable.  Why we have two different environment variables that mean basically exactly the same thing, who knows.

 Dataviewer seems to run fine on Chiara (Ubuntu 12), but not on Rossa or Pianosa (Ubuntu 10), or Megatron, which I assume is also something medium-old.

We get the error:

controls@megatron:~ 0$ dataviewer
Can't find hostname `fb:8088'
Can't find hostname `fb:8088'; gethostbyname(); error=1
Warning: Not all children have same parent in XtManageChildren
Warning: Not all children have same parent in XtManageChildren
Warning: Not all children have same parent in XtManageChildren
Warning: Not all children have same parent in XtManageChildren
Warning: Not all children have same parent in XtManageChildren
Error in obtaining chan info.
Can't find hostname `fb:8088'
Can't find hostname `fb:8088'; gethostbyname(); error=1

Sadface :(   We also get the popup saying "Couldn't connect to fb:8088"

Sorry, that was a goof on my part.  It should be working now.

  9288   Fri Oct 25 01:46:33 2013   rana   Update   CDS   fb acquisition of slow channels

Rather than limp along with a broken SLOW channel system, I fixed it so that the EDCU files made during the RCG build actually get used and added to the channel list (and thereby available in DV and trends).

I first started by adding all of the EDCU files. This completely fails; daqd just doesn't start and gives some weird exceptions.

So I removed a bunch of them and it runs OK now with ~15000 channels. Previously we had ~1500 slow channels.

An in-between config tonight had ~58000 channels and was also running fine, but the connection to the FB would time out when using DV after several minutes. Possibly we can fix this by adding some more RAM to the FB (the DAQD process uses up 45% of the CPU and 39% of the 8 GB of RAM).

Another issue in getting this to work was that there were a bunch of channel name conflicts between the old C0EDCU.ini and the sub-system EDCU files that I was trying to add. I went through by hand and deleted all of the duplicates from the old file. The new frame files are 80 MB, the old ones were 66 MB.
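
The by-hand cleanup could in principle be scripted. Below is a minimal sketch, assuming each channel in the old file is a `[NAME]` line optionally followed by parameter lines that belong to it. `drop_channels` is a hypothetical helper, and the set of names already declared in the sub-system EDCU files has to be collected separately.

```python
import re

SECTION = re.compile(r'^\[([^\]]+)\]\s*$')


def drop_channels(old_text, taken):
    """Return old_text with every section whose name is in `taken` removed.

    A section is the "[NAME]" line plus any lines up to the next "[...]"
    line, so parameter lines are removed along with their channel.
    """
    kept = []
    skipping = False
    for line in old_text.splitlines():
        match = SECTION.match(line.strip())
        if match:
            # Each new section header decides whether we keep or skip.
            skipping = match.group(1) in taken
        if not skipping:
            kept.append(line)
    return '\n'.join(kept) + '\n'
```

Writing the result to a new file and diffing it against the original before swapping it in keeps the old list recoverable if daqd refuses to start.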

I hope that /frames doesn't become full - not sure how that is wiped...

  9297   Sat Oct 26 22:48:55 2013   rana   Update   CDS   New/old CDS laptop for X-End

  I made the Yoichi laptop into a CDS laptop called 'asia' a few months ago. Somehow I mistakenly gave it the IP address of our little Acer laptop, which is called 'farfalla'. This breaks farfalla's networking. I put the old Dell Aldabella by the PMC where farfalla was and am now upgrading farfalla from CentOS to Ubuntu 10.04 LTS 32-bit. I have updated the host table on linux1 to give farfalla the 230 IP address and let 'asia' keep 225.

  9302   Mon Oct 28 12:53:23 2013   Jenne   Update   CDS   Farfalla and Asia added to Host Table in Wiki

Quote:

I have updated the host table on linux1 to give farfalla the 230 IP address and let 'asia' keep 225.

 Neither of these computers was listed in the Martian Host Table in the wiki, so I put them on there.  It's handy to keep this updated, so that we know what IP addresses are available.
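
A small script could catch the kind of clash that bit farfalla and also list which addresses are still free. The sketch below assumes the host table has been exported as plain `<name> <ip>` lines, which is not how the wiki or the named files actually store it. The helper names are hypothetical, and the 192.168.113.x prefix for asia and farfalla is an assumption based on the c1auxex and c1auxey addresses seen elsewhere in this log.

```python
import ipaddress
from collections import defaultdict


def parse_hosts(text):
    """Parse lines of the form "<name> <ip>" into (name, address) pairs.

    Blank lines and lines starting with '#' are ignored.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        name, addr = line.split()[:2]
        entries.append((name, ipaddress.ip_address(addr)))
    return entries


def ip_conflicts(entries):
    """Return {ip: [names]} for every address claimed by more than one host."""
    by_ip = defaultdict(list)
    for name, addr in entries:
        by_ip[addr].append(name)
    return {addr: names for addr, names in by_ip.items() if len(names) > 1}


def free_addresses(entries, network):
    """List usable host addresses in `network` that no entry claims."""
    used = {addr for _, addr in entries}
    return [addr for addr in ipaddress.ip_network(network).hosts()
            if addr not in used]


# Hypothetical excerpt reproducing the asia/farfalla clash.
table = """
# name     ip
c1auxex    192.168.113.59
c1auxey    192.168.113.60
asia       192.168.113.225
farfalla   192.168.113.225
"""

if __name__ == '__main__':
    entries = parse_hosts(table)
    for addr, names in ip_conflicts(entries).items():
        print('conflict:', addr, names)
    print(len(free_addresses(entries, '192.168.113.0/24')), 'addresses free')
```

Running a check like this whenever a machine is added would flag a reused address before it takes another host off the network.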

  9308   Tue Oct 29 16:51:31 2013   Jenne   Update   CDS   LSC test points were used up

Masayuki was concerned that some LSC channels were giving him all zeros.  After seeing the error in the terminal window running dataviewer (it said something like 'daqd overloaded'), I checked the lsc model, and sure enough, all the test points were used.

So, I found an entry by Jamie (elog 8431) where he reminds us how to clear the test points.  I followed the instructions, and now we're seeing real data (not digital zeros) again.

  9354   Wed Nov 6 15:12:01 2013   Jenne   Update   CDS   FB not talking to LSC?

Something funny is going on with the framebuilder's communication with the LSC machine. 

This is a different failure mode / error than I have seen before.  It's not the type of problem that is solved by restarting the mxstreams (that kind is indicated by the 2 blocks on top of one another, which are green on the lsc machine right now, also being red), although I did try that before I looked closer and realized that that wasn't the problem.

ssh-ing to c1lsc, and doing a "rtcds restart all" seems to be fixing the problem.  Both c1oaf and c1cal needed another round of restarting, because they needed their BURT buttons pressed manually.  All of the models on the lsc machine are running fine now, though.

Here's a screenshot of the CDS overview screen, with the error lights:

Screenshot-Untitled_Window-1.png

  9357   Wed Nov 6 17:21:58 2013   Jamit   Update   CDS   FB not talking to LSC?

Quote:

Something funny is going on with the framebuilder's communication with the LSC machine. 

This is a different failure mode / error than I have seen before.  It's not the type of problem that is solved by restarting the mxstreams (that is indicated by also the 2 blocks on top of one another, that are green on the lsc machine right now, being red), although I did try that, before I looked closer and realized that that wasn't the problem.

ssh-ing to c1lsc, and doing a "rtcds restart all" seems to be fixing the problem.  Both c1oaf and c1cal needed another round of restarting, because they needed their BURT buttons pressed manually.  All of the models on the lsc machine are running fine now, though.

Here's a screenshot of the CDS overview screen, with the error lights:

Screenshot-Untitled_Window-1.png

This definitely looks like a timing problem on the c1lsc front end computer.  The red lights on the left mean that the timing synchronization is lost at the user model.  I'm perplexed why it looks like the IOP is not seeing the same error, though, since it should originate at the ADC.  The red lights to the right just mean the timing synchronization is lost with the DAQ, which is to be expected given a timing loss at the front end.

We'll have to take a closer look when this happens again.

  9364   Mon Nov 11 12:19:36 2013   rana   Update   CDS   FE Web view was fixed

Quote:

FE Web view was broken for a long time. It is fixed now.

The problem was that path names were not fixed when we moved the models from the old local place to the SVN structure.

The auto updating script (/cvs/cds/rtcds/caltech/c1/scripts/AutoUpdate/update_webview.cron) is running on Mafalda.

Link to the web view: https://nodus.ligo.caltech.edu:30889/FE/

 Seems partially broken again. Not updating for most of the FE. I've commented out the cron lines for this as well as the mostly broken MEDM Snapshots job. I'm in the process of adding them to the megatron cron (since that machine is at least running 64 bit Ubuntu 12, instead of 32-bit CentOS)

  9366   Tue Nov 12 15:04:35 2013   rana   Update   CDS   FE Web view was fixed

Quote:

 Seems partially broken again. Not updating for most of the FE. I've commented out the cron lines for this as well as the mostly broken MEDM Snapshots job. I'm in the process of adding them to the megatron cron (since that machine is at least running 64 bit Ubuntu 12, instead of 32-bit CentOS)

 https://nodus.ligo.caltech.edu:30889/medm/screenshot.html

Seems to now be working. I made several fixes to the scripts to get it working again:

  1. changed TCSH scripts to BASH. Used /usr/bin/env to find bash.
  2. fixed stdout and stderr redirection so that we could see all error messages.
  3. made the PERL scripts executable. most of the PERL errors are not being logged yet.
  4. fixed paths for the MEDM screens to point to the right directories.
  5. the screen cap only works on screens which pop open on the left monitor, so I edited the screens so that they open up there by default.
  6. moved the CRON jobs from mafalda over to megatron. Mafalda no longer is running any crons.
  7. op540m used to run the 3 projector StripTool displays and have its screen dumped for this web page. Now zita is doing it, but I don't know how to make zita dump her screen.
  9375   Wed Nov 13 18:02:08 2013   Jenne   Update   CDS   Can't talk to AUXEY?

The restore scripts from the IFO config screen half-failed, with this error:

retrying (1/5)...
retrying (2/5)...
CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "c1auxey.martian:5064"
    Source File: ../cac.cpp line 1214
    Current Time: Wed Nov 13 2013 17:24:00.389261330
..................................................................

Jamie, do you know what this might be?  When requested, ETMY was not misaligned or restored, but we got these errors.  So, somehow we're not talking properly to EY, but other things seem fine (the models are running okay, the suspension is damped, etc, etc.)

  9387   Thu Nov 14 22:23:22 2013   Jenne   Update   CDS   Can't talk to AUXEY?

Quote:

The restore scripts from the IFO config screen half-failed, with this error:

retrying (1/5)...
retrying (2/5)...
CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "c1auxey.martian:5064"
    Source File: ../cac.cpp line 1214
    Current Time: Wed Nov 13 2013 17:24:00.389261330
..................................................................

Jamie, do you know what this might be?  When requested, ETMY was not misaligned or restored, but we got these errors.  So, somehow we're not talking properly to EY, but other things seem fine (the models are running okay, the suspension is damped, etc, etc.)

 This problem is now worse - the sliders on IFO_ALIGN for ETMY are white.  I can't telnet to the machine either, although auxex works okay.  Rather, it looks like maybe I'm getting to auxey, but then I'm immediately booted.  I can ping both c1auxex and c1auxey with no problem.
 

Heeeeelllp please.  Is this just a "shut off, then turn back on" problem?  I'm wary of hard rebooting things, with all the warnings and threats in the elog lately.  I've sent an email to Jamie to ping him.

There are some vague instructions in the wiki, but they begin at doing the burt restores, not actually restarting the computers: wiki  Back in July, elog 8858 was written, on which the wiki instructions seem to be based.  But in the elog it says "...went to the /cvs/cds/caltech/target/ area and started to (one by one) inspect all of the targets to see if they were alive.", but I don't know what "inspected" means in this case.  I probably should, since I've been here for something like a millennium, but I don't.


controls@rossa:~ 0$ telnet c1auxey
Trying 192.168.113.60...
Connected to c1auxey.martian.
Escape character is '^]'.
Connection closed by foreign host.
controls@rossa:~ 1$ telnet c1auxex
Trying 192.168.113.59...
Connected to c1auxex.martian.
Escape character is '^]'.

c1auxex >
telnet> ^]
?Invalid command
telnet> exit
?Invalid command
telnet> quit
Connection closed.
controls@rossa:~ 0$ telnet c1auxey
Trying 192.168.113.60...
Connected to c1auxey.martian.
Escape character is '^]'.
Connection closed by foreign host.
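
The "connects, then gets booted" symptom above can be told apart from a host that is simply unreachable with a small TCP probe. This is a minimal sketch using only the standard socket module; `probe_telnet` and its return labels are hypothetical. A real telnet server may send negotiation bytes before closing, in which case this would report 'responding' even though no prompt ever appears.

```python
import socket


def probe_telnet(host, port=23, timeout=3.0):
    """Classify how a TCP service responds right after connecting.

    Returns one of:
      'unreachable' - the TCP connection could not be established
      'dropped'     - connected, but the peer closed without sending anything
      'silent'      - connected, nothing received before the timeout
      'responding'  - connected and received some bytes (e.g. a prompt)
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return 'unreachable'
    with sock:
        sock.settimeout(timeout)
        try:
            data = sock.recv(256)
        except socket.timeout:
            return 'silent'
        except OSError:
            # e.g. connection reset by the peer
            return 'dropped'
        # An empty read means the peer closed the connection cleanly.
        return 'responding' if data else 'dropped'


if __name__ == '__main__':
    for host in ('c1auxex', 'c1auxey'):
        print(host, probe_telnet(host))
```

Run against both end-station machines, a probe like this would show at a glance whether the target accepts the connection and then drops it, or never answers at all.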
  9391   Fri Nov 15 10:19:26 2013   manasa   Update   CDS   Can't talk to AUXEY?

Quote:

 

 This problem is now worse - the sliders on IFO_ALIGN for ETMY are white.  I can't telnet to the machine either, although auxex works okay.  Rather, it looks like maybe I'm getting to auxey, but then I'm immediately booted.  I can ping both c1auxex and c1auxey with no problem.
 

Heeeeelllp please.  Is this just a "shut off, then turn back on" problem?  I'm wary of hard rebooting things, with all the warnings and threats in the elog lately.  I've sent an email to Jamie to ping him.

There are some vague instructions in the wiki, but they begin at doing the burt restores, not actually restarting the computers: wiki  Back in July, elog 8858 was written, from which the wiki instructions seem to be based.  But in the elog it says "...went to the /cvs/cds/caltech/target/ area and started to (one by one) inspect all of the targets to see if they were alive.", but I don't know what "inspected" means in this case.  I probably should, since I've been here for something like a millennia, but I don't.

 

This is what was done (as I recollect) when we said "inspected": telnet into each computer, ping it, and look at the status. Since c1auxey is not responding, here is how c1auxex responds.

controls@rossa:/cvs/cds/caltech/target 0$ telnet c1auxex
Trying 192.168.113.59...
Connected to c1auxex.martian.
Escape character is '^]'.

c1auxex > h
  1  i
  2  -help
  3  --help
  4  h
  5  2
  6  h
  7  -help
  8  i
  9  h
value = 0 = 0x0
c1auxex > i

  NAME        ENTRY       TID    PRI   STATUS      PC       SP     ERRNO  DELAY
---------- ------------ -------- --- ---------- -------- -------- ------- -----
tExcTask   _excTask       fde244   0 PEND          87094   fde1ac   3006b     0
tLogTask   _logTask       fdb944   0 PEND          87094   fdb8a8       0     0
tShell     _shell         ddad00   1 READY         6d974   dda9c8  3d0001     0
tRlogind   _rlogind       fbc11c   2 PEND          2b604   fbbdf4       0     0
tTelnetd   _telnetd       fba278   2 PEND          2b604   fba1a8       0     0
tTelnetOutT_telnetOutTa   db7578   2 READY         2b604   db72e0       0     0
tTelnetInTa_telnetInTas   db6060   2 READY         2b5dc   db5d68       0     0
callback   _callbackTas   f7941c  40 PEND          2b604   f793d4       0     0
scanEvent  ee7ca8         ecacb4  41 PEND          2b604   ecac6c       0     0
tNetTask   _netTask       fd75b8  50 READY         6be6c   fd7550       0     0
scanPeriod ee78f8         ecd554  53 READY         6d192   ecd508       0     0
scanPeriod ee78f8         f23e48  54 DELAY         6d192   f23dfc       0     6
tFtpdTask  _ftpdTask      fb7848  55 PEND          2b604   fb778c       0     0
scanPeriod ee78f8         f266e8  55 READY         6d192   f2669c       0     0
scanPeriod ee78f8         f38678  56 READY         6d192   f3862c       0     0
callback   _callbackTas   f7bcbc  57 PEND          2b604   f7bc74       0     0
scanPeriod ee78f8         f906d8  57 DELAY         6d192   f9068c       0    59
scanPeriod ee78f8         f995ac  58 DELAY         6d192   f99560       0   238
scanPeriod ee78f8         f9c908  59 DELAY         6d192   f9c8bc       0   538
callback   _callbackTas   fa4c1c  65 PEND          2b604   fa4bd4       0     0
scanOnce   ee7764         f9f96c  65 PEND          2b604   f9f92c       0     0
epicsPrint f0501c         e88fa0  70 PEND          2b604   e88f64   c0002     0
ts_Casync  ee5bae         f76b7c  70 DELAY         6d192   f76880  3d0004   178
tPortmapd  _portmapd      fb8d60 100 PEND          2b604   fb8c2c      16     0
EgRam      ea00e4         fa14ac 100 PEND          2b604   fa1458       0     0
CA client  _camsgtask     d85878 180 PEND          2b604   d85774  3d0004     0
CA client  _camsgtask     df91e8 180 PEND          2b604   df90e4       0     0
CA client  _camsgtask     d98bf4 180 PEND          2b604   d98af0       0     0
CA client  _camsgtask     e03cd0 180 PEND          2b604   e03bcc       0     0
CA client  _camsgtask     ddf2b8 180 PEND          2b604   ddf1b4       0     0
CA client  _camsgtask     faaec8 180 PEND          2b604   faadc4       0     0
CA client  _camsgtask     d79f3c 180 PEND          2b604   d79e38       0     0
CA TCP     _req_server    f305dc 181 PEND          2b604   f30540       0     0
CA repeaterf109e2         f215a8 181 PEND          2b604   f21474       0     0
CA event   _event_task    d7fe58 181 PEND          2b604   d7fe10       0     0
CA event   _event_task    d6ce5c 181 PEND          2b604   d6ce14       0     0
CA event   _event_task    dab7e0 181 PEND          2b604   dab798       0     0
CA event   _event_task    d76efc 181 PEND          2b604   d76eb4       0     0
CA event   _event_task    d9bddc 181 PEND          2b604   d9bd94       0     0
CA event   _event_task    d9a864 181 PEND          2b604   d9a81c       0     0
CA event   _event_task    da8d8c 181 PEND          2b604   da8d44       0     0
CA UDP     _cast_server   f2f064 182 READY        efcabe   f2efe4       0     0
CA online  _rsrv_online   f2d84c 183 DELAY         6d192   f2d7bc       0   265
EV save_res_event_task    de88dc 189 PEND          2b604   de8894   3006b     0
save_restor_save_restor   df61cc 190 PEND          2b604   df5c44  3d0002     0
RD save_res_cac_recv_ta   fb47d8 191 READY         2b604   fb46a4  3d0004     0
logRestart f05d42         e861c0 200 PEND+T        2b604   e86174      33  1714
taskwd     ef4d46         e85030 200 DELAY         6d192   e84f7c       0   224
value = 0 = 0x0
c1auxex >
telnet> quit
Connection closed.
controls@rossa:/cvs/cds/caltech/target 0$
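
The "inspection" above boils down to checking whether each slow machine accepts a telnet connection and keeps it open. Here is a minimal sketch of that check in Python, using only the standard library. The function name probe_telnet and the three result labels are made up for illustration; this is not an existing 40m script.

```python
import socket


def probe_telnet(host, port=23, timeout=2.0):
    """Classify a host's telnet port as 'unreachable', 'closed', or 'open'.

    'closed' means the TCP connection was accepted and then dropped at once
    by the far end, which is what c1auxey did in the sessions above.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, timed out, or the name did not resolve.
        return "unreachable"
    with sock:
        sock.settimeout(timeout)
        try:
            data = sock.recv(64)
        except socket.timeout:
            # Nothing was sent, but the link is still up.
            return "open"
        except OSError:
            # Connection reset by the far end.
            return "closed"
        # An empty read means the peer closed the connection.
        return "open" if data else "closed"
```

From a control room machine, probe_telnet("c1auxex") and probe_telnet("c1auxey") could then be compared side by side. A "closed" result only says that the machine dropped the session; it says nothing about why.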

  9393   Fri Nov 15 10:49:55 2013 jamieUpdateCDSCan't talk to AUXEY?

Please just try rebooting the vxworks machine.  I think there is a key on the card or crate that will reset the device.  These machines are "embedded" so they're designed to be hard reset, so don't worry, just restart the damn thing and see if that fixes the problem.

  9394   Fri Nov 15 12:00:28 2013 KojiUpdateCDSCan't talk to AUXEY?

Quote:

Please just try rebooting the vxworks machine.  I think there is a key on the card or crate that will reset the device.  These machines are "embedded" so they're designed to be hard reset, so don't worry, just restart the damn thing and see if that fixes the problem.

 Don't forget to run burtrestore for the target.

  9395   Fri Nov 15 12:38:50 2013 JenneUpdateCDSCan't talk to AUXEY?

Quote:

Please just try rebooting the vxworks machine.  I think there is a key on the card or crate that will reset the device.  These machines are "embedded" so they're designed to be hard reset, so don't worry, just restart the damn thing and see if that fixes the problem.

 This is what I remember doing all the time when Rob was around, but with all the new computers, I forgot whether or not this was allowed for the slow computers.

Anyhow, I went down there and keyed the crate, but auxey isn't coming back.  I'll give it a few more minutes and check again, but then I might go and power cycle it again.  If that doesn't work, we may have a much bigger problem.

  9396   Fri Nov 15 13:26:00 2013 JenneUpdateCDSAUXEY is back

Quote:

Quote:

Please just try rebooting the vxworks machine.  I think there is a key on the card or crate that will reset the device.  These machines are "embedded" so they're designed to be hard reset, so don't worry, just restart the damn thing and see if that fixes the problem.

 This is what I remember doing all the time when Rob was around, but with all the new computers, I forgot whether or not this was allowed for the slow computers.

Anyhow, I went down there and keyed the crate, but auxey isn't coming back.  I'll give it a few more minutes and check again, but then I might go and power cycle it again.  If that doesn't work, we may have a much bigger problem.

 I went and keyed the crate again, and this time the computer came back.  I burt restored to Nov 10th.  ETMY is damping again.

  9402   Mon Nov 18 21:20:54 2013 JenneUpdateCDSCan't talk to AUXEY?

Quote:

The restore scripts from the IFO config screen half-failed, with this error:

retrying (1/5)...
retrying (2/5)...
CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "c1auxey.martian:5064"
    Source File: ../cac.cpp line 1214
    Current Time: Wed Nov 13 2013 17:24:00.389261330
..................................................................

Jamie, do you know what this might be?  When requested, ETMY was not misaligned or restored, but we got these errors.  So, somehow we're not talking properly to EY, but other things seem fine (the models are running okay, the suspension is damped, etc, etc.)

 The auxey machine is back, in that I can interact with the IFO_ALIGN sliders, and they actually make the optic move, but I still can't read from or write to the EPICS channels:

controls@rossa:/opt/rtcds/caltech/c1/medm/MISC/ifoalign/burt 0$ cdsutils read C1:SUS-ETMY_PIT_COMM
CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "c1auxey.martian:5064"
    Source File: ../cac.cpp line 1214
    Current Time: Mon Nov 18 2013 21:13:52.044973819
..................................................................
Could not connect to channel (timeout=2s): C1:SUS-ETMY_PIT_COMM
controls@rossa:/opt/rtcds/caltech/c1/medm/MISC/ifoalign/burt 1$ cdsutils read C1:SUS-ETMY_YAW_COMM
CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "c1auxey.martian:5064"
    Source File: ../cac.cpp line 1214
    Current Time: Mon Nov 18 2013 21:14:07.040168660
..................................................................
Could not connect to channel (timeout=2s): C1:SUS-ETMY_YAW_COMM
controls@rossa:/opt/rtcds/caltech/c1/medm/MISC/ifoalign/burt 1$
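
When many channels fail at once, it can help to pull the host and the channel names out of this kind of output rather than reading it by eye. A rough sketch, assuming the output has exactly the layout pasted above. The function names are hypothetical, and this only parses text; it does not talk to Channel Access.

```python
import re


def parse_ca_exceptions(text):
    """Return one dict per CA.Client.Exception block found in text."""
    blocks = []
    current = None
    for line in text.splitlines():
        if line.startswith("CA.Client.Exception"):
            current = {}
            blocks.append(current)
        elif current is not None:
            if line.startswith("...."):
                # A row of dots closes the block.
                current = None
                continue
            match = re.match(r"\s*([^:]+):\s*(.*)", line)
            if match:
                key, value = match.groups()
                current[key.strip()] = value.strip().strip('"')
    return blocks


def failed_channels(text):
    """Return the channel names from 'Could not connect to channel' lines."""
    pattern = r"Could not connect to channel \(timeout=\d+s\): (\S+)"
    return re.findall(pattern, text)
```

Grouping the "Context" values from parse_ca_exceptions would show at a glance whether every failure points at the same machine, as it does here with c1auxey.martian:5064.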

This is also causing trouble for the BURT save and BURT restore scripts that are called from the IFO_ALIGN screen.  If I look at the log that is written from an attempted 'save' of the slider values, I see:

**** READ BURT LOGFILE

--- Start processing files
file >/opt/rtcds/caltech/c1/medm/MISC/ifoalign/burt/ETMY.req<
preprocessing ... done
pv >C1:SUS-ETMY_PIT_COMM< nreq=-1
pv >C1:SUS-ETMY_YAW_COMM< nreq=-1
--- End processing files

--- Start searches
C1:SUS-ETMY_PIT_COMM ... ca_search_and_connect() ... OK
C1:SUS-ETMY_YAW_COMM ... ca_search_and_connect() ... OK
--- End searches
Waiting for 2 outstanding search(es) ...
Waiting for 2 outstanding search(es) ...
did not find 2

--- Start reads
C1:SUS-ETMY_PIT_COMM ... not connected so no ca_array_get_callback()
C1:SUS-ETMY_YAW_COMM ... not connected so no ca_array_get_callback()
--- End reads

--- Start wait for pending reads

-- End wait for pending reads 0 outstanding read(s)

**** END BURT LOGFILE
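
For longer request files, the unconnected channels can be listed straight from a log like this one. A small sketch, assuming the "... not connected" wording shown above is what BURT writes for every unconnected PV. The function name is made up for illustration.

```python
def unconnected_pvs(burt_log):
    """Return the PV names that a BURT log reports as not connected."""
    marker = " ... not connected"
    pvs = []
    for line in burt_log.splitlines():
        if marker in line:
            # The PV name is everything before the marker.
            pvs.append(line.split(marker, 1)[0].strip())
    return pvs
```

Run on the log above, this would return the two ETMY slider channels.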

The burt save file has no values in it.  Even if I copy over the ETMX save file and put in the correct channel names and values, a burt restore is unsuccessful. 

So, I can do locking tonight by restoring and misaligning by hand, but this sucks, and needs to be fixed. Other optics (at least PRM, SRM, ETMX) seem to be working just fine.  It's just ETMY that has a problem.

 

  9412   Tue Nov 19 15:04:14 2013 JenneUpdateCDSCan talk to AUXEY again

The ETMY sliders on IFO_ALIGN were white again this morning, so I went down to the Yend and pushed the RESET button on auxey.  I then did a burt restore to 00:07 this morning for both auxey and auxex (since the stickers on the machines still follow the old naming convention, I wonder if the autoburt is also backwards, so I did both).  Now the 'save' and 'restore' scripts for ETMY are working again.

Hopefully it's all better now, but I'll keep an eye on it.

  9422   Fri Nov 22 09:54:22 2013 SteveUpdateCDSDAQ?

Jamie, I think the computers know that you are away. c1lsc keeps going down.

The short time plots are correct.

Attachment 1: comp8d.png
  9425   Mon Nov 25 10:57:14 2013 KojiUpdateCDSwoes on the X-end hosts

This morning I came in to the 40m and found
1) c1auxex was throwing out the same errors as recently seen.
2) c1iscex processes had errors which persisted even after the mx stream reset.

1) c1auxex - fixed

Tried telnet c1auxex => rejected by the host

Went down to the south end. Power cycled the target. Came back to the control room.
=> Confirmed the epics read/write is back.
Burtrestored the epics vars for the target to the snapshot on 31st Oct at 5:07.

2) c1iscex - still not fixed

ssh c1iscex
rtcds restart all
=> c1x01 is still in red. 
Followed the procedure on the elog entry 9007. => Still the same error.

At least c1x01 is stalled. Here is the status.

Sync Source is TDS.
C1:DAQ-DC0_C1X01_STATUS is 0x2bad.
C1:DAQ-DC0_C1X01_CRC_SUM stays 0.
The screen shot is attached.

dmesg related to c1x01

controls@c1iscex ~ 0$ dmesg |grep c1x01
[   32.152010] c1x01: startup time is 1069440223
[   32.152012] c1x01: cpu clock 3000325
[   32.152014] c1x01: Epics shmem set at 0xffffc9001489c000
[   32.152208] c1x01: IPC at 0xffffc90018947000
[   32.152209] c1x01: Allocated daq shmem; set at 0xffffc9000480c000
[   32.152210] c1x01: configured to use 4 cards
[   32.152211] c1x01: Initializing PCI Modules
[   32.152226] c1x01: ADC card on bus b; device 4 prim b
[   32.152227] c1x01: adc card on bus b; device 4 prim b
[   32.154801] c1x01: pci0 = 0xdc300400
[   32.154837] c1x01: pci2 = 0xdc300000
[   32.154842] c1x01: ADC I/O address=0xdc300000  0xffffc90003f62000
[   32.154845] c1x01: BCR = 0x84060
[   32.154858] c1x01: RAG = 0x117d8
[   32.154861] c1x01: BCR = 0x84260
[   32.583220] c1x01: SSC = 0x16
[   32.583223] c1x01: IDBC = 0x1f
[   32.583236] c1x01: DAC card on bus 14; device 4 prim 14
[   32.583237] c1x01: dac card on bus 14; device 4
[   32.584527] c1x01: pci0 = 0xdc400400
[   32.584546] c1x01: dac pci2 = 0xdc400000
[   32.584551] c1x01: DAC I/O address=0xdc400000  0xffffc90003f6a000
[   32.584555] c1x01: DAC BCR = 0x810
[   32.584678] c1x01: DAC BCR after init = 0x30080
[   32.584681] c1x01: DAC CSR = 0xffff
[   32.584687] c1x01: DAC BOR = 0x3415
[   32.584693] c1x01: set_8111_prefetch: subsys=0x8114; vendor=0x10e3
[   32.584722] c1x01: Contec 1616 DIO card on bus 23; device 0
[   32.593429] c1x01: contec 1616 dio pci2 = 0x4001
[   32.593430] c1x01: contec 1616 diospace = 0x4000
[   32.593434] c1x01: contec dio pci2 card number= 0x0
[   32.593439] c1x01: Contec BO card on bus 18; device 0
[   32.593447] c1x01: contec dio pci2 = 0x3001
[   32.593448] c1x01: contec32L diospace = 0x3000
[   32.593451] c1x01: contec dio pci2 card number= 0x0
[   32.593456] c1x01: 5565 RFM card on bus 7; device 4
[   32.597218] Modules linked in: c1x01(+) open_mx mbuf
[   32.599939]  [<ffffffffa002e430>] mapRfm+0x71/0x392 [c1x01]
[   32.600199]  [<ffffffffa002ec91>] mapPciModules+0x540/0x8cf [c1x01]
[   32.600458]  [<ffffffffa002f2c1>] init_module+0x2a1/0x9d6 [c1x01]
[   32.600717]  [<ffffffffa002f020>] ? init_module+0x0/0x9d6 [c1x01]
[   32.616194] c1x01: RFM address is 0xd8000000
[   32.616196] c1x01: CSR address is 0xdc000000
[   32.616206] c1x01: Board id = 0x65
[   32.616209] c1x01: DMA address is 0xdc000400
[   32.616213] c1x01: 5565DMA at 0xffffc90003f72400
[   32.616215] c1x01: 5565 INTCR = 0xf010100
[   32.616217] c1x01: 5565 INTCR = 0xf000000
[   32.616218] c1x01: 5565 MODE = 0x43
[   32.616220] c1x01: 5565 DESC = 0x0
[   32.616232] c1x01: 5 PCI cards found
[   32.616233] c1x01: ***************************************************************************
[   32.616234] c1x01: 1 ADC cards found
[   32.616235] c1x01:     ADC 0 is a GSC_16AI64SSA module
[   32.616236] c1x01:         Channels = 64
[   32.616236] c1x01:         Firmware Rev = 34
[   32.616238] c1x01: ***************************************************************************
[   32.616239] c1x01: 1 DAC cards found
[   32.616239] c1x01:     DAC 0 is a GSC_16AO16 module
[   32.616240] c1x01:         Channels = 16
[   32.616241] c1x01:         Filters = None
[   32.616242] c1x01:         Output Type = Differential
[   32.616242] c1x01:         Firmware Rev = 6
[   32.616244] c1x01: MASTER DAC SLOT 0 1
[   32.616244] c1x01: ***************************************************************************
[   32.616246] c1x01: 0 DIO cards found
[   32.616246] c1x01: ***************************************************************************
[   32.616248] c1x01: 0 IIRO-8 Isolated DIO cards found
[   32.616248] c1x01: ***************************************************************************
[   32.616250] c1x01: 0 IIRO-16 Isolated DIO cards found
[   32.616250] c1x01: ***************************************************************************
[   32.616252] c1x01: 1 Contec 32ch PCIe DO cards found
[   32.616252] c1x01: 1 Contec PCIe DIO1616 cards found
[   32.616253] c1x01: 0 Contec PCIe DIO6464 cards found
[   32.616254] c1x01: 2 DO cards found
[   32.616255] c1x01: TDS controller 0 is at 0
[   32.616256] c1x01: Total of 4 I/O modules found and mapped
[   32.616257] c1x01: ***************************************************************************
[   32.616259] c1x01: 1 RFM cards found
[   32.616260] c1x01:     RFM 0 is a VMIC_5565 module with Node ID 41
[   32.616261] c1x01: address is 0x18d80000
[   32.616261] c1x01: ***************************************************************************
[   32.616262] c1x01: Initializing space for daqLib buffers
[   32.616263] c1x01: Initializing Network
[   32.616264] c1x01: Found 1 frameBuilders on network
[   32.616265] c1x01: Epics burt restore is 0
[   33.616012] c1x01: Epics burt restore is 0
[   34.617018] c1x01: Epics burt restore is 0
[   35.618017] c1x01: Epics burt restore is 0
[   36.619011] c1x01: Epics burt restore is 0
[   37.621007] c1x01: Epics burt restore is 0
[   38.622008] c1x01: Epics burt restore is 0
[   39.733257] c1x01: Sync source = 4
[   39.733257] c1x01: Waiting for EPICS BURT Restore = 1
[   39.793001] c1x01: Waiting for EPICS BURT 0
[   39.793001] c1x01: BURT Restore Complete
[   39.793001] c1x01: Found a BQF filter 0
[   39.793001] c1x01: Found a BQF filter 1
[   39.793001] c1x01: Initialized servo control parameters.
[   39.794002] c1x01: DAQ Ex Min/Max = 1 3
[   39.794002] c1x01: DAQ XEx Min/Max = 3 53
[   39.794002] c1x01: DAQ Tp Min/Max = 10001 10007
[   39.794002] c1x01: DAQ XTp Min/Max = 10007 10507
[   39.794002] c1x01: DIRECT MEMORY MODE of size 64
[   39.794002] c1x01: daqLib DCU_ID = 19
[   39.794002] c1x01: Calling feCode() to initialize
[   39.794002] c1x01: entering the loop
[   39.794002] c1x01: ADC setup complete
[   39.794002] c1x01: DAC setup complete
[   39.794002] c1x01: writing BIO 0
[   39.814002] c1x01: writing DAC 0
[   39.814002] c1x01: Triggered the ADC
[   40.874003] c1x01: timeout 0 1000000
[   40.874003] c1x01: exiting from fe_code()

 

Attachment 1: Screenshot.png
  9426   Mon Nov 25 12:57:54 2013 JamieUpdateCDStiming problem at c1iscex IO chassis

There is definitely a timing distribution malfunction at the c1iscex IO chassis.  There is no timing link between the "Master Timer Sequencer D050239" at the 1X6 and the c1iscex IO chassis.  Link lights at both ends are dead.  No timing, no running models.

It does not appear to be a problem with the Master Timer Sequencer.  I moved the c1iscey link to the J15 port on the sequencer and it worked fine.  This means it's either a problem with the fiber or the timing card in the IO chassis.  The IO timing card is powered and does have what appear to be normal status lights on (except for the fiber link lights).  It's getting what I think is the nominal 4V power.  The connection to the IO chassis backplane board looks ok.  So maybe it's just a dead fiber issue?

I do not know what could have been the problem with c1auxex, or if it's related to the fast timing issue.

  9427   Mon Nov 25 17:28:33 2013 JenneUpdateCDStiming problem at c1iscex IO chassis

Quote:

There is definitely a timing distribution malfunction at the c1iscex IO chassis.  There is no timing link between the "Master Timer Sequencer D050239" at the 1X6 and the c1iscex IO chassis.  Link lights at both ends are dead.  No timing, no running models.

It does not appear to be a problem with the Master Timer Sequencer.  I moved the c1iscey link to the J15 port on the sequencer and it worked fine.  This means it's either a problem with the fiber or the timing card in the IO chassis.  The IO timing card is powered and does have what appear to be normal status lights on (except for the fiber link lights).  It's getting what I think is the nominal 4V power.  The connection to the IO chassis backplane board looks ok.  So maybe it's just a dead fiber issue?

I do not know what could have been the problem with c1auxex, or if it's related to the fast timing issue.

 Steve and Koji looked around, and called around, and there seem to be no spare fibers that are long enough to reach the end, so Steve has ordered

"Tripp Lite N520-30M 100' Multimode Duplex 50/125 Fiber Optic Patch Cable LC/LC"

 and it should be here tomorrow.

  9428   Wed Nov 27 14:45:49 2013 JenneUpdateCDStiming problem at c1iscex IO chassis

 [Koji, Jenne]

The new fiber arrived today, and we tried it out.  No luck.  We think it is the timing card, so we'll need to get one, since we can't find a spare.

Order of operations:

* Lay new fiber on floor, plugged it in at both ends, saw no fiber link lights.

* From control room, killed all models running on c1iscex, shutdown computer.  Still no link lights.

* Power cycled computer and IO chassis.

* Tried plugging new fiber into different port on Master Timing Sequencer, with other end still plugged in to c1iscex.  Still no link lights.

* Looked around with flashlight at Xend IO chassis.  The board that the fiber is connected to does not have a power light, although the board next to it has 2.  We compared with the SUS IO chassis, and the board there with the fiber has one power light, plus the fiber link lights, as well as 2 on the board next to the fiber.  So, perhaps there's a problem with power distribution on the timing board at the Xend? 

* Unplugged and replugged the power connector to the timing board, inside the IO chassis.  The board next to the fiber's board got its lights back, but the fiber's board did not.  However, power must be going through the board with the fiber attached to reach the next board, so there's power on at least some part of the timing board, just not the whole thing.

From this, we conclude that the blue fiber that was in place is probably fine (or is not found guilty), and that we need a replacement timing board.  Koji didn't find one in the "CDS stuff" boxes underneath the Jenne Laser, and I feel like I recall Jamie saying that we would have to get a spare from somewhere else.  We rolled up the new spare fiber, and put it in the box with other "CDS Stuff" under the Jenne Laser table.

  9429   Wed Nov 27 16:29:21 2013 JenneUpdateCDSAccidentally turned off SUS IO chassis

[Jenne, Koji]

I was trying to lock the Yarm, and saw that I was not getting signals to go between the LSC and SCY models.  I had digital zeros for TRY, and when I overrode the trigger and tried to force signal to ETMY, I had digital zeros at the SUS-ETMY_LSC input. The corresponding filter bank in the rfm model was receiving signals, so the Dolphin connection between LSC and SUS was okay, it was just the RFM connection going to the end station that wasn't succeeding. 

Koji restarted the c1scy model, and then went inside the IFO room, and found that the SUS IO chassis power was off.  We must have accidentally turned it off while we were in there earlier.  Koji turned on the power, and also restarted the rfm model, and we now have real signals going back and forth. 

Yarm is locked, ASS worked nicely, etc, etc, so things seem normal again (with the Yarm....ETMX stuff is still out of order).

  9432   Mon Dec 2 14:24:10 2013 SteveUpdateCDScomputer problems

Rack 1x6 is very noisy.

 SunFire X4600 computer: FB (directly below Megatron) has its yellow warning light on. It must be losing one of its fan bearings.

 

Jetstore's error message: IDE channel #2 reading error

Attachment 1: c1iscex.png
Attachment 2: 1X6.JPG
  9433   Mon Dec 2 16:04:47 2013 JamieUpdateCDSc1iscex timing problem mysteriously disappears??? (thanksgiving miracle???)

Quote:

There is definitely a timing distribution malfunction at the c1iscex IO chassis.  There is no timing link between the "Master Timer Sequencer D050239" at the 1X6 and the c1iscex IO chassis.  Link lights at both ends are dead.  No timing, no running models.

It does not appear to be a problem with the Master Timer Sequencer.  I moved the c1iscey link to the J15 port on the sequencer and it worked fine.  This means it's either a problem with the fiber or the timing card in the IO chassis.  The IO timing card is powered and does have what appear to be normal status lights on (except for the fiber link lights).  It's getting what I think is the nominal 4V power.  The connection to the IO chassis backplane board looks ok.  So maybe it's just a dead fiber issue?

I do not know what could have been the problem with c1auxex, or if it's related to the fast timing issue.

I just got over here from Downs, where I managed to convince Todd to let me borrow one of their three remaining timing slave boards for c1iscex.  I walked down to the X end to replace the board only to discover that the link light on the existing timing board was back!  c1iscex was not responding, so I hard rebooted the machine, and everything came up rosy (all green!):

festatus.png

To repeat, I DID NOTHING.  The thing was working when I got here.  I have no idea when it came back, or how, but it's at least working for the moment.  I re-enabled the watchdog for ETMX SUS and it's now damped normally.

I'm going to hold on to the timing card for a couple of days, in case the failure comes back, but we'll need to return it to Downs soon, and probably think about getting some spare backups from Columbia.

  9434   Mon Dec 2 17:05:13 2013 JenneUpdateCDSc1iscex timing problem mysteriously disappears??? (thanksgiving miracle???)

Steve was trying to do something to it this morning, but I'm not exactly clear on what it was.  Maybe that helped?  Steve, can you tell us what you were trying to do this morning?

  9435   Tue Dec 3 07:42:23 2013 SteveUpdateCDSc1iscex timing problem mysteriously disappears??? (thanksgiving miracle???)

Quote:

Steve was trying to do something to it this morning, but I'm not exactly clear on what it was.  Maybe that helped?  Steve, can you tell us what you were trying to do this morning?

 I was trying to repeat elog 9007.  I only got to line 2 of Koji's Solution when Ottavia, where I was working, shut down.  That was all I did.

  9436   Tue Dec 3 17:08:06 2013 KojiUpdateCDScomputer problems

It seems that the front fan unit was running at the full speed. The fan itself seems still OK.

I talked with Jamie and did a power cycle (i.e. shut down gracefully, unplugged the power supply cables (x4), plugged them in again, and pushed the power button).

The warning signal went off and the fan is quiet. FOR NOW.

Now, daqd and ndsd are down.

FB cannot mount /opt/rtcds and /cvs/cds during its boot.

After mounting these manually, I tried to run /opt/rtcds/caltech/c1/target/fb/start_daqd.inittab and /opt/rtcds/caltech/c1/target/fb/start_nds.inittab
but they don't keep running.

I'll be back to this issue tomorrow with Jamie's help.

  9437   Wed Dec 4 12:02:39 2013 KojiUpdateCDSFB restored

Now FB is fixed: daqd and nds are running


When I rebooted FB, I noticed that none of the nfs file systems were mounted.
I started tracking down the issues from here.

I googled the common issues of the nfs mounting during the boot sequence.
- It is good to give the "_netdev" option in fstab so that the file system is mounted only after the network connection is established.

- The "auto" option specifies that the file system is mounted when mount -a is run.

Resulting /etc/fstab is this:

/dev/sdb1                            /            ext3    noatime                    0 1
/swapfile                            none         swap    sw                         0 0
shm                                  /dev/shm     tmpfs   nodev,nosuid,noexec        0 0
/dev/sda1                            /frames      ext3    noatime                    0 0
linux1:/home/cds/                    /cvs/cds     nfs     _netdev,auto,rw,bg,soft    0 0
linux1:/home/cds/rtcds               /opt/rtcds   nfs     _netdev,auto,rw,bg,soft    0 0
linux1:/home/cds/rtapps              /opt/rtapps  nfs     _netdev,auto,rw,bg,soft    0 0
linux1:/home/cds/caltech/apps/linux  /opt/apps    nfs     _netdev,auto,rw,bg,soft    0 0
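
A quick way to confirm that every nfs line really carries the "_netdev" option is to parse the file instead of eyeballing the columns. A minimal sketch, assuming the usual whitespace-separated fstab fields (device, mount point, type, options). The function name is hypothetical.

```python
def nfs_entries_missing_option(fstab_text, option="_netdev"):
    """Return mount points of NFS entries whose option list lacks `option`."""
    missing = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            # Not a complete fstab entry.
            continue
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type.startswith("nfs") and option not in options.split(","):
            missing.append(mount_point)
    return missing
```

For the fstab above this would return an empty list, since all four nfs entries include "_netdev".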

But this alone still didn't get the nfs file systems mounted at boot. I dug into google again and found a command "/sbin/rc-update".
"/sbin/rc-update show" shows what services are activated at boot. It did not include "nfsmount". So the following command
was executed:

 

> sudo /sbin/rc-update add nfsmount boot

> /sbin/rc-update show

* Broken runlevel entry: /etc/runlevels/boot/portmap
            bootmisc | boot                         
             checkfs | boot                         
           checkroot | boot                         
               clock | boot                         
         consolefont | boot                         
               dcron |      default                 
               dhcpd |      default                 
            hostname | boot                         
            in.tftpd | boot                         
             keymaps | boot                         
               local |      default nonetwork       
          localmount | boot                         
             modules | boot                         
               monit |      default                 
                  mx |      default                 
            net.eth0 |      default                 
              net.lo | boot                         
            netmount |      default                 
                 nfs | boot                         
            nfsmount | boot                         
          ntp-client | boot default                 
           rmnologin | boot                         
           rpc.statd | boot                         
                sshd | boot                         
           syslog-ng | boot                         
      udev-postmount |      default                 
             urandom | boot                         
              xinetd |      default
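
The same check (is "nfsmount" in the boot runlevel?) can be scripted against the "rc-update show" text. A sketch, assuming the "service | runlevels" layout pasted above. The function name is made up for illustration, and it only reads the text; it does not call rc-update itself.

```python
def runlevels_for(rc_update_output, service):
    """Return the runlevels listed for `service`, or None if it is absent."""
    for line in rc_update_output.splitlines():
        if "|" not in line:
            # Skips warnings such as the "Broken runlevel entry" line.
            continue
        name, _, levels = line.partition("|")
        if name.strip() == service:
            return levels.split()
    return None
```

With the output above, runlevels_for(text, "nfsmount") would give ["boot"]. Before the fix it would presumably have returned None, since the service was not listed.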

After rebooting, I confirmed that the nfs file systems are correctly mounted
and daqd and nds are automatically started.

This means that FB had never been configured to run correctly at boot. Shame on you!
