c1iscex does not even see its 32-channel Binary Output card. This means we have no control over the state of the analog whitening and dewhitening filters. The ADC, DAC, and the 1616 Binary Input/Output cards are recognized and working.
Tried recreating the IOP code from the known working c1x02 (from the c1sus front end), but that didn't help.
Checked seating of the card, but it seems correctly socketed and tightened down nicely with a screw.
Tomorrow will try moving cards around and see if there's an issue with the first slot, which the Binary Output card is in.
The ETMX is currently damping, including the POS, PIT, YAW, and SIDE degrees of freedom. However, the GDS screen is showing a 0x2bad status for the c1scx front end (the IOP seems fine, with a 0x0 status). So for the moment, I can't seem to bring up c1scx testpoints. I was able to do so earlier when I was testing the status of the binary outputs, so something broke during one of the rebuilds. I may have to undo the SVN update and/or a change Alex made today to allow filter bank names longer than 19 characters.
c1iscex was spamming the network with error messages.
Updated the front end codes to current standards (they were on the order of months out of date). After fixing them up and rebuilding the codes on c1iscex, it no longer had problems connecting to the frame builder.
I can look at test points for ETMX. It is not currently damping however.
Move filters for ETMX into the correct files.
Need to add a Binary output blue and gold box to the end rack, and plug it into the binary output card. Confirm the binary output logic is correct for the OSEM whitening, coil dewhitening, and QPD whitening boards.
Get ETMX damped.
Figure out what we're going to do with the aux crate, which is currently running y-end code at the new x-end. Koji suggested simply swapping the auxiliary crates - this may be the easiest. The other option would be to change the IP address, so that when it PXE boots it grabs the x-end code instead of the y-end code.
Current CDS status:
Here is what was done (Jamie will correct me if I am mistaken).
So while we are in a better state now, the problem isn't fully solved.
Comment: there seems to be a built-in timeout for testpoints opened with DTT - if the measurement is inactive for some time (unsure exactly how long, but something like 5 minutes), the testpoint is automatically closed.
This morning, all the c1iscex models were dead. Attachment #1 shows the state of the cds overview screen when I came in. The machine itself was ssh-able, so I just restarted all the models and they came back online without fuss.
This was me. I had rebooted that machine and hadn't restarted the models. Sorry for the confusion.
c1iscex is dead again. Red lights, no "breathing" on the FE status screen.
The c1iscex machine itself wasn't dead, the models were just not running. Here are the last messages in dmesg:
[130432.926002] c1spx: ADC TIMEOUT 0 7060 20 7124
[130432.926002] c1scx: ADC TIMEOUT 0 7060 20 7124
[130433.941008] c1x01: timeout 0 1000000
[130433.941008] c1x01: exiting from fe_code()
I'm guessing maybe the timing signal was lost, so the ADC stopped clocking. Since the ADC clock is the "everything" clock, all the "fe" code (i.e., the models) aborted. Not sure what would have caused it.
I restarted all the models ("rtcds restart all") and everything came up fine. Obviously we should keep our eyes on things, and note anything strange that is happening if this occurs again.
c1iscex was behaving very strangely this morning. Steve earlier reported that he was having trouble pulling up some channels from the c1scx model. I went to investigate and noticed that indeed some channels were not responding.
While I was in the middle of poking around, c1iscex stopped responding altogether. I walked down there and did a hard reset. Once it rebooted and I did a burt restore from early this morning, everything appeared to be working again.
The fact that problems were showing up before the machine crashed worries me. I'll try to investigate more this afternoon.
I started to modify the c1asx model to keep the RFM model from hitting its max time.
Instead of bringing in ASS, I have modified ASX to do everything and only the clock signals to ITMX pitch and yaw are now going through RFM. RFM is still hitting 62usec and I suppose that is because of the problems with c1iscex.
c1iscex not happy
Cause and symptoms
While restarting the models, c1iscex crashed a couple of times because of some errors and had to be powercycled. The models were modified and they seem to start ok.
But it looks like there is something wrong with c1iscex since the models were started. The GPS time is off and C1:DAQ-DC0_C1X01_CRC_SUM keeps building up even for c1x01 which was left untouched.
1. Since c1x01 and c1spx were not touched, c1scx and c1asx were killed and we tried to start the other models. This did not help.
2. Koji did a manual daqd restart which did not help either.
We are leaving c1iscex as is for the time being and calling Jamie for help.
P.S. While making the models, I had created IPCx_PCIE blocks in c1iscex which do not exist. I changed them to RFM and SHMEM blocks. The model then would not compile, only spitting IPCx mismatch errors. After some struggle and an elog search, I figured out from an old elog that even though the IPCx blocks are changed in the model, the old junk persists in the ipc file in the chans directory. I deleted all the junk channels related to the ASX model, and the model compiled right away.
The Sorensen power supply output of +15V at rack 1X9 was current limited to 10.3V @ 2A.
Increased the current limit to 2.1A and the voltage is back up to 14.7V.
I'm not sure why, but c1iscex did not want to do an mxstream restart. It would complain at me that "* ERROR: mx_stream is already stopping."
Koji suggested that I reboot the machine, so I did. I turned off the ETMX watchdog, and then did a remote reboot. Everything came back nicely, and the mx_stream process seems to be running.
I swapped out the IO chassis, which could only handle 3 PCIe cards, for another chassis which has space for 17 but which previously had timing issues. A new cable going between the timing slave and the rear board seems to have fixed the timing issues.
I'm hoping to get a replacement PCI extension board which can handle more than 3 cards this week from Rolf and then eventually put it in the Y-end rack. I'm also still waiting for a repaired Host interface board to come in for that as well.
At this point, RFM is working to c1iscex, but I'm still debugging the binary outputs to the analog filters. As of this time they are not working properly: turning the digital filters on and off seems to have no effect on the transfer function measured from an excitation in SUSPOS, all the way around to IN1 of the sensor inputs (but before the digital filters). Ideally I should see a difference when I switch the digital filters on and off (since the analog ones should also switch on and off), but I do not.
We noticed that the iscex computer is still down, but the IOP is (was) running. When we sat down to look at it, c1x01 was 'breathing', had a non-zero CPU_METER time, and the error was 0x4000, which I've never seen before. The fb connection was still red though. Also, it is claiming that its sync source is 1pps, not TDS like it usually is.
Since things were different, Koji restarted the 2 other models running on iscex, with no resulting change. We then did a 'rtcds restart all', and the IOP is no longer breathing, and the error message has changed to 0xbad. The sync source is still 1pps.
Moral of the story: c1iscex is still down, but temporarily showed signs of life that we wanted to record.
There's definitely a timing issue with this machine. I looked at it a bit yesterday. I'll try to get to it by the end of the week.
There is definitely a timing distribution malfunction at the c1iscex IO chassis. There is no timing link between the "Master Timer Sequencer D050239" at the 1X6 and the c1iscex IO chassis. Link lights at both ends are dead. No timing, no running models.
It does not appear to be a problem with the Master Timer Sequencer. I moved the c1iscey link to the J15 port on the sequencer and it worked fine. This means its either a problem with the fiber or the timing card in the IO chassis. The IO timing card is powered and does have what appear to be normal status lights on (except for the fiber link lights). It's getting what I think is the nominal 4V power. The connection to the IO chassis backplane board look ok. So maybe it's just a dead fiber issue?
I do not know what could have been the problem with c1auxex, or if it's related to the fast timing issue.
I just got over here from Downs, where I managed to convince Todd to let me borrow one of their three remaining timing slave boards for c1iscex. I walked down to the X end to replace the board only to discover that the link light on the existing timing board was back! c1iscex was not responding, so I hard rebooted the machine, and everything came up rosy (all green!):
To repeat, I DID NOTHING. The thing was working when I got here. I have no idea when it came back, or how, but it's at least working for the moment. I re-enabled the watchdog for ETMX SUS and it's now damped normally.
I'm going to hold on to the timing card for a couple of days, in case the failure comes back, but we'll need to return it to Downs soon, and probably think about getting some spare backups from Columbia.
Steve was trying to do something to it this morning, but I'm not exactly clear on what it was. Maybe that helped? Steve, can you tell us what you were trying to do this morning?
I was trying to repeat elog 9007. I only got to line 2 of Koji's Solution when Ottavia, where I was working, shut down. That was all I did.
I tried all versions of power cycling and debugging this problem known to me, including those suggested in this thread and from more recent times. I am leaving things as is for the night and will look into this more tomorrow. I've also shut down the ETMX watchdog for the time being. Looks like this has been down since 24 Jun 8am UTC.
I tried a couple of things, but no fundamental improvement of the missing LED light on the timing board.
- The power supply cable to the timing board at c1iscex indicated +12.3V
- I swapped the timing fiber to the new one (orange) in the digital cabinet. It didn't help.
- I swapped the opto-electronic I/F for the timing fiber with the Y-end one. The X-end one worked at Y-end, and Y-end one didn't work at X-end.
- I suspected the timing board itself -> I brought a "spare" timing board from the digital cabinet and tried to swap the board. This didn't help.
- Bring the X-end fiber to C1SUS or C1IOO to see if the fiber is OK or not.
- We checked that the opto-electronic I/F is OK
- Try to swap the IO chassis with the Y-end one.
- If this helps, swap the timing board only, to see whether this is the problem or not.
There were a few more flaky things in the Expansion chassis - the IDE connectors don't have "keys" that fix the orientation they should go in, and the whole timing card assembly is kind of difficult and not exactly secure. But for now, things are back to normal it seems.
Steve and I inadvertently discovered that the c1iscey IO chassis doesn't have brackets to secure the cards where the ADC/DAC cables are connected, making them very easy to knock loose. All other IO chassis have these brackets. Pictures of c1iscey and c1lsc IO chassis to compare:
Brackets for the c1iscey IO chassis cards have been installed. Now, I can't unseat the cards by wiggling the ADC or DAC cable.
We found that both of the c1iscey models (c1x05 and c1scy) were unresponsive, and weren't coming back up even after reboot. We then found that the c1iscey IO chassis was actually powered off. Steve accepts some sort of responsibility, since he was monkeying around down there for some reason. After powerup and reboot, everything is running again.
while I was not doing anything on the machine.
This morning the LSC scripts weren't running properly. I had to reboot c1iscey, c1iscex, c1lsc, and c1asc.
I burtrestored to Monday January 25 at 12:00.
The tip-tilt SOS dewhite/AI boards are now connected to the digital system.
I put together a chassis for one of our spare DAC -> IDC interface boards (maybe our last?). A new SCSI cable now runs from DAC0 in the c1lsc IO chassis in 1Y3 to the DAC interface chassis in 1Y2.
Two homemade ribbon cables go directly from the IDC outputs of the interface chassis to the 66-pin connectors on the backplane of the Eurocrate. They do not go through the cross-connects, because cross-connects are stupid. They go directly to the lower connectors for slots 1 and 3, which are the slots for the SOS DW/AI boards. I had to custom-make these cables, of course, and it was only slightly tricky to get the correct pins to line up. I should probably document the cable pin-outs.
As reported in a previous log in this thread, I added control logic to the c1ass front-end model for the tip-tilts. I extended it to include TT_CONTROL (model part) for TT3 and TT4 as well, so we're now using all channels of DAC0 in c1lsc for TT control.
I tested all channels by stepping through values in EPICS and reading the monitor and SMA outputs of the DW/AI boards. The channels all line up correctly, and all checked out with a full ±10V swing on their outputs for a full ±32k count swing of the DAC outputs.
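The scale factor implied by that check can be written down explicitly. This is just a sketch; the helper name is made up, not part of the real-time code - only the ±10 V per ±32768 counts full scale comes from the measurement above:

```c
/* DAC counts to volts at the DW/AI board output, per the full-scale
 * check above: +32768 counts at the DAC -> +10 V at the DW/AI output.
 * The function name is illustrative, not part of the real-time code. */
double dac_counts_to_volts(int counts)
{
    return counts * (10.0 / 32768.0);  /* 16-bit DAC, +-10 V full scale */
}
```

So, for example, a half-scale step of 16384 counts in EPICS should read 5 V at the SMA monitor.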
We're using SN 1 and 2 of the SOS DW/AI boards (seriously!)
The output channels look ok, and not too noisy.
Tomorrow I'll get new SMA cables to connect the DW/AI outputs to the coil driver boards, and I'll start testing the coil driver outputs.
As a reminder: https://wiki-40m.ligo.caltech.edu/Suspensions/Tip_Tilts_IO
I've found a nice 16-twisted-pair cable, ~25m long, and decided to use it as the run from 1Y3 to the clean room instead of buying a new long cable. I've added a breakout board at the coil driver end to monitor the outputs.
The full cable path from coil driver to OSEM input is now ready. I've tested Ch1-4 of the left AI and left coil driver; the 15-pin outputs and monitors show the voltages we expect. I've checked the voltage on the other side of the cable in the clean room, and it is correct. We are ready to test the coils. We need to bake the OSEM cables ASAP; hopefully Bob will start this job tomorrow.
c1lsc FE is up and running.
2) The machine was manually rebooted.
3) c1daf was recompiled and installed, with the problematic piece of code removed.
4) NTP timing was adjusted.
5) Frame Builder was restarted.
6) All models on c1lsc machine were restarted.
Attachment 1 shows the CDS status after the recovery. I won't be trying to run frequency warping immediately; I will first finish implementing the other harmless modules.
Today, at around 10:30, c1lsc machine froze and stopped responding to ping and ssh after I compiled and restarted c1daf. I think it is due to a large array in one of my codes. The daqd.log file shows the following:
Warning: "Virtual circuit unresponsive"
Source File: ../tcpiiu.cpp line 945
Current Time: Thu Jul 14 2016 22:27:42.102649102
I think the c1lsc FE may need a hard reboot.
c1lsc is up and running, Eric did a manual reboot today.
c1lsc and c1sus are still down. Only ETMX and ETMY are damped
[Mirko / Jenne / Kiwamu]
Just a quick update. All the realtime processes on the c1lsc and c1sus machine didn't run at all.
Somehow the c1xxxfe.ko kernel modules (where xxx is x04, x02, lsc, ass, sus, mcs, pem, and rfm) failed to insmod.
The timing indicators on the c1lsc and c1sus machine are saying NO SYNC.
- According to log files (target/c1lsc/logs/log.txt)
insmod: error inserting '/opt/rtcds/caltech/c1/target/c1lsc/bin/c1lscfe.ko': -1 Unknown symbol in module
- dmesg on c1lsc (c1sus also dumps the same error message):
[ 45.831507] DXH Adapter 0 : sci_map_segment - Failed to map segment - error=0x40000d01
[ 45.833006] c1x04: DIS segment mapping status 1073745153
The DXH adapter is part of the Dolphin connections.
When a realtime code starts up, it checks the Dolphin connections.
The checking procedure is defined in dolphin.c (/src/fe/dolphin.c).
According to a printk statement in dolphin.c, the second error message listed above would report status "0" if everything were fine.
The first error above is an error code from a Dolphin function called sci_map_segment, which is called in dolphin.c.
So something failed in this sci_map_segment function and is preventing the realtime code from starting up.
Note that sci_map_segment is defined in genif.h and genif.c, which reside in /opt/srcdis/src/IRM_DSX/drv/src.
[Jenne, Mirko, Kiwamu, Koji, and Jamie by phone]
We just got off the phone with Jamie. In addition to all the stuff that Kiwamu mentioned, Mirko reverted the c1oaf model and C-code to stuff that was working successfully on Friday (using "svn export, rev # 1134) since that's what we were working on when all hell broke loose.
We did a few rounds of "sudo shutdown -h now" on c1lsc and c1sus machines, and pulled the power cords out.
We also switched the c1ioo and c1lsc 1PPS fibers in the fanout chassis, to see if that would fix the problem. Nope. c1ioo is still fine, and c1lsc is still not fine.
Still getting "No Sync".
We're going to call in Alex in the morning, if we can't figure it out soon.
Alex fixed the computers this morning. It was in fact a dolphin problem:
Hi Jenne, figured it out. Even though dxadmin said the Dolphin net was fine, it wasn't. Something happened to the DIS network manager and I had to restart it. It is running on fb:
controls@fb ~ $ ps -e | grep dis
12280 ?        00:00:00 dis_networkmgr
controls@fb ~ $ sudo /etc/init.d/dis_networkmgr restart
Once the restart was done, both the c1lsc and c1sus nodes were configured correctly: they printed the usual "node 12 is OK" / "node 8 is OK" messages into dmesg, and I was able to run /etc/start_fes.sh on lsc and sus to load all the FEs. Alex
Some lights on c1lsc were still red: C1:DAQ-FBO_C1SYS_SYS and the smaller red light left of it. Restarted the fb. Didn't help. Restarted c1lsc, all green now.
Restored autoburt from Oct 3. 19:07 on c1lsc and c1sus.
This is my third time to crash a real-time machine. This time I crashed c1lsc.
I physically rebooted c1lsc machine by pushing the power button and it came back and now running fine.
(what I did)
The story is almost the same as the last two times (1st time, 2nd time).
I edited the c1lsc.ini file using daqconfig and then shut down the daqd running on fb.
Some indicators for c1lsc on the C1_FE_STATUS screen became red. So I hit the 'DAQ reload' button on the C1LSC_GDS_TP screen.
Then c1lsc died and didn't respond to ping.
Why is this happening so frequently now? Last few lines of error log:
I fixed it by running the reboot script.
c1lsc has been unstable since last night. Its status on the DAQ screen was oscillating between green and red every minute.
Yesterday, I power-cycled it. That brought it back, but the MC got unlocked and the autolocker could not engage. I think it's because the power cycling also turned off c1iscaux2, which occupies the same rack crate.
Killing the autolocker on op340 and restarting it didn't work, so I also rebooted c1dcuepics and burt-restored almost all snapshot files. To do that, as usual, I had to edit the snapshot files of c1dcuepics to move the quotes from the last line.
After that I restarted the autolocker, and that time it worked.
This morning c1lsc was again in the same unstable state as yesterday. This time I just reset it (no power cycling) and then restarted it. It worked, and now everything seems to be fine.
c1lsc was down this morning.
I restarted fb and c1lsc based on elog
Everything but c1oaf came back. I tried to restart c1oaf individually, but it didn't work.
Since the lab-wide computer shutdown last Wednesday, all the realtime models running on c1lsc have been flaky. The error is always the same:
[58477.149254] c1cal: ADC TIMEOUT 0 10963 19 11027
[58477.149254] c1daf: ADC TIMEOUT 0 10963 19 11027
[58477.149254] c1ass: ADC TIMEOUT 0 10963 19 11027
[58477.149254] c1oaf: ADC TIMEOUT 0 10963 19 11027
[58477.149254] c1lsc: ADC TIMEOUT 0 10963 19 11027
[58478.148001] c1x04: timeout 0 1000000
[58479.148017] c1x04: timeout 1 1000000
[58479.148017] c1x04: exiting from fe_code()
This has happened at least 4 times since Wednesday. The reboot script makes recovery easier, but doing it once in 2 days is getting annoying, especially since we are running many things (e.g. ASS) in custom configurations which have to be reloaded each time. I wonder why the problem persists even though I've power-cycled the expansion chassis? I want to try and do some IFO characterization today so I'm going to run the reboot script again but I'll get in touch with J Hanks to see if he has any insight (I don't think there are any logfiles on the FEs anyways that I'll wipe out by doing a reboot). I wonder if this problem is connected to DuoTone? But if so, why is c1lsc the only FE with this problem? c1sus also does not have the DuoTone system set up correctly...
The last time this happened, the problem apparently fixed itself, so I still don't have any insight as to what is causing it in the first place. Maybe I'll try disabling c1oaf, since that's the configuration we've been running in for a few weeks.
The c1lsc computer is running Gentoo off of the fb server. It has been connected to the DAQ network and is handling mx_streams properly (so we're not flooding the network with error messages like we used to with c1iscex). It is using the old c1lsc IP address (192.168.113.62). It can be ssh'd into.
However, it is not talking properly to the IO chassis. The IO chassis turns on when the computer turns on, but the host interface board in the IO chassis only has 2 red lights on (as opposed to many green lights on the host interface boards in the c1sus, c1ioo, and c1iscex IO chassis). The c1lsc IO processor (called c1x04) doesn't see any ADCs, DACs, or Binary cards. The timing slave is receiving 1PPS and is locked to it, but because the chassis isn't communicating, c1x04 is running off the computer's internal clock, causing it to be several seconds off.
Need to investigate why the computer and chassis are not talking to each other.
The c1sus and c1ioo computers are not talking properly to the frame builder. A reboot of c1iscex fixed the same problem earlier. However, since Kiwamu and Suresh are working in the vacuum, I'm leaving those computers alone for the moment; a reboot and burt restore should probably be done later today for c1sus and c1ioo.
When Evan and I were dithering the BS and ITMY (see his elog), I noticed that c1lsc was acting weird. The IOP was the only one with the blinky heartbeat. The IOP was all green lights, but all the other models had red for the fb connection, as well as the rightmost indicator (I don't know what that one is for). I logged on to c1lsc and ran 'rtcds restart all'. The script didn't get anywhere beyond saying it was beginning to stop the first model (sup, the bottom one on the lsc list). Then all of the CPU indicators went white. I can still ping c1lsc, but I can't ssh to it.
I'm not sure what to do here Jamie. Heelp.
Manasa told me that she did things in a different order than her old elog.
(1) ssh'ed to c1lsc and did a remote shutdown / restart,
(2) restarted fb,
(3) restarted the mxstream on c1lsc,
(4) restarted each model individually in some order that I forgot to ask.
However, with the situation as in her "before" screenshot, all that needed to be done was restart the mxstream process on c1lsc.
Anyhow, when I looked at the OAF model, it was complaining of "no sync", so I restarted the model, and it came back up fine. All is well again.
This morning I again killed a real-time machine with the new realization of the FxLMS algorithm; this time I crashed c1lsc. It works fine with the gcc compiler during tests; however, something forbidden to the kernel is going on. I'll spend some more time investigating it. The interesting thing is that I had not even pressed "On" on the OAF MEDM screen to make the code run - c1lsc suspended even before that. Maybe there is some function-name mismatch.
After the c1lsc suspension I recompiled the old non-working code back and rebooted c1lsc. c1sus was also bad after the c1lsc reboot, as they communicate. I killed the x04, lsc, ass, and oaf models on the c1lsc computer and sus, mcs, rfm, and pem on the c1sus computer. Then I restarted the x02 model and restored its burt snapshot from 08:07. After that I started all the models back and restored their burt snapshots from 08:07. Then I diag-reset all the started models.
Before starting the new FxLMS code, I shut down all the optics so that a possible c1lsc suspension would not make them crazy. After the reboot I turned the coils back on. Everything seems to work fine.
The reason I killed the c1lsc kernel was the following: when the code starts to run, it initializes some parameters, and this takes ~0.2 msec per DOF. Now, the old code did nothing with a DOF if C1:OAF-ADAPT_???_ONOFF == OFF. My code still initialized the parameters but then did nothing, because no witness channels are given; however, it spent 8*0.2 = 1.6 msec initializing all 8 DOFs. As the code is called at 2 kHz, i.e. with a 0.5 msec cycle budget, this was the reason for the crash. Now I've corrected my code; it compiles, runs, and does not kill c1lsc. However, the old code would also kill the kernel if all DOFs are filtered. So, when we use all 8 DOFs, we'll have to split the variable initialization.
But this is not the biggest problem. The c1oaf model must be corrected because, as of now, all 8 DOFs call the same ADAPT_XFCODE function. As this function uses static variables, its state will get all messed up by the different DOF signals.
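To illustrate the static-variable problem, here is a hedged sketch - the function and variable names are made up and are not the real ADAPT_XFCODE interface. If the filter state lives in statics, every DOF that calls the function clobbers the same storage; state indexed by DOF stays independent:

```c
#define NDOF 8

/* Broken pattern: one static accumulator shared by every caller,
 * standing in for the static variables inside ADAPT_XFCODE.
 * Two DOFs calling this function corrupt each other's state. */
static double shared_state;

double filter_shared(double input)
{
    shared_state += input;     /* all DOFs write the same variable */
    return shared_state;
}

/* Fixed pattern: state indexed by DOF, so one function can serve
 * all 8 DOFs without them interfering with each other. */
static double dof_state[NDOF];

double filter_per_dof(int dof, double input)
{
    dof_state[dof] += input;   /* each DOF has its own slot */
    return dof_state[dof];
}
```

With the shared version, feeding DOF 0 and then DOF 1 the same input gives DOF 1 a result contaminated by DOF 0's history; with the per-DOF version, the two stay independent.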
According to Suresh's LSC rack design I rearranged the input channels of the c1lsc model such that the analog signals and the ADC channels are nicely matched.
Also I updated the c1lsc model in the svn with a help from Joe. The picture below is a screen shot of the input channels in the model file after I edited it.
As part of this slow but systematic debugging, I am turning on the c1lsc model overnight to see if the model crashes return.
I noticed that all the models running on C1LSC had crashed when I came in earlier today. I restarted all of them by ssh-ing into C1LSC and running rtcds restart all. The models seem to be running fine now.