The daqd process was dying every minute or so when it couldn't write frames, and each crash wrote a 2.9G core dump over NFS (into /opt/rtcds/caltech/c1/target/fb/), which was slowing down the network.
The problem was /frames/ was 100% full.
Apparently, when we switched the fb over to Gentoo, we forgot to install crontab and a wiper script.
We will install crontab and get the wiper script installed.
We updated the vacuum control and monitor screens (C0VAC_MONITOR.adl and C0VAC_CONTROL.adl). We also updated the /cvs/cds/caltech/target/c1vac1/Vac.db file.
1) We changed the C1:Vac-TP1_lev channel to C1:Vac-TP1_ala, since it is now an alarm readback on the new turbo pump rather than an indication of levitation. The logic for printing the "X" was inverted: it used to be printed on a 1 (ok) status, and is now printed on a 0 (problem) status. All references within the Vac.db file to C1:Vac-TP1_lev were changed, and the medm screens are now labeled Alarm instead of Levitating.
2) We changed the text displayed by the CP1 channel (C1:Vac-CP1_mon in Vac.db) from "On" and "Off" to "Cold - On" and "Warm - OFF".
3) We restarted the c1vac1 front end as well as the framebuilder after these changes.
Having determined that Rana (the computer) was having too many issues with testing the new RAID array, due to the age of the system, we proceeded to test on fb40m.
We brought it down and up several times between 11 and noon. We eventually were able to daisy chain the old raid and the new raid so that fb40m sees both. At this time, the RAID arrays are still daisy chained, but the computer is setup to run on just the original raid, while the full 14 TB array is initialized (16 drives, 1 hot spare, RAID level 5 means 14 TB out of the 16 TB are actually available). We expect this to take a few hours, at which point we will copy the data from the old RAID to the new RAID (which I also expect to take several hours). In the meantime, operations should not be affected. If it is, contact one of us.
This afternoon the alignment script crashed after returning syntax errors. We found that the tpman wasn't running on the framebuilder because it had probably failed to get restarted in one of the several reboots executed in the morning by Alex and Jo.
Restarting the tpman was then sufficient for the alignment scripts to get back to work.
After poking around for a few minutes several facts became clear:
1) At least one GPIB interface has a hard ethernet connection (and does not currently go through the wireless).
2) The wireless on the laptop works fine, since it can connect to the router.
3) The rest of the martian network cannot talk to the router.
This led to me replugging the ethernet cord back into the wireless router, which at some point in the past had been unplugged. The computers now seem to be happy and can talk to each other.
Apparently the random file system failure on megatron was unrelated to the RFM card (or at least unrelated to the physical card itself; it's possible, however unlikely, that I did something while installing it).
We installed a new hard drive, with a duplicate copy of RTL and assorted code stolen from another computer. We still need to get the host name and a variety of little details straightened out, but it boots and can talk to the internet. For the moment though, megatron thinks its name is scipe11.
You still use ssh megatron.martian to log in though.
We installed the RFM card again, and saw the exact same error as before. "NMI EVENT!" and "System halted due to fatal NMI".
Alex has hypothesized that the interface card the actual RFM card plugs into (which provides the PCI-X connection) might be the wrong type, so he has gone back to Wilson house to look for a new interface card. If that doesn't work out, we'll need to acquire a new RFM card at some point.
After removing the RFM card, megatron booted up fine, and had no file system errors. So the previous failure was in fact coincidence.
Last night around 5pm or so, Alex had remotely logged in and made some fixes to megatron.
First, he changed the local name from scipe11 to megatron. There were no changes to the network; this was a purely local change. The name server running on linux1 is what provides the name-to-IP conversions, and scipe11 and megatron resolve to distinct IPs. Given that c1auxex wasn't reported to have any problems (and I didn't see any problems with it yesterday), this was not a source of conflict. It's possible that megatron could get confused while in that state, but it would not have affected anything outside its box.
Just to be extra secure, I've switched megatron's personal router over from a DMZ setup to only forwarding port 22. I have also disabled the dhcp server on the gateway router (22.214.171.124).
Second, he turned the mdp and mdc codes on. This should not have conflicted with c1omc.
This morning I came in and turned megatron back on around 9:30 and began trying to replicate the problems from last night between c1omc and megatron. I called Alex and we rebooted c1omc while megatron was on, but not running any code, and without any changes to the setup (routers, etc). We were able to burt restore. Then we turned the mdp, mdc and framebuilder codes on, and again rebooted c1omc, which appeared to burt restore as well (I restored from 3 am this morning, which looks reasonable to me).
Finally, I made the changes mentioned above to the router setups in the hope that this will prevent future problems but without being able to replicate the issue I'm not sure.
I called Alex this morning and explained the problems with megatron.
Turns out when he had been setting up megatron, he thought a startup script file, rc.local, was missing in the /etc directory, so he created it. However, the rc.local file in the /etc directory is normally just a link to the /etc/rc.d/rc.local file. So on startup (basically when we rebooted the machine yesterday), it was running an incorrect startup script file. The real rc.local includes the line:
/usr/bin/setup_shmem.rtl mdp mdc&
Hence the errors we were getting with shm_open(). We changed the file into a soft link and re-sourced the rc.local script, and mdp started right up. So we're back to where we were two nights ago (although we do have an RFM card in hand).
Update: The tst module wouldn't start, but after talking to Alex again, it seems that I need to add the module tst to the /usr/bin/setup_shmem.rtl mdp mdc& line in order for it to have a shared memory location setup for it. I have edited the file (/etc/rc.d/rc.local), adding tst at the end of the line. On reboot and running starttst, the code actually loads, although for the moment, I'm still getting blank white blocks on the medm screens.
Alex checked out the old rts (which he is no longer sure how to compile) from CVS to megatron, to the directory:
In /home/controls/cds/rts/src/include you can find the various h files used. Similarly, /fe has the c files.
In the h files, you can work out the memory offset by noting the primary offset in iscNetDsc40m.h
A line like suscomms.pCoilDriver.extData determines an offset to look for.
0x108000 (from suscomms)
Then pCoilDriver.extData[#] determines a further offset.
sizeof(extData) = 8240 (for the 40m - you need to watch the ifdefs, we were looking at the wrong structure for awhile, which was much smaller).
DSC_CD_PPY is the structure you need to look in to find the final offset to add to get any particular channel you want to look at.
The number for ETMX is 8, ETMY 9 (this is in extData), so the extData offset from 0x108000 for ETMY should be 9 * 8240. These numbers (i.e. 8 = ETMX, 9 = ETMY) can be found in losLinux.c in /home/controls/cds/rts/src/fe/40m/. There's a bunch of #ifdef and #endif blocks which define ETMX, ETMY, RMBS, ITM, etc. You're looking for the offset in those.
So for the ETMY LSC channel (which is a double) you add 0x108000 (a hex number) + (9 * 8240 + 24) (not in hex, so you need to convert) to get the final value of 0x11a1c8 (in hex).
A useful program to interact with the RFM network can be found on fb40m. If you log in and go to:
you can then run rfm2g_util, give it a 3, then type help.
You can use this to read data. Just type help read. We had played around with some offsets and various channels until we were sure we had the offsets right. For example, we fixed an offset into the ETMY LSC input, and saw the corresponding memory location change to that value. This utility may also be useful for when we do the RFM test to check the integrity of the ring, as there are some diagnostic options available inside it.
Alex looked at the channel definitions (as seen in tpchn_C1.par), and noticed the rmid was 0.
However, we had set the tst system in testpoint.par to C-node1 instead of C-node0. The final number in that and the rmid need to be equal. We have changed this, and the test points appear to be working now.
However, the confusing part is that in the tst model, the gds_node_id is set to 1. Apparently, the model starts counting at 1, while the code starts counting at 0, so when you edit the testpoint.par file by hand, you have to subtract one from whatever you set in the model.
In other news, Alex pointed me at a CDS_PARTS.mdl part, under filters, called "IIR FM with controls". It's a light green module with 2 inputs and 2 outputs. While the 2nd set of inputs and outputs look like they connect to ground, they should be interpreted by the RCG to do the right thing (although Alex wasn't positive it works). It's worth trying it and seeing if the 2nd output corresponds to a usable filter on/off switch to connect to the binary I/O to control the analog DW. However, I'm not sure it has the sophistication to wait for a zero crossing or anything like that; at the moment, it just looks like a simple on/off switch based on which filters are on/off.
Alex came over with a short RFM cable this morning. We used it to connect the RFM card in c1iscey to the RFM card in megatron.
Alex renamed startup.cmd in /cvs/cds/caltech/target/c1iscey/ to startup.cmd.sav, so it doesn't come up automatically. At the end we moved it back.
Alex used the vxworks command d to look at memory locations on c1iscey. Such as d 0xf0000000, which is the start of the rfm code location. So to look at 0x11a1c8 (lscPos) in the rfm memory, he typed "d 0xf011a1c8". After doing some poking around, we look at the raw tst front end code (in /home/controls/cds/advLigo/src/fe/tst), and realized it was trying to read doubles. The old rts code uses floats, so the code was reading incorrectly.
As a quick fix, we changed the code to floats for that part. They looked like:
etmy_lsc = filterModuleD(dsp_ptr, dspCoeff, ETMY_LSC,
    cdsPciModules.pci_rfm ? *((double *)(((void *)cdsPciModules.pci_rfm) + 0x11a1c8)) : 0.0, 0);
And we simply changed the double to float in each case. In addition, we changed the RCG scripts locally as well (if we do an update at some point, it'll get overwritten). The file we updated was /home/controls/cds/advLigo/src/epics/util/lib/RfmIO.pm
Line 57 and Line 84 were changed, with double replaced with float.
return "cdsPciModules.pci_rfm? *((float *)(((void *)cdsPciModules.pci_rfm[$card_num]) + $rfmAddressString)) : 0.0";
. " *((float *)(((char *)cdsPciModules.pci_rfm[$card_num]) + $rfmAddressString)) = $::fromExp;\n"
This fixed our ability to read the RFM card, which now can read the LSC POS channel, for example.
Unfortunately, when we were putting everything back the way it was with the RFM fibers and so forth, c1iscey started to get garbage (all the RFM memory locations were reading ffff). We eventually removed the VME board, removed the RFM card, looked at it, put the RFM card back in a different slot on the board, and returned c1iscey to the rack. After this it started working properly. It's possible that in all the plugging and unplugging the card had somehow become loose.
The next step is to add all the channels that need to be read into the .mdl file, as well as testing and adding the channels which need to be written.
Alex added a new module to the RCG for generating RFMIO using floats. This has been committed to CVS.
Turns out the CDSO32 part (representing the Contec BO-32L-PE binary output) requires two inputs: one for the first 16 bits, and one for the second set of 16 bits. So Alex added another input to the part in the library. It's still a bit strange, as it seems that In1 represents the second set of 16 bits, and In2 represents the first set of 16 bits.
I added two sliders on the CustomAdls/C1TST_ETMY.adl control screen (upper left), along with a bit readout display, which shows the bitwise AND of the two slider channels. For the moment, I still can't see any output voltage on any of the DO pins, no matter what output I set.
A few machines have still not been changed over, including a few laptops, mafalda, ottavia, and c0rga.
All the front ends have been changed over.
fb40m died during a reboot and was replaced with a spare Sun blade 1000 that Larry had. We had to swap in our old hard drive and memory.
All the front ends, belladonna, aldabella, and the control room machines have been switched over. Nodus was changed over after we realized we hosed the elog and svn by switching linux1's IP.
At this point, 90% of the machines seem to be working, although c0daqawg seems to be having some issues with its startup.cmd code.
I received an e-mail from Alex indicating he found the testpoint problem and fixed it today:
Quote from Alex: "After we swapped the frame builder computer it has reconfigured all device files and I needed to create some symlinks on /dev/ to make tpman work again. I test the testpoints and they do work now."
The problems were with the new models, which use the new shared memory/Dolphin/RFM connections defined as names in a single .ipc file.
The first problem is that the no_oversampling flag should not be used. Since we have a single IO processor handling the ADCs and DACs at 64k while the models run at 16k, there is some oversampling occurring. This flag was causing problems syncing between the models and the IOP.
It also didn't help I had a typo in two channels which I happened to use as a test case to confirm they were talking. However, that has been fixed.
A new webview of the LSP model is available at:
This model includes a couple of example noise generators as well as the new Matrix of Filter banks (5 inputs x 15 outputs = 75 filters!). The attached png shows where these parts can be found in the CDS_PARTS library. I'm still working on the automatic generation of the matrix and filter bank medm screens for this part. The plan is to have a matrix screen similar to the current ones, except that the value entry points to the gain setting of the associated filter. In addition, underneath each value, there will be a link to the full filter bank screen. Ideally, I'd like to have the filter adl files located in a sub-directory of the system to keep clutter down.
I've cut and pasted the new Foton file generated by the LSP model below. The first number following the MTRX is the input the filter takes data from, and the second number is the output it's pushing data to. This means for the script parsing Valera's transfer functions, I need to input which channel corresponds to which number, such as DARM = 0, MICH = 1, etc. So the next step is to write this script and populate the filter banks in this file.
# FILTERS FOR ONLINE SYSTEM
# Computer generated file: DO NOT EDIT
# MODULES DOF2PD_AS11I DOF2PD_AS11Q DOF2PD_AS55I DOF2PD_AS55Q
# MODULES DOF2PD_ASDC DOF2PD_POP11I DOF2PD_POP11Q DOF2PD_POP55I
# MODULES DOF2PD_POP55Q DOF2PD_POPDC DOF2PD_REFL11I DOF2PD_REFL11Q
# MODULES DOF2PD_REFL55I DOF2PD_REFL55Q DOF2PD_REFLDC Mirror2DOF_f2x1
# MODULES Mirror2DOF_f2x2 Mirror2DOF_f2x3 Mirror2DOF_f2x4 Mirror2DOF_f2x5
# MODULES Mirror2DOF_f2x6 Mirror2DOF_f2x7 DOF2PD_MTRX_0_0 DOF2PD_MTRX_0_1
# MODULES DOF2PD_MTRX_0_2 DOF2PD_MTRX_0_3 DOF2PD_MTRX_0_4 DOF2PD_MTRX_0_5
# MODULES DOF2PD_MTRX_0_6 DOF2PD_MTRX_0_7 DOF2PD_MTRX_0_8 DOF2PD_MTRX_0_9
# MODULES DOF2PD_MTRX_0_10 DOF2PD_MTRX_0_11 DOF2PD_MTRX_0_12 DOF2PD_MTRX_0_13
# MODULES DOF2PD_MTRX_0_14 DOF2PD_MTRX_1_0 DOF2PD_MTRX_1_1 DOF2PD_MTRX_1_2
# MODULES DOF2PD_MTRX_1_3 DOF2PD_MTRX_1_4 DOF2PD_MTRX_1_5 DOF2PD_MTRX_1_6
# MODULES DOF2PD_MTRX_1_7 DOF2PD_MTRX_1_8 DOF2PD_MTRX_1_9 DOF2PD_MTRX_1_10
# MODULES DOF2PD_MTRX_1_11 DOF2PD_MTRX_1_12 DOF2PD_MTRX_1_13 DOF2PD_MTRX_1_14
# MODULES DOF2PD_MTRX_2_0 DOF2PD_MTRX_2_1 DOF2PD_MTRX_2_2 DOF2PD_MTRX_2_3
# MODULES DOF2PD_MTRX_2_4 DOF2PD_MTRX_2_5 DOF2PD_MTRX_2_6 DOF2PD_MTRX_2_7
# MODULES DOF2PD_MTRX_2_8 DOF2PD_MTRX_2_9 DOF2PD_MTRX_2_10 DOF2PD_MTRX_2_11
# MODULES DOF2PD_MTRX_2_12 DOF2PD_MTRX_2_13 DOF2PD_MTRX_2_14 DOF2PD_MTRX_3_0
# MODULES DOF2PD_MTRX_3_1 DOF2PD_MTRX_3_2 DOF2PD_MTRX_3_3 DOF2PD_MTRX_3_4
# MODULES DOF2PD_MTRX_3_5 DOF2PD_MTRX_3_6 DOF2PD_MTRX_3_7 DOF2PD_MTRX_3_8
# MODULES DOF2PD_MTRX_3_9 DOF2PD_MTRX_3_10 DOF2PD_MTRX_3_11 DOF2PD_MTRX_3_12
# MODULES DOF2PD_MTRX_3_13 DOF2PD_MTRX_3_14 DOF2PD_MTRX_4_0 DOF2PD_MTRX_4_1
# MODULES DOF2PD_MTRX_4_2 DOF2PD_MTRX_4_3 DOF2PD_MTRX_4_4 DOF2PD_MTRX_4_5
# MODULES DOF2PD_MTRX_4_6 DOF2PD_MTRX_4_7 DOF2PD_MTRX_4_8 DOF2PD_MTRX_4_9
# MODULES DOF2PD_MTRX_4_10 DOF2PD_MTRX_4_11 DOF2PD_MTRX_4_12 DOF2PD_MTRX_4_13
# MODULES DOF2PD_MTRX_4_14
As noted previously, we were having problems with getting multiple 32 channel binary output cards working. Alex came by and we eventually tracked the problem down to an incorrect counter in the c code. This has been fixed and checked into the CDS svn repository. I tested the actual hardware and we are in fact able to turn our test LEDs on with multiple binary output boards.
Alex and I also looked at the non-functional IO chassis (the one which wouldn't sync with the 1PPS signal and wasn't turning on when the computer turned on). We discovered one corner of the Trenton board wasn't screwed down and was in fact slightly warped. I screwed it down properly, straightening the board out in the process. After this, the IO chassis worked with a host interface board to the computer and started properly. We were also able to see the attached boards with lspci. So that chassis looks to be in working condition now.
Onwards to the RFM test.
Alex came over this morning and we began work on the frame builder change over. This required fb40m be brought down and disconnected from the RAID array, so the frame builder is not available.
He brought a Netgear switch which we've installed at the top of the 1X7 rack. This will eventually be connected, via Cat 6 cable, to all the front ends. It is connected to the new fb machine via a 10G fiber.
Alex has gone back to Downs to pick up a Symmetricom card for getting timing information into the frame builder. He will also be bringing back a hard drive with the necessary framebuilder software to be copied onto the new fb machine.
He said he'd like to also put a Gentoo boot server on the machine. This boot server will not affect anything at the moment, but it's apparently the style the sites are moving towards: a single boot server with diskless front end computers running Gentoo. For the moment we are sticking with our current CentOS real-time kernel (which is still compatible with the new frame builder code), but this would make a switch over to the new system possible in the future.
At the moment, the RAID array is doing a file system check, and is going slowly while it checks terabytes of data. We will continue work after lunch.
Punchline: things still don't work.
This is being recorded for posterity so we know where to look for the old controls settings.
The last good burt restore that was saved before turning off scipe25 aka c1dcuepics was on September 29, 11:07.
The CentOS 5.5 compiled gds code is currently living on rosalba in the /opt/apps directory (this is local to rosalba only). It has not been fully compiled properly yet; it is still missing ezcaread/write and so forth. Once we have fully working code, we'll propagate it to the correct directories on linux1.
So to have a working dtt session with the new front ends, log into rosalba and source gds-env.bash in /opt/apps (you need to be in bash for this to work; Alex has not made a tcsh environment script yet). This will let you get test points and make transfer function measurements, for example.
Also, to build the latest awgtpman, go to fb, cd to /opt/rtcds/caltech/c1/core/advLigoRTS/src/gds, and type make. This has been done and is mentioned just for reference.
The awgtpman along with the front end models should startup automatically on reboot of c1sus (courtesy of the /etc/rc.local file).
There currently seems to be a timing issue with the frame builder. We switched over to using a Symmetricom card to get an IRIG-B signal into the fb machine, but the GPS time stamp is way off (~80 years, Alex said).
If there is a frame builder issue, it's currently often necessary to kill the associated mx_stream processes, since they don't seem to restart gracefully. To fix it, the following steps should be taken:
Kill the frame builder, kill the two mx_stream processes, then run /etc/restart_streams/, then restart the frame builder (the usual daqd -c ./daqdrc >& ./daqd.log in /opt/rtcds/caltech/c1/target/fb).
To restart (or start after a boot) the nds server, you need to go to /opt/rtcds/caltech/c1/target/fb and type
At this time, testpoints are kind of working, but timing issues seem to be preventing useful work being done with it. I'm leaving with Alex working on the code.
Alex is installing the newly compiled gds code (compiled on Centos 5.5 on Rosalba) which does in fact include the ezca type tools.
At the moment we don't have a Solaris compile, although that should be done at some point in the future. It means the gds tools (diaggui, foton, etc.) won't work on op440m. On the bright side, this newer gds code has a foton that doesn't seem to crash all the time on Linux.
1) Need to check 1 PPS signal alignment
2) Figure out why the 1PPS and ADC/DAC testpoints went away from feCodeGen.pl
3) Fix the 1PPS testpoint giving NaN data
4) Figure out why daqd is printing "making gps time correction" twice
5) Need to investigate why mx_streams are still getting stuck
6) Epics channels should not go out on 114 network (seen messages when doing
7) Dataviewer leaves test points hanging, daqd does not deallocate them
(net_Writer.c shutdown_netwriter call)
8) Need to install wiper scripts on fb
9) Need to install newer kernel on fb to avoid loading myrinet firmware
(avoid boot delay)
Fb is now once again actually recording trends.
A section of the daqdrc file (located in /opt/rtcds/caltech/c1/target/fb/ directory) had been commented out by Alex and never uncommented. This section included the commands which actually make the fb record trends.
The section now reads as:
# comment out this block to stop saving data
#start frame-writer "126.96.36.199" broadcast="188.8.131.52" all;
Apparently when updating front end codes from rtlinux to the patched Gentoo, certain files don't get deleted when running make clean, such as the sysfe.rtl files in the advLigoRTS/src/fe/sys directories. This fouls the startup scripts by making them think the system should be configured for rtlinux rather than the Gentoo kernel module.
Interesting information from Alex: we're limited to 2 megabytes per second per front end model. Assuming all your channels are running at a 2 kHz rate, we can have at most 256 channels being sent to the frame builder from the front end (assuming 4-byte data). We're fine for the moment, but it's perhaps useful to keep in mind.
I talked to Alex this morning and he said the frame builder is being flaky (it crashed on us twice this morning, but the third time seemed to stay up when requesting data). I've added a new wiki page called "New Computer Restart Procedures" under Computers and Scripts, found here. It includes all the codes that need to be running, and also a start order if things seem to be in a particularly bad state. Unfortunately, there were no fixes done to the frame builder, but it is on Alex's list of things to do.
In regards to the timing out of the front ends, Alex came over to the 40m this morning and we sat down debugging. We tried several things, such as removing all filters from the C1MCS.txt file in the chans directory, and watching the timing as we pressed various medm control buttons. We traced it to a filter used by the DAC in the model talking to the IOP front end, which actually sends the data to the physical DAC card. The filter is used when converting between sample rates, in this case between the 16 kHz of the front end model and the 64 kHz of the IOP. Sending it raw zeros after having had real data seemed to cause this filter to eat up an unusually large amount of CPU time.
We modified the /opt/rtcds/caltech/c1/core/advLigoRTS/src/include/drv/fm10Gen.c file.
We reverted a change that was done between version 908 and 929, where underflows (really small numbers) were dealt with by adding and then subtracting a very small number. We left the adding and subtracting, but also restored the hard limits on the history.
So instead of relying on just:
input += 1e-16;
junk = input;
input -= 1e-16;
we also now use
if((new_hist < 1e-20) && (new_hist > -1e-20)) new_hist = new_hist<0 ? -1e-20: 1e-20;
Thus any filter value whose absolute value is less than 1e-20 will be clamped to -1e-20 or 1e-20. On the bright side, we no longer crash the front ends when we turn something off.
It looks as though we may have two IO chassis with bad timing cards.
Symptoms are as follows:
We can get our front end models writing data and timestamps out on the RFM network.
However, they get rejected on the receiving end because the timestamps don't match up with the receiving front end's timestamp. Once started, the system is consistently off by the same amount. Stopping the front end module on c1ioo and restarting it generated a new consistent offset: say, off by 29,000 cycles in the first case, then 11,000 cycles off on restart. Essentially, on start up, the IOP isn't using the 1PPS signal to determine when to start counting.
We tried swapping the spare IO chassis (intended for the LSC) in ....
# Joe will finish this in 3 days.
# Basically, in conclusion, we found that the c1ioo IO chassis is bad.
Diagnostic test tools were starting with errors.
After the reboot of the frame builder machine yesterday by Alex, the diagconfd daemon was not getting started by xinetd. There was a sequence error in the startup: xinetd was being called before the drives from linux1 were mounted.
If you do not see the "nds" line you would not have diagnostic tests enabled in the DTT:
[controls@rosalba apps]$ diag -i | grep nds
nds * * 192.168.113.202 8088 * 192.168.113.202
Alex changed /etc/xinetd.d/diagconfd file to point to /opt/apps/gds/bin/diagconfd instead of /opt/apps/bin/diagconf. He also ensured xinetd started after mounting from linux1.
My feeling is we should get rid of this feature and have an NDS address entry box in the "Online" tab in the DTT, with the default "nds". I mentioned this to Jim Batch and he agreed with me, so maybe he is going to implement this. So maybe you guys want to request the same thing too; send the request to Rolf and Jim, so we can have the last daemon exorcised.
The 40m computers were responding sluggishly yesterday, to the point of being unusable.
The mx_stream code running on c1iscex (the X end suspension control computer) went crazy for some reason. It was constantly writing to a log file in /cvs/cds/rtcds/caltech/c1/target/fb/192.168.113.80.log. In the past 24 hours this file had grown to approximately 1 TB in size. The computer had been turned back on yesterday after having reconnected its IO chassis, which had been moved around last week for testing purposes - specifically plugging the c1ioo IO chassis into it to confirm it had timing problems.
The mx_stream code was killed on c1iscex and the 1 TB file removed.
Computers are now more usable.
We still need to investigate exactly what caused the code to start writing to the log file non-stop.
Alex believes this was due to a missing entry in the /diskless/root/etc/hosts file on the fb machine. It didn't list the IP and hostname for the c1iscex machine. I have now added it. c1iscex had been added to the /etc/dhcp/dhcpd.conf file on fb, which is why it was able to boot at all in the first place. With the addition of the automatic start up of mx_streams in the past week by Alex, the code started, but without the correct ip address in the hosts file, it was getting confused about where it was running and constantly writing errors.
When adding a new FE machine, add its IP address and its hostname to the /diskless/root/etc/hosts file on the fb machine.
We couldn't set testpoints on the c1ioo machine.
Awgtpman was getting into some strange race condition. Alex added an additional sleep statement and shifted boot order slightly in the rc.local file. This apparently is only a problem on c1ioo, which is a Sun X4600. It was using up 100% of a single CPU for the awgtpman process.
We now have c1ioo test points working.
Need to examine the startc1ioo script and see if needs a similar modification, as that was tried at several points but yielded a similar state of non-functioning test points. For the moment, reboot of c1ioo is probably the best choice after making modifications to the c1ioo or c1x03 models.
Current CDS status:
The front ends seem to have different gps timestamps on the data than the frame builder has when receiving them.
One theory is that we have been doing SVN checkouts of the front end code every week or two, but the frame builder has not been rebuilt for about a month.
Alex is currently rebuilding the frame builder with the latest code changes.
It also suggests I should try rebuilding the frame builder on a semi-regular basis as updates come in.
On the fb machine in /etc/dis/ there are several configurations files that need to be set for our dolphin network.
First, we modify networkmanager.conf.
We set "-dimensionX 2;" and leave the dimensionY and dimensionZ as 0. If we had 3 machines on a single router, we'd set X to 3, and so forth.
We then modify dishosts.conf.
We add an entry for each machine that looks like:
#Keyword name nodeid adapter link_width
ADAPTER: c1sus_a0 4 0 4
The nodeids (the first number after the name) increment by 4 each time, so c1lsc is:
ADAPTER: c1lsc_a0 8 0 4
The file cluster.conf is automatically updated by the code by parsing the dishosts.conf and networkmanager.conf files.
We uncommented the following lines in the rc.local file in /diskless/root/etc on the fb machine:
# Initialize Dolphin
# Have to set it first to node 4 with dxconfig or dis_nodemgr fails. Unexplained.
/opt/DIS/sbin/dxconfig -c 1 -a 0 -slw 4 -n 4
/opt/DIS/sbin/dis_nodemgr -basedir /opt/DIS
For the moment we left the following lines commented out:
# Wait for Dolphin to initialize on all nodes
We were unsure of the effect of the dolphin_wait script on the front ends without Dolphin cards. It looks like the script it calls waits until there are no dead nodes.
In /etc/conf.d/ on the fb machine we modified the local.start file by uncommenting:
This starts the Dolphin network manager on the fb machine. The fb machine is not using a Dolphin connection, but controls the front end Dolphin connections via ethernet.
The Dolphin network manager can be interacted with by using the dxadmin program (located in /opt/DIS/sbin/ on the fb machine). This is a GUI program so use ssh -X when logging into the fb before use.
Each IOP model (c1x02, c1x04) that runs on a machine using the Dolphin RFM cards needs to have the flag pciRfm=1 set in the configuration box (usually located in the upper left of the model in Simulink). Similarly, the models actually making use of the Dolphin connections should have it set as well. Use the PCIE_SignalName parts from IO_PARTS in the CDS_PARTS.mdl file to send and receive communications via the Dolphin RFM.
The dolphin RFM was not sending data between c1lsc and c1sus.
Dig into the controller.c code located in /opt/rtcds/caltech/c1/core/advLigoRTS/src/fe/. Find this bit of code on line 2173:
2173 #ifdef DOLPHIN_TEST
2174 #ifdef X1X14_CODE
2175 static const target_node = 8; //DIS_TARGET_NODE;
2176 #else
2177 static const target_node = 12; //DIS_TARGET_NODE;
2178 #endif
2179 status = init_dolphin(target_node);
Replace it with this bit of code:
2173 #ifdef DOLPHIN_TEST
2174 #ifdef C1X02_CODE
2175 static const target_node = 8; //DIS_TARGET_NODE;
2176 #else
2177 static const target_node = 4; //DIS_TARGET_NODE;
2178 #endif
2179 status = init_dolphin(target_node);
Basically, this was hard-coded for use on the test stands at the sites. When starting up, each Dolphin adapter looks for a target node to talk to, which cannot be itself. So all the Dolphin adapters would normally try to talk to target_node 12, except the X1X14 front end code, which happened to be the one with Dolphin node ID 12; it would try to talk to node 8 instead.
Unfortunately, in our setup we only have nodes 4 and 8, so both of our codes were trying to talk to a nonexistent node 12. The new code has everyone talk to node 4, except the c1x02 process, which talks to node 8 (since it is node 4 and can't talk to itself).
I'm told this hard-coded logic is going away in the next revision.
Apparently, the only models that should have pciRfm=1 are the IOP models with a Dolphin connection. Front end models that are not IOP models (like c1lsc and c1rfm) should not have this flag set; otherwise they include the Dolphin drivers, which causes them and the IOP to refuse to unload under rmmod.
So: pciRfm=1 only in IOP models using Dolphin; everyone else should omit it or set pciRfm=-1.
The orientation of the Dolphin cards seems to be opposite on c1lsc and c1sus: the wide part is on top on c1lsc and on the bottom on c1sus. This means the cable is plugged into the left Dolphin port on c1lsc and into the right Dolphin port on c1sus. Otherwise you get a weird state where you receive but do not transmit.
Even after bringing up the fb40m, I was unable to get the front ends to come up, as they would error out with an RFM problem.
We proceeded to reboot everything I could get my hands on, although it's likely daqawg and daqctrl were the issue: on the C0DAQ_DETAIL screen their status had been showing as 0xbad, but after the reboot it showed as 0x0. They had originally come up before the frame builder had been fixed, which might have been the culprit. In the course of rebooting, I also found that c1omc and c1lsc had been turned off, and turned them back on.
After this set of reboots, we're now able to bring the front ends up one by one.
As noted by Koji, Alex and Rolf stopped by.
We discussed the feasibility of having multiple models use the same DAC, and decided that we in fact did need it (e.g., 8 optics through 3 DACs does not divide nicely). We went about changing the controller.c file to handle that case gracefully: if a particular model sharing a DAC goes down, it now writes a 0 to the channel rather than repeating the last output.
In a separate issue, we found that when skipping DACs in a model (say, using DACs 1 and 2 only) there was a miscommunication to the IOP, resulting in the wrong DACs getting the data. The temporary solution is to have all DACs in each model, even if they are not used. This will eventually be fixed in the code.
At this point, we *seem* to be able to control and damp optics. Look for a elog from Yuta confirming or denying this later tonight (or maybe tomorrow).
The framebuilder was being flaky: mx_streams would go down, preventing testpoints from working, and so forth.
Send Alex up North for a week to fix the code.
Alex came back and installed updates to the frame builder and the mx_streams code (Myrinet Express over Generic Ethernet Hardware) used by the front ends to talk to the frame builder. Instead of 1 stream per model, there's now just 1 per front end handling all communications.
Alex did an SVN update and we now have the latest CDS code.
Self restarting codes:
The frame builder code (daqd) and the NDS pipe have been added to the fb machine's inittab. Specifically, inittab calls the scripts /opt/rtcds/caltech/c1/target/fb/start_daqd.inittab and /opt/rtcds/caltech/c1/target/fb/start_nds.inittab.
The addition to the /etc/inittab file on fb is:
When these codes die they should automatically restart.
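The inittab lines themselves aren't reproduced above. For reference, a respawning sysvinit inittab entry takes the form id:runlevels:action:process, so the additions would have looked something like the following (the IDs and runlevels here are hypothetical; only the script paths are from this entry):

```
daqd:345:respawn:/opt/rtcds/caltech/c1/target/fb/start_daqd.inittab
nds0:345:respawn:/opt/rtcds/caltech/c1/target/fb/start_nds.inittab
```

With the respawn action, init relaunches the process whenever it exits, which is what gives the automatic-restart behavior.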
Self starting codes at boot up:
The front ends now start the mx_stream script (which lives in the /opt/rtcds/caltech/c1/target/fb/ directory) at boot up. They call it with the appropriate command line options for that front end. It can be found in the /etc/rc.local file.
They look like: mx_stream -s "c1x02 c1sus c1mcs c1rms" -d fb:0
As always, the front end codes to be started are defined in the /etc/rtsystab file (or on fb, in the /diskless/root/etc/rtsystab file).
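Assuming rtsystab follows a host-per-line format (hostname followed by that host's models — an assumption based on the mx_stream example above), one host's model list can be pulled out like this, demonstrated on a throwaway copy rather than the real /etc/rtsystab:

```shell
#!/bin/sh
# Print the model list for one host from an rtsystab-style file.
# Format assumption: "<host> <model> <model> ...". Throwaway copy used here.
TAB=$(mktemp)
cat > "$TAB" <<'EOF'
c1sus c1x02 c1sus c1mcs c1rms
c1lsc c1x04 c1lsc
EOF

models_for() {
    # strip the leading hostname field, print the rest of the matching line
    awk -v h="$1" '$1 == h { $1 = ""; sub(/^ /, ""); print }' "$2"
}

models_for c1sus "$TAB"
```

For the c1sus line above this prints the same model list that mx_stream is launched with, which is a quick way to check that rc.local and rtsystab agree.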
However, if mx_stream does go down you need to restart it manually, although it seems more robust now and no longer goes down every time we restart the frame builder.
All the usual front end IOCs and modules should be started and loaded on boot up as well.
We had timing problems across the front ends.
We noticed that the 1PPS reference was not blinking on the Master Timing Distribution box. It was supposed to be getting a signal from the c0dcu1 VME crate computer, but this was not happening.
We disconnected the timing signal going into c0dcu1, coming from c0daqctrl, and connected the 1PPS directly from c0daqctrl to the Ref In for the Master Timing distribution box (blue box with lots of fibers coming out of it in 1X5).
We now have agreement in timing between front ends.
After several reboots we now have working RFM again, along with computers that agree with the frame builder on the current GPS time.
RFM is back and testpoints should be happy.
We still don't have a working binary output for the X end. I may need to get a replacement backplane with more than 4 slots if the 1st slot of this board has the same problem as the large boards.
I have burt restored the c1ioo, c1mcs, c1rms, c1sus, and c1scx processes, and optics look to be damped.
Alex and Rolf came over today with a Tempus LX GPS network timing server. This has an IRIG-B output and a 1PPS output. It can also be setup to act as an NTP server (although we did not set that up).
This was placed at waist height in the 1X7 rack. We took the cable running to the presumably roof mounted antenna from the VME timing board and connected it to this new timing server. We also moved the source of the 1PPS signal going to the master timer sequencer (big blue box in 1X7 with fibers going to all the front ends) to this new time server. This system is currently working, although it took about 5 minutes to actually acquire a timing signal from the GPS satellites. Alex says this system should be more stable, with no time jumps.
I asked Rolf about the new timing system for the front ends, he had no idea when that hardware would be available to the 40m.
Currently, all the front ends and the frame builder agree on the time. Front ends are running so the 1 PPS signal appears to be working as well.
While Alex and Rolf were visiting, I pointed out that the Dolphin card was not sending any data, not even a time stamp, from the c1lsc machine.
After some poking around, we realized the IOP (input/output processor) was coming up before the Dolphin driver had even finished loading.
We uncommented the Dolphin-wait line in the /diskless/root/etc/rc.local file on the frame builder. This waits until the Dolphin module is fully loaded, so that a correct pointer to the memory location that the Dolphin card reads and writes can be handed off. Previously, the IOP had been receiving a bad pointer because the Dolphin driver had not finished loading.
So now the c1lsc machine can communicate with c1sus via Dolphin, and from there with the rest of the network via the traditional GE Fanuc RFM.
From what I understand, Alex rewrote portions of the framebuilder and testpoint codes and then recompiled them in order to get more than 1 testpoint per front end working. I've tested up to 5 testpoints at once so far, and it worked.
We also have a new noise component added to the RCG code. This piece of code uses the random number generator from chapter 7.1 of Numerical Recipes, Third Edition to generate uniform numbers from 0 to 1. Placing a filter bank after it should give us sufficient flexibility in generating the necessary noise types. We did a coherence test between two instances of this noise piece, and they looked pretty incoherent. Valera will add a picture to this elog when it finishes 1000 averages.
I'm in the process of propagating the old suspension control filters to the new RCG filter banks to give us a starting point. Tomorrow Valera and I are planning to choose a subset of the plant filters and put them in, and then work out some initial control filters to correspond to the plant. I also need to think about adding the anti-aliasing filters and whitening/dewhitening filters.
After the crane training, Bob attached speakers to the ceiling right next to the projector, for use with presentations.
[Joe, Jamie, Alex]
I asked Alex which cron to use (dcron? frcron?). He promptly did the following:
rc-update add dcron default
Copied the wiper.pl script from LLO to /opt/rtcds/caltech/c1/target/fb/
At that point, I modified the wiper.pl script to use a 95% threshold instead of 99.7%.
I added controls to the cron group on fb:
sudo gpasswd -a controls cron
I then added wiper.pl to the crontab (via crontab -e) as the following line:
0 6 * * * /opt/rtcds/caltech/c1/target/fb/wiper.pl --delete &> /opt/rtcds/caltech/c1/target/fb/wiper.log
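The wiper's core job is a disk-usage threshold check on /frames. A minimal sketch of that kind of check (this is not the actual wiper.pl logic; GNU df is assumed):

```shell
#!/bin/sh
# A minimal sketch (not the actual wiper.pl logic) of the threshold check it
# performs: is a filesystem more than N percent full? Assumes GNU df.

usage_pct() {
    # print the integer used-space percentage of the filesystem holding $1
    df --output=pcent "$1" | tail -n 1 | tr -dc '0-9'
}

over_threshold() {
    # succeed if usage of the filesystem holding $1 exceeds $2 percent
    [ "$(usage_pct "$1")" -gt "$2" ]
}

# e.g.: over_threshold /frames 95 && echo "wiper should delete oldest frames"
```

Run daily from cron, a check like this is what keeps /frames from hitting 100% and killing daqd again.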
Note: placing backups on the /frames RAID array will break this script, because it compares the amount of data in /frames/full, /frames/trends/minutes, and /frames/trends/seconds against the total capacity.
Apparently, we had backups from September 27th, 2010 and March 22nd, 2011. These would have broken the script in any case.
We are currently removing these backups, as they are redundant data, and we have rsync'd backups of the frames and trends. We should now have approximately twice the lookback of full frames.
We are now using the LIGO CDS SVN for storing our control models.
The SVN is at:
The models are under cds_user_apps, then trunk, then the appropriate subsystem (ISC for c1lsc, for example), then c1 (for the Caltech 40m), then models.
We have checked out cds_user_apps to /opt/rtcds/.
So to find the c1lsc.mdl model, you would go to /opt/rtcds/cds_user_apps/trunk/ISC/c1/models/c1lsc.mdl
This SVN is shared by many people in LIGO, so please follow good SVN practice. Remember to update models ("svn update") before doing commits. Also, after making changes, please commit them to the SVN so we have a record of the changes.
We are creating soft links in /opt/rtcds/caltech/c1/core/advLigoRTS/src/epics/simLink/ to the models that need to be built. So if you want to add a new model, please add it to the cds_user_apps SVN in the correct place and create a soft link in the simLink directory.
lrwxrwxrwx 1 controls controls 54 Apr 28 14:41 c1sus.mdl -> /opt/rtcds/cds_user_apps/trunk/SUS/c1/models/c1sus.mdl
lrwxrwxrwx 1 controls controls 54 Apr 28 14:41 c1sup.mdl -> /opt/rtcds/cds_user_apps/trunk/SUS/c1/models/c1sup.mdl
lrwxrwxrwx 1 controls controls 54 Apr 28 14:41 c1spy.mdl -> /opt/rtcds/cds_user_apps/trunk/SUS/c1/models/c1spy.mdl
lrwxrwxrwx 1 controls controls 54 Apr 28 14:41 c1spx.mdl -> /opt/rtcds/cds_user_apps/trunk/SUS/c1/models/c1spx.mdl
lrwxrwxrwx 1 controls controls 54 Apr 28 14:41 c1scy.mdl -> /opt/rtcds/cds_user_apps/trunk/SUS/c1/models/c1scy.mdl
lrwxrwxrwx 1 controls controls 54 Apr 28 14:41 c1scx.mdl -> /opt/rtcds/cds_user_apps/trunk/SUS/c1/models/c1scx.mdl
lrwxrwxrwx 1 controls controls 54 Apr 28 14:41 c1mcs.mdl -> /opt/rtcds/cds_user_apps/trunk/SUS/c1/models/c1mcs.mdl
lrwxrwxrwx 1 controls controls 54 Apr 28 14:41 c1x05.mdl -> /opt/rtcds/cds_user_apps/trunk/CDS/c1/models/c1x05.mdl
lrwxrwxrwx 1 controls controls 54 Apr 28 14:41 c1x04.mdl -> /opt/rtcds/cds_user_apps/trunk/CDS/c1/models/c1x04.mdl
lrwxrwxrwx 1 controls controls 54 Apr 28 14:41 c1x03.mdl -> /opt/rtcds/cds_user_apps/trunk/CDS/c1/models/c1x03.mdl
lrwxrwxrwx 1 controls controls 54 Apr 28 14:41 c1x02.mdl -> /opt/rtcds/cds_user_apps/trunk/CDS/c1/models/c1x02.mdl
lrwxrwxrwx 1 controls controls 54 Apr 28 14:41 c1x01.mdl -> /opt/rtcds/cds_user_apps/trunk/CDS/c1/models/c1x01.mdl
lrwxrwxrwx 1 controls controls 54 Apr 28 14:41 c1rfm.mdl -> /opt/rtcds/cds_user_apps/trunk/CDS/c1/models/c1rfm.mdl
lrwxrwxrwx 1 controls controls 55 Apr 28 14:41 c1dafi.mdl -> /opt/rtcds/cds_user_apps/trunk/CDS/c1/models/c1dafi.mdl
lrwxrwxrwx 1 controls controls 54 Apr 28 14:41 c1pem.mdl -> /opt/rtcds/cds_user_apps/trunk/ISC/c1/models/c1pem.mdl
lrwxrwxrwx 1 controls controls 54 Apr 28 14:41 c1mcp.mdl -> /opt/rtcds/cds_user_apps/trunk/ISC/c1/models/c1mcp.mdl
lrwxrwxrwx 1 controls controls 54 Apr 28 14:41 c1lsp.mdl -> /opt/rtcds/cds_user_apps/trunk/ISC/c1/models/c1lsp.mdl
lrwxrwxrwx 1 controls controls 54 Apr 28 14:41 c1lsc.mdl -> /opt/rtcds/cds_user_apps/trunk/ISC/c1/models/c1lsc.mdl
lrwxrwxrwx 1 controls controls 54 Apr 28 14:41 c1ioo.mdl -> /opt/rtcds/cds_user_apps/trunk/ISC/c1/models/c1ioo.mdl
lrwxrwxrwx 1 controls controls 54 Apr 28 14:41 c1gpv.mdl -> /opt/rtcds/cds_user_apps/trunk/ISC/c1/models/c1gpv.mdl
lrwxrwxrwx 1 controls controls 54 Apr 28 14:41 c1gfd.mdl -> /opt/rtcds/cds_user_apps/trunk/ISC/c1/models/c1gfd.mdl
lrwxrwxrwx 1 controls controls 54 Apr 28 14:41 c1gcv.mdl -> /opt/rtcds/cds_user_apps/trunk/ISC/c1/models/c1gcv.mdl
lrwxrwxrwx 1 controls controls 54 Apr 28 14:41 c1ass.mdl -> /opt/rtcds/cds_user_apps/trunk/ISC/c1/models/c1ass.mdl
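To add a hypothetical model c1new, the same convention applies. A sketch, demonstrated in a throwaway sandbox rather than the real /opt/rtcds tree:

```shell
#!/bin/sh
# Sketch of the simLink soft-link convention, demonstrated in a throwaway
# sandbox. Real paths: /opt/rtcds/cds_user_apps and
# /opt/rtcds/caltech/c1/core/advLigoRTS/src/epics/simLink;
# c1new is a hypothetical model name.
set -e
SANDBOX=$(mktemp -d)
APPS=$SANDBOX/cds_user_apps
SIMLINK=$SANDBOX/simLink
mkdir -p "$APPS/trunk/ISC/c1/models" "$SIMLINK"

# the model, as committed to the cds_user_apps SVN
touch "$APPS/trunk/ISC/c1/models/c1new.mdl"

# the soft link that makes the model visible to the build system
ln -s "$APPS/trunk/ISC/c1/models/c1new.mdl" "$SIMLINK/c1new.mdl"

readlink "$SIMLINK/c1new.mdl"
```

In real use you would also "svn add" and commit the .mdl file to cds_user_apps first, so the SVN stays the single source of truth and the simLink directory holds only links.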
At around 9:45 the RFM/FB network alarm went off, and I found c1asc, c1lsc, and c1iovme not responding.
I went out to hard restart them, and also c1susvme1 and c1susvme2 after Jenne suggested that.
c1lsc seemed to have a promising comeback initially, but not really. I was able to ssh in and run the start command. The green light under c1asc on the RFMNETWORK status page lit, but the reset and CPU usage information is still white, as if it's not connected. If I load an LSC channel, say the PD5_DC monitor, as a testpoint in DTT it works fine, but the 16 Hz EPICS monitor version is dead. The fact that we were able to ssh into it means the network is at least somewhat working.
I had to reboot c1asc multiple times (3 in total), waiting a full minute on the last power cycle, before being able to telnet in. Once in, I restarted startup.cmd, which did set the DAQ-STATUS to green for c1asc, but it has the same lack of EPICS communication as c1lsc.
c1iovme was rebooted; I was able to telnet in and start startup.cmd. The status light went green, but there are still no EPICS updates.
The crate containing c1susvme1 and c1susvme2 was power cycled. We were able to ssh into c1susvme1 and restart it, and it came back fully: status light, CPU load, and channels all working. However, c1susvme2 was still having problems, so I power cycled the crate again. This time c1susvme2 came back, its status light lit green, and its channels started updating.
At this point, lacking any better ideas, I'm going to do a full reboot, cycling c1dcuepics and proceeding through the restart procedures.
Kiwamu and I went through and looked at the spare channels available near the PSL table and at the ends.
First, I noticed I need another 4 DB37 ADC adapter box, since there are 3 Pentek ADCs there, which I don't think Jay realized.
Anyway, in the IOO chassis that will be put in, the ADC has 8 spare channels, which come out in DB37 format. So one option is to build an 8-BNC converter that plugs into that box.
The other option is to build 4-pin Lemo connectors and go in through the Sander box, which currently goes to the 110B ADC and has some spare channels.
For the DAC at the PSL, the IOO chassis will have 8 spare DAC channels, since there is only 1 Pentek DAC. These come in an IDC40 cable format, since that's what the blue DAC adapter box takes. An 8-channel DAC box to 40-pin IDC would need to be built.
The ends have 8 spare DAC channels, again on 40-pin IDC cable. A box similar to the 8-channel DAC box for the PSL would need to be built.
The ends also have spare 4-pin Lemo capacity; it looked like there were 10 or so channels still unused, so Lemo connections would need to be made. There don't appear to be any spare DB37 connectors available on the adapter box, so Lemo via the Sander box is the only way.
Joe needs to provide Kiwamu with cabling pin outs.
If Kiwamu makes a couple of spares of the 8-BNC-to-DB37 connector boards, there is a spare DB37 ADC input in the SUS machine we could use, providing 8 more channels for test use.