Recently, according to Gautam, the NDS2 server on megatron has been dying roughly daily to weekly. The prescription is to restart the server.
Also, megatron is running Ubuntu 12!! Let's decide on a day to upgrade it to a Debian 18-ish... word from Rolf is that Scientific Linux is fading out everywhere, so Debian is the new operating system for all conformists.
I just now modified the /etc/rsyncd.conf file as per Dan Kozak's instructions. The old conf file is preserved in the same directory, with today's date appended to the filename.
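(For reference, the backup would have been made along these lines; the exact date-suffix format here is a guess:)
sudo cp /etc/rsyncd.conf /etc/rsyncd.conf.$(date +%Y%m%d)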
I then enabled the rsync daemon to run on boot using 'systemctl enable'. I'll ask Dan to start the file transfers again and see if this works.
controls@nodus|etc> sudo systemctl start rsyncd.service
controls@nodus|etc> sudo systemctl enable rsyncd.service
Created symlink from /etc/systemd/system/multi-user.target.wants/rsyncd.service to /usr/lib/systemd/system/rsyncd.service.
controls@nodus|etc> sudo systemctl status rsyncd.service
● rsyncd.service - fast remote file copy program daemon
Loaded: loaded (/usr/lib/systemd/system/rsyncd.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2020-04-13 16:49:12 PDT; 1min 28s ago
Main PID: 4950 (rsync)
└─4950 /usr/bin/rsync --daemon --no-detach
Apr 13 16:49:12 nodus.martian.113.168.192.in-addr.arpa systemd: Starting fast remote file copy program daemon...
Apr 13 16:49:12 nodus.martian.113.168.192.in-addr.arpa systemd: Started fast remote file copy program daemon.
Now that the old APC Smart-UPS 2200 is no longer in use by the vacuum system, I looked into whether it can be repurposed for the framebuilder machine. It can. The maximum power consumption of the framebuilder (a SunFire X4600) is 1.137 kW. With fresh batteries, I estimate this UPS can power the framebuilder for more than 10 minutes, and possibly as long as 30 minutes, depending on the exact load.
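As a rough cross-check of that estimate (assuming the RBC43 pack is 8 x 12 V / 5 Ah batteries and ~90% inverter efficiency; both are assumptions, not measurements):
stored energy ~ 8 * 12 V * 5 Ah = 480 Wh
runtime at max load ~ 0.9 * 480 Wh / 1137 W ~ 0.38 h ~ 23 min
which is consistent with the 10-30 minute range quoted above.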
@Chub/Jordan, this UPS is ready to be moved to rack 1X6/1X7. It just has to be disconnected from the wall outlet. All of the equipment it was previously powering has been moved to the new UPS. I have ordered a replacement battery (APC #RBC43) which is scheduled to arrive 9/09-11.
I heard a rumor about a DAQ problem at the 40m.
To investigate, I tried retrieving data from some channels under C1:SUS-AS1 on the c1sus2 front end. DQ channels worked fine, testpoint channels did not. This pointed to an issue involving the communication with awgtpman. However, AWG excitations did work. So the issue seemed to be specific to the communication between daqd and awgtpman.
daqd logs were complaining of an error in the tpRequest function: error code -3/couldn't create test point handle. (Confusingly, part of the error message was buffered somewhere, and would only print after a subsequent connection to daqd was made.) This message signifies some kind of failure in setting up the RPC connection to awgtpman. A further error string is available from the system to explain the cause of the failure, but daqd does not provide it. So we have to guess...
One of the reasons an RPC connection can fail is if the server name cannot be resolved. Indeed, address lookup for c1sus2 from fb1 was broken:
$ host c1sus2
Host c1sus2 not found: 3(NXDOMAIN)
In /etc/resolv.conf on fb1, the search domain was set wrong. Changing it to 'search martian' got address lookup on fb1 working:
$ host c1sus2
c1sus2.martian has address 192.168.113.87
But testpoints still could not be retrieved from c1sus2, even after a daqd restart.
In /etc/hosts on fb1, I found c1sus2 pinned to a stale hardcoded address. Changing it to the value returned by the nameserver (192.168.113.87) fixed the problem.
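The corrected line in /etc/hosts is just the standard format (the stale address it replaced is not recorded here):
192.168.113.87   c1sus2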
It might be even better to remove the hardcoded addresses of front ends from the hosts file, letting DNS function as the sole source of truth. But a full system restart should be performed after such a change, to ensure nothing else is broken by it. I leave that for another time.
[Anchal, Paco, JC]
Thanks Chris for the fix. We are able to access the testpoints now, but this morning we started facing another issue (the RFM errors discussed below); not sure how it is related to what you did.
We tried several steps to fix this, but none of them worked. Since for now we have the testpoints (C1:SUS-ETMX_TRX_OUT & C1:SUS-ETMY_TRY_OUT) to monitor the transmission levels, we are going ahead with our upgrade work without resolving this issue. Please let us know if you have any insights.
It looks like the RFM problem started a little after 2am on Saturday morning (attachment 1). It’s subsequent to what I did, but during a time of no apparent activity, either by me or others.
The pattern of errors on c1rfm (attachment 2) looks very much like this one previously reported by Gautam (errors on all IRFM0 ipcs). Maybe the fix described in Koji’s followup will work again (involving hard reboots).
The compiled version of seisBLRMS had been running ~2 weeks without crashing as of last night, when I killed it so it wouldn't interfere with alignment scripts. I added an EPICS channel C1:DMF-ENABLE, and updated the DMF executables to check this channel while running. So far it seems to work. When you're running alignment scripts, simply click the DISABLE button on the C1DMF_MASTER.adl screen, and then re-ENABLE when the scripts finish.
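The same toggle should also work from the command line with the EPICS CLI tools, for use inside the scripts themselves (assuming DISABLE/ENABLE map to 0/1; check the .adl screen if unsure):
caput C1:DMF-ENABLE 0    # disable before running alignment scripts
caput C1:DMF-ENABLE 1    # re-enable when finished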
It's not clear why this is necessary, though. Theories include: (1) the constant disk access keeps the framebuilder busy, reducing its ability to deal with ezcademod commands; or (2) the DMF programs flood the network with so much traffic that ezcademod-related packets run late and get ignored.
Also, for reasons of aesthetics, I changed the data delay from 6 minutes to 5 minutes. We'll see if that's enough.
> Also, for reasons of aesthetics, I changed the data delay from 6 minutes to 5 minutes. We'll see if that's enough.
5 minutes didn't work.
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = (unset),
LC_ALL = (unset),
LC_TIME = "en_US.ISO8859-1",
LC_MONETARY = "en_US.ISO8859-1",
LC_CTYPE = "en_US.ISO8859-1",
LC_COLLATE = "en_US.ISO8859-1",
LC_MESSAGES = "C",
LC_NUMERIC = "en_US.ISO8859-1",
LANG = "en_US.UTF-8"
are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
>> mcc -v -m -R -nojvm seisBLRMS.m
Warning: Duplicate directory name: /cvs/cds/caltech/apps/linux/matlab/toolbox/local.
Compiler version: 4.6 (R2007a)
Warning: an error occurred while parsing class FilterDesignDialog.AbstractEditor:
Undefined function or variable 'DAStudio.Object'.
> In /cvs/cds/caltech/apps/linux/matlab/toolbox/shared/filterdesignlib/@FilterDesignDialog/@CoeffEditor/schema.p>schema at 9
Warning: an error occurred while parsing class FilterDesignDialog.CoeffEditor:
Invalid superclass handle.
terminate called after throwing an instance of 'ApplicationRedefinedException*'
Abort (core dumped)
"/cvs/cds/caltech/apps/linux/matlab/bin/mcc" -E "/tmp/fileRnU5Qj_31324": Aborted
??? Error executing mcc, return status = 134.
The seisBLRMS has been running on megatron via an open terminal ssh'd in from allegra with matlab running, because I couldn't get the compiled matlab functionality to work.
Even so, this running script has been dying lately because of some bogus 'NDS' error. So for today I have set the NDS server for mDV on megatron to fb40m:8088 instead of nodus.ligo.caltech.edu. If this fixes the problem, I will make it permanent by putting in a case statement that checks whether the mDV'ing machine is on the 40m martian network.
I compiled seisBLRMS.
The tricks were the following:
(1) Don't add path in a deployed command.
It does not make sense to add paths in a compiled command, because the command may be moved anywhere; moreover, it can cause some weird side effects. Therefore, I enclosed the addpath part of mdv_config.m in an "if ~isdeployed ... end" clause to avoid adding paths when deployed. Instead of adding paths in the code, we have to point mcc at the necessary files with -I options at compile time. This way, mcc will add all the necessary files into the CTF archive.
(2) Add mex files to the CTF archive with -a options.
For some reason, mcc does not add the necessary mex files into the CTF archive, even though those files are called in the m-file being compiled. We have to add them explicitly with -a options.
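Putting (1) and (2) together, the compile invocation ends up looking something like the following (the -I path and mex filename here are placeholders, not the actual ones used):
mcc -v -m -R -nojvm seisBLRMS.m -I /cvs/cds/caltech/apps/mDV -a /cvs/cds/caltech/apps/mDV/extra/some_mex_file.mexglx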
(3) NDS_GetData() is slow for nodus when compiled.
NDS_GetData(), which is called by get_data(), stalls for a few minutes when using nodus as the NDS server.
This problem does not happen when not compiled; I don't know the reason. To avoid it, I modified seisBLRMS.m so that when an environment variable $NDS is defined, it uses the NDS server named in that variable.
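So a launch that pins the server looks like this (server value borrowed from the fb40m workaround above, using the start script described below; exactly how seisBLRMS.m parses $NDS depends on my modification):
NDS=fb40m:8088 ./start_seisBLRMS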
I wrote a Makefile to compile seisBLRMS. You can read the file to see the details of the tricks.
I also wrote a script start_seisBLRMS, which can be found in /cvs/cds/caltech/apps/DMF/compiled_matlab/seisblrms/. To start seisBLRMS, you can just call this script.
At this moment, seisBLRMS is running on megatron. Let's see if it continues to run without crashing.
We found that DMF/ was not an SVN working copy, so I wiped out the SVN version, imported the on-disk copy, moved it to DMFold/ and then checked out the SVN version.
We can delete DMFold/ whenever we are happy with the SVN copy.
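For reference, the shuffle was essentially this sequence (the repository URL is schematic, not the real one):
svn delete $SVNROOT/DMF -m "remove stale copy"
svn import DMF $SVNROOT/DMF -m "import on-disk DMF"
mv DMF DMFold
svn checkout $SVNROOT/DMF DMF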
I compiled it using the 'gcc' compiler instead of the 'ANSI C' compiler recommended in the README (which, I notice, is now missing from Ben Johnson's web page!). Let's see how long this runs.
I (think I) restarted DMF. It's on Mafalda, running in matlab (not the compiled version, which Rana was having trouble with back in the day). To start Matlab, I did "nohup matlab", ran mdv_config, then started seisBLRMS.m running. Since I used nohup, I then closed the terminal window, and am crossing my fingers in hopes that it continues to work. I would have used screen, but that doesn't seem to work on Mafalda.
Just kidding. That plan didn't work. The new plan: I started a terminal window on Op540, which is ssh-ed into Mafalda, and started up matlab to run seisBLRMS. That window is still open.
Because Unix was being finicky, I had to open an xterm window (xterm -bg green -fg black), and then ssh to mafalda and run matlab there. The symptoms which led to this were that even though in a regular terminal window on Op540, ssh-ed to mafalda, I could access tconvert, I could not make gps.m work in matlab. When Rana ssh-ed from Allegra to Op540 to Mafalda and ran matlab, he could get gps.m to work. So it seems like it was a Unix terminal crazy thing. Anyhow, starting an xterm window on Op540m and ssh-ing to mafalda from there seemed to work.
Hopefully this having a terminal window open and running DMF will be a temporary solution, and we can get the compiled version to work again soon.
I changed the input channels of the DMF recently so that it now uses 3 Guralp channels in addition to the 3 ACC and 1 Ranger.
op440m:seisblrms>diff seisBLRMS-datachans.txt~ seisBLRMS-datachans.txt
The seisBLRMS channels still have the wrong names of IX and EX, but I have chosen to keep them like this so that we have a long trend. When looking at the historical seisBLRMS trend, we just have to remember that all of the sensors have been around the MC since last summer.
The green xterm on op540m which is running the seisBLRMS DMF got stuck somehow ~3 days ago and lost its NDS connection. I closed the matlab session and restarted it. Seismic trends are now back online.
I restarted the seisBLRMS DMF monitor by ssh'ing into mafalda and starting up a matlab session. I also have started a StripTool session on rossa by forwarding the process from op440m.
We need to get the modern EPICS installation onto these linux machines by copying what K. Thorne has done at LLO.
Been non-functional for 3 weeks. Anyone else notice this? Images missing since ~Sep 21.
I've re-submitted the Condor job; pages should be back within the hour.
Summary pages will be unavailable today due to LDAS server maintenance. This is unrelated to the issue that Rana reported.
Each morning now, I am going to try to align both arms and lock. Along with that, sometime towards the end of each week, we should align the OpLevs. This is a good habit that should be practiced more often, and not only by me. As for the Y arm, Yehonathan and I had to adjust the gain to 0.15 in order to stabilize the lock.
Dead again. No outputs for the past month. We really need a cron job to check this out rather than wait for someone to look at the web page.
Max tells us that some conf files were bad and that he did something, and now some pages are being made. But the PEM and MEDM pages are blank. Also, the ASC tab looks bogus to me.
Going to the summary pages and looking at 'Today' seems to break it and crash the browser. Other tabs are OK, but 'summary' is our default page.
I've noticed this happening for a couple of days now. Today, I moved the .ini files which define the config for the pages from the old chans/ location into the /users/public_html/detcharsummary/ConfigFiles/ dir. Somehow, we should be maintaining version control of detcharsummary, but I think right now it's loose and free.
As discussed at the meeting earlier this week, we will use some old *MOPA* channels for interfacing with the PLL system Jon is setting up. He is going to put a sketch+photos up here shortly, but in the meantime, Koji helped me identify a channel that can be used to tune the temperature of the Lightwave NPRO crystal via front panel BNC input. It is C1:PSL-126MOPA_126CURADJ, and is configured to output between +/-10V, which is exactly what the controller can accept. The conversion factor from EPICS value to volts is currently set to 1 (i.e. EPICS value of +1 corresponds to +1V output from the DAC). With the help of the wiring diagram, we identified pins 3 and 4 on cross-connect #J7 as the differential outputs corresponding to this channel. Not sure if we need to also setup a TTL channel for servo ENABLE/DISABLE, but if so, the wiring diagram should help us identify this as well.
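Once wired up, tuning the NPRO temperature from EPICS should then be as simple as (value in volts, given the 1:1 EPICS-to-volts conversion noted above; 0.5 is just an example value):
caput C1:PSL-126MOPA_126CURADJ 0.5    # +0.5 V at the DAC output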
The cable from the DAC to the cross-connect was wrongly labelled. I fixed this now.
COVID-19 motivated me to revive the summary pages. With Alex Urban's help, the infrastructure was modernized and the wiki is now up to date. I ran a test job for 2020 March 17th, just for the IOO tab, and it works (see here). The LDAS rsync of our frames is still catching up; once that's done, we can restart the old condor jobs and have these updated on a more regular basis.
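For anyone wanting to reproduce that test: with the gwsumm-based setup, a single-day, single-tab job looks roughly like this (the config filename is a guess, and the flags are from memory of gwsumm's CLI, so treat this as a sketch):
gw_summary day 20200317 --config-file c1ioo.ini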
The 40m summary pages have been revived. I've not had to make any manual interventions in the last 5 days, so things seem somewhat stable, but I'm sure multiple tweaks will be needed. The primary uses of the pages right now are vacuum, seismic, and PSL diagnostics.
The summary pages were in a sad state of disrepair; the daily jobs hadn't been running for > 1 month. I only noticed today because Jordan wanted to look at some vacuum trends, and the summary pages are nice for long-term lookback. I rebooted it just now, and it seems to be running. @Tega, maybe you want to set up some kind of scripted health check that also sends an alert.
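A minimal sketch of such a health check (URL, path, and recipient are all placeholders), runnable from a daily cron job:
#!/bin/bash
# Alert if today's summary page directory is missing or unreachable.
today=$(date +%Y%m%d)
url="https://nodus.ligo.caltech.edu:30889/detcharsummary/day/$today/"
if ! curl -sf "$url" > /dev/null; then
    echo "40m summary pages look stale (nothing served for $today)" \
        | mail -s "summary pages alert" 40m-list@example.com
fi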
The summary pages had failed because of a conda env change. We are dependent on detchar's conda environment setup to run the scripts on the cluster. However, for some reason, when they upgraded to python3.9, they removed the python3.7 env, which was the cause of the original failure of the summary pages a couple of weeks ago. Here is a list of thoughts on how the pipeline can be made better.
Remember that all the files are to be edited on nodus and not on the cluster.
I updated the config file c1pem.ini in /users/public_html/detcharsummary/ConfigFiles and committed it, so I hope it works, but I did not have git push permissions. Does anyone know what the idea is here? Should we each do our own personal git clone and modify that way, or should we do it with the controls account?
The wiki needs to be cleared of all the outdated information on this workflow.
The changes are to make the y-scales useful. Currently, all of the past seis BLRMS plots are not so useful because the scales have not been set based on the actual signal levels. Let's see if this works, and we can re-evaluate after a few weeks.