ID | Date | Author | Type | Category | Subject
16299 | Wed Aug 25 18:20:21 2021 | Jamie | Update | CDS | GPS time on fb1 fixed, daqd writing correct frames again
I have no idea what happened to the GPS timing on fb1, but it seems like the issue was coincident with the power glitch on Monday.
As was noted by Koji above, the GPS time kernel interface was off by a year, which was causing the frame builder to write out files with the wrong names. fb1 was using DAQD components from the advligorts 3.3 release, which used the old "symmetricom" kernel module for the GPS time. This old module was also known to have issues with time offsets. This issue is reminiscent of previous timing issues with the DAQ on fb1.
I noted that a newer version of advligorts, version 3.4, was available for debian jessie, the system running on fb1. advligorts 3.4 includes a newer version of the GPS time module, renamed gpstime. I checked with Jonathan Hanks that the interfaces did not change between 3.3 and 3.4, and that 3.4 was mostly a bug fix and packaging release, so I decided to upgrade the DAQ to get the new components. I therefore did the following:
- updated the archive info in /etc/apt/sources.list.d/cdssoft.list, and added the "jessie-restricted" archive which includes the mx packages: https://git.ligo.org/cds-packaging/docs/-/wikis/home
- removed the symmetricom module from the kernel:
  sudo rmmod symmetricom
- upgraded the advligorts-daqd components (NOTE: I did not upgrade the rest of the system, although there are outstanding security upgrades needed):
  sudo apt install advligorts-daqd advligorts-daqd-dc-mx
- loaded the new gpstime module and checked that the GPS time was correct (see the sketch below):
  sudo modprobe gpstime
- restarted all the daqd processes:
  sudo systemctl restart daqd_*
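As a sanity check of the new module, one can compare the kernel's GPS time against a conversion from system UTC. This is a sketch: it assumes the gpstime module exposes the current GPS time through /proc/gps (as on other advligorts installs), and hard-codes the 18 leap seconds accumulated as of 2021:
  cat /proc/gps                              # GPS time reported by the kernel module
  echo $(( $(date +%s) - 315964800 + 18 ))   # GPS time from UTC: subtract the GPS epoch (1980-01-06), add leap seconds
The two numbers should agree to within a second.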
Everything came up fine at that point, and I checked that the correct frames were being written out.
16300 | Thu Aug 26 10:10:44 2021 | Paco | Update | CDS | FB is writing the frames with a year old date
[paco, ]
We went over to the X end to check what was going on with the TRX signal. We spotted that the ground terminal coming from the QPD was loosely touching the handle of one of the computers on the rack. When we detached it completely from the rack, the noise was gone (Attachment 1).
We taped this terminal so it doesn't touch anything accidentally. We don't know if this is the best solution, since it probably needs a stable voltage reference. In the Y end these ground terminals are connected to the same point on the rack. The other ground terminals in the X end are just cut.
We also took the PSD of these channels (Attachment 2). The noise seems to be gone, but TRX is still a bit noisier than TRY. Maybe we should set up a proper ground for the X arm QPD?
We saw that the X end station ALS laser was off. We turned it on, along with the crystal oven, and re-enabled the temperature controller. Green light immediately appeared. We then worked on restoring the ALS lock. After running XARM ASS we were unable to lock the green laser, so we went to the X end and moved the piezo X ALS alignment mirrors until we maximized the transmission in the right mode. We then locked the ALS beams on both arms successfully. It could very well be that the PZT offsets were reset by the power glitch. The XARM ALS still needs some tweaking; its level is ~25% of what it was before the power glitch.
Attachment 1: Screenshot_from_2021-08-26_10-09-50.png
Attachment 2: TRXTRY_Spectra.pdf
16302 | Thu Aug 26 10:30:14 2021 | Jamie | Configuration | CDS | front end time synchronization fixed?
I've been looking at why the front end NTP time synchronization did not seem to be working. I think it was not working because the NTP server the front ends were pointing to, fb1, was not actually responding to synchronization requests.
I cleaned up some things on fb1 and the front ends, which I think unstuck things.
On fb1:
- stopped/disabled the default client (systemd-timesyncd), and properly installed the full NTP server (ntp)
- the ntp server package for debian jessie is old-style SysV init, not systemd. To integrate it better, I copied the auto-generated service file to /etc/systemd/system/ntp.service and added an "[Install]" section that specifies that it should be available during the default "multi-user.target" (see the sketch after this list).
- "enabled" the new service to auto-start at boot ("sudo systemctl enable ntp.service")
- made sure ntp was configured to serve the front end network ('broadcast 192.168.123.255') and then restarted the server ("sudo systemctl restart ntp.service")
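For the record, a sketch of the two config fragments involved. The [Install] stanza is standard systemd; the ntp.conf fragment is an assumption about how the broadcast and access lines would look, not a verbatim copy of our file:
  # /etc/systemd/system/ntp.service (added stanza)
  [Install]
  WantedBy=multi-user.target

  # /etc/ntp.conf (sketched fragment)
  broadcast 192.168.123.255
  restrict 192.168.123.0 mask 255.255.255.0 nomodify notrap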
For the front ends:
- on fb1 I chroot'd into the front-end diskless root (/diskless/root) and manually specified that systemd-timesyncd should start on boot by creating a symlink to the timesyncd service in the multi-user.target directory:
$ sudo chroot /diskless/root
$ cd /etc/systemd/system/multi-user.target.wants
$ ln -s /lib/systemd/system/systemd-timesyncd.service
- on the front end itself (c1iscex as a test) I did a "systemctl daemon-reload" to force it to reload the systemd config, and then restarted the client ("systemctl restart systemd-timesyncd")
- checked the NTP synchronization with timedatectl:
controls@c1iscex:~ 0$ timedatectl
Local time: Thu 2021-08-26 11:35:10 PDT
Universal time: Thu 2021-08-26 18:35:10 UTC
RTC time: Thu 2021-08-26 18:35:10
Time zone: America/Los_Angeles (PDT, -0700)
NTP enabled: yes
NTP synchronized: yes
RTC in local TZ: no
DST active: yes
Last DST change: DST began at
Sun 2021-03-14 01:59:59 PST
Sun 2021-03-14 03:00:00 PDT
Next DST change: DST ends (the clock jumps one hour backwards) at
Sun 2021-11-07 01:59:59 PDT
Sun 2021-11-07 01:00:00 PST
controls@c1iscex:~ 0$
Note that it is now reporting "NTP enabled: yes" (the service is enabled to start at boot) and "NTP synchronized: yes" (synchronization is happening), neither of which it was reporting previously. I also note that the systemd-timesyncd client service is now loaded and enabled, no longer reports that it is in an "Idle" state, reports that it synchronized to the proper server, and is logging updates:
controls@c1iscex:~ 0$ sudo systemctl status systemd-timesyncd
● systemd-timesyncd.service - Network Time Synchronization
Loaded: loaded (/lib/systemd/system/systemd-timesyncd.service; enabled)
Active: active (running) since Thu 2021-08-26 10:20:11 PDT; 1h 22min ago
Docs: man:systemd-timesyncd.service(8)
Main PID: 2918 (systemd-timesyn)
Status: "Using Time Server 192.168.113.201:123 (ntpserver)."
CGroup: /system.slice/systemd-timesyncd.service
└─2918 /lib/systemd/systemd-timesyncd
Aug 26 10:20:11 c1iscex systemd[1]: Started Network Time Synchronization.
Aug 26 10:20:11 c1iscex systemd-timesyncd[2918]: Using NTP server 192.168.113.201:123 (ntpserver).
Aug 26 10:20:11 c1iscex systemd-timesyncd[2918]: interval/delta/delay/jitter/drift 64s/+0.000s/0.000s/0.000s/+26ppm
Aug 26 10:21:15 c1iscex systemd-timesyncd[2918]: interval/delta/delay/jitter/drift 128s/-0.000s/0.000s/0.000s/+25ppm
Aug 26 10:23:23 c1iscex systemd-timesyncd[2918]: interval/delta/delay/jitter/drift 256s/+0.001s/0.000s/0.000s/+26ppm
Aug 26 10:27:40 c1iscex systemd-timesyncd[2918]: interval/delta/delay/jitter/drift 512s/+0.003s/0.000s/0.001s/+29ppm
Aug 26 10:36:12 c1iscex systemd-timesyncd[2918]: interval/delta/delay/jitter/drift 1024s/+0.008s/0.000s/0.003s/+33ppm
Aug 26 10:53:16 c1iscex systemd-timesyncd[2918]: interval/delta/delay/jitter/drift 2048s/-0.026s/0.000s/0.010s/+27ppm
Aug 26 11:27:24 c1iscex systemd-timesyncd[2918]: interval/delta/delay/jitter/drift 2048s/+0.009s/0.000s/0.011s/+29ppm
controls@c1iscex:~ 0$
So I think this means everything is working.
I then went ahead and reloaded and restarted the timesyncd services on the rest of the front ends.
We still need to confirm that everything comes up properly the next time we have an opportunity to reboot fb1 and the front ends (or the opportunity is forced upon us).
There was speculation that the NTP clients on the front ends (systemd-timesyncd) would not work on a read-only filesystem, but this doesn't seem to be true. You can't trust everything you read on the internet.
16309 | Thu Sep 2 19:47:38 2021 | Koji | Update | CDS | This week's FB1 GPS Timing Issue Solved
After the reboot, daqd_dc was not working, but manually starting the open-mx / mx services solved the issue:
sudo systemctl start open-mx.service
sudo systemctl start mx.service
sudo systemctl start daqd_*
16310 | Thu Sep 2 20:44:18 2021 | Koji | Update | CDS | Chiara DHCP restarted
We had the issue of the RT machines rebooting. Once we hooked up the display on c1iscex, it turned out that it was not being given an IP address at boot.
I went to chiara and confirmed that the DHCP service was not running:
~>sudo service isc-dhcp-server status
[sudo] password for controls:
isc-dhcp-server stop/waiting
So the DHCP service was manually restarted:
~>sudo service isc-dhcp-server start
isc-dhcp-server start/running, process 24502
~>sudo service isc-dhcp-server status
isc-dhcp-server start/running, process 24502
16311 | Thu Sep 2 20:47:19 2021 | Koji | Update | CDS | Chiara DHCP restarted
[Paco, Tega, Koji]
Once chiara's DHCP was back, things got much more straightforward.
c1iscex and c1iscey were rebooted and the IOPs launched without any hesitation.
Paco ran rebootC1LSC.sh, and for the first time this year we had the processes launch without any issue.
16321 | Mon Sep 13 14:32:25 2021 | Yehonathan | Update | CDS | c1auxey assembly
So we agreed that the RTN points on the c1auxex Acromag chassis should just be grounded to the local Acromag ground, as it just needs a stable reference. Normally, the RTNs are not connected to any ground, so there should be no danger of forming ground loops by doing that. It is probably best to use the common wire from the 15V power supplies, since it also powers the VME crate. I took the spectra of the ETMX OSEMs (attachment) for reference before proceeding with the grounding work.
Attachment 1: ETMX_OSEMS_Noise.png
16325 | Tue Sep 14 15:57:05 2021 | jamie | Frogs | CDS | fb1 /var full after reboot, caused all sorts of problems
/var on fb1 filled up today, which caused all sorts of CDS issues. I found out about the problem by reading the logs of the services that were having trouble, which complained about not being able to write to disk. I looked at the filesystem status with 'df' and noticed that /var, where applications write their logs and temporary data, was full; a full /var will always cause problems.
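For reference, the standard triage commands for this kind of problem (a sketch of the usual tools, not a verbatim transcript of what was run):
  df -h /var                                 # check filesystem usage
  sudo du -sh /var/log/* | sort -rh | head   # find the biggest log files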
I tracked the issue down to multiple multi-gigabyte log files: /var/log/messages and /var/log/messages.1. They were full of lines like this one:
Aug 29 06:25:21 fb1 kernel: l called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1 [... the fragment "gpstime iotcl called cmd = 1" repeats many thousands of times ...]
Seems like something related to the gpstime kernel module?
Anyway, I deleted the log files for now, which cleared up the space on /var. Things should be back to normal now, until the logs fill up again...
16327 | Tue Sep 14 16:44:54 2021 | jamie | Frogs | CDS | fb1 /var full after reboot, caused all sorts of problems
Jonathan Hanks pointed me to this fix to the gpstime kernel module that was unfortunately put in after the 3.4 release that we're currently using:
https://git.ligo.org/cds/advligorts/-/commit/6f6d6e2eb1d3355d0cbfe9fe31ea3b59af1e7348
I hacked the source in place (/usr/src/gpstime-3.4/drv/gpstime/gpstime.c) to get the fix, and then rebuilt the kernel module with dkms:
sudo dkms uninstall gpstime/3.4
sudo dkms install gpstime/3.4
I then stopped daqd_dc, unloaded gpstime, reloaded it, restarted daqd_dc. The messages are no longer showing up in /var/log/messages, so I think we're ok for the moment.
NOTE: the fix will be undone if we for some reason reinstall the advligorts-gpstime-dkms package. There shouldn't be a need to do that, but we should be aware. I'm discussing with Jonathan whether we want to push out a new debian package to fix this issue...
16330 | Tue Sep 14 17:22:21 2021 | Anchal | Update | CDS | Added temp sensor channels to DAQ list
[Tega, Paco, Anchal]
We attempted to reboot the fb1 daqd today to get the new temperature sensor channels recording. However, the FE models got stuck, apparently due to the reasons explained in 40m/16325. Jamie cleared the /var logs on fb1 so that the FEs could reboot. After this work we were able to reboot the FE machines successfully and get the models running too. During the day, the FE machines were shut down and brought back manually, a couple of times on the c1iscex machine. The only change on fb1 is in /opt/rtcds/caltech/c1/chans/daq/C0EDCU.ini, where the new channels were added, and some hacking was done by Jamie in the gpstime module (see 40m/16327).
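For context, EDCU channels are added to C0EDCU.ini as bare EPICS channel-name sections. A sketch of what such an addition looks like; the channel names here are hypothetical placeholders, and the [default] block already exists in the file:
  [default]
  datatype=4
  datarate=16
  [C1:PEM-TEMP_SENSOR_XEND]
  [C1:PEM-TEMP_SENSOR_YEND]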
16332 | Wed Sep 15 11:27:50 2021 | Yehonathan | Update | CDS | c1auxey assembly
{Yehonathan, Paco}
We turned off the ETMX watchdogs and OpLevs. We went to the X end and shut down the Acromag chassis. We labeled the chassis feedthroughs and disconnected all the cables from it.
We took it out and tied the common wire of the power supplies (the commons of the 20V and 15V power supplies were shorted, so it makes no difference which we connect) to the RTNs of the analog inputs.
The chassis was put back in place, all the cables were reconnected, and the power was turned on.
We rebooted c1auxex and the channels came back online. We turned on the watchdogs and watched the ETMX motion get damped. We turned on the OpLev and waited until the beam position got centered on ETMX.
The attachments show a comparison between the OSEM spectra before and after the grounding work. It seems like there is no change.
We were able to lock the arms with no issues.
Attachment 1: c1auxex_Grounding_OSEM_comparison1.pdf
Attachment 2: c1auxex_Grounding_OSEM_comparison2.pdf
16351 | Tue Sep 21 11:09:34 2021 | Anchal | Summary | CDS | XARM YARM UGF Servo and Oscillators added
I've updated the c1lsc simulink model to add the so-called UGF servos in the XARM and YARM single arm loops as well. These were previously present only in the DARM, CARM, MICH and PRCL loops. The UGF servos themselves serve a larger purpose, but we won't be using that. What we gain is the ability to add an oscillator in the single arm loop and get the real-time demodulated signal before and after the point where the oscillator is added. This allows us to measure the open loop transfer function and its uncertainty at particular frequencies (set by the oscillator), and to build a noise budget for the calibration error of these transfer functions.
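For context, a sketch of the single-frequency OLTF estimate this enables (the usual relation, not taken from the model itself): with the oscillator injecting x_exc at a summing node in the loop, and demodulated signals a (just before the node, i.e. the loop's return) and b = a + x_exc (just after it), one trip around a loop with open-loop gain G gives a = G*b, so G(f_osc) = a/b, up to the sign convention of the feedback. The statistical spread of the demodulated outputs then gives the uncertainty of the estimate.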
The new model has been committed locally in the 40m/RTCDSmodels git repo. I do not have rights to push to the remote in git.ligo. The model builds, installs and starts correctly.
16354 | Wed Sep 22 12:40:04 2021 | Anchal | Summary | CDS | XARM YARM UGF Servo and Oscillators shifted to OAF
To reduce the burden on c1lsc, I've shifted the added UGF blocks to the c1oaf model. c1lsc had to be modified to allow the addition of an oscillator in the XARM and YARM control loops, and to send test points from before and after the addition point to c1oaf through shared-memory IPC, so the real-time demodulation happens in the c1oaf model.
The new models built and installed successfully and I've been able to recover both single arm locks after restarting the computers.
16365 | Wed Sep 29 17:10:09 2021 | Anchal | Summary | CDS | c1teststand problems summary
[anchal, ian]
We went and collected some information for the overlords to fix the c1teststand DAQ network issue.
- From c1teststand, the c1bhd and c1sus2 computers were not accessible through ssh ("No route to host"), so we restarted both computers (the I/O chassis were ON).
- After the computers restarted, we were able to ssh into c1bhd and c1sus2, and we ran rtcds start c1x06 and rtcds start c1x07.
- The first page in the attachment shows the screenshots of the GDS_TP screens of the IOP models after this step.
- Then we started the user models by running rtcds start c1bhd and rtcds start c1su2.
- The second page shows the screenshots of the GDS_TP screens. Note that the DAQ status is red on all the screens and the DC statuses are blank.
- So we checked if the daqd_ services were running on the fb computer. They were not, so we started them all with sudo systemctl start daqd_*.
- The third page shows the status of all services after this step; the daqd_dc.service remained in a failed state.
- open-mx_stream.service was not even loaded on fb. We started it by running sudo systemctl start open-mx_stream.service.
- The fourth page shows the status of this service. It started without any errors.
- However, when we went to check the status of mx_stream.service on c1bhd and c1sus2, they were not loaded, and when we tried to start them, they showed a failed state and kept trying to restart every 3 seconds without success (see pages 5 and 6).
- Finally, we also took a screenshot of the timedatectl command output on the three computers (fb, c1bhd, and c1sus2) to show that their times were not synced at all.
- The ntp service is running on fb, but it probably does not have access to any of the servers it is following.
- The timesyncd on c1bhd and c1sus2 (the FE machines) is also running but shows status 'Idle', which suggests they are unable to find the ntp signal from fb.
- I believe this issue is similar to what Jamie fixed on the fb1 in the martian network in 40m/16302. Since the fb on the c1teststand network was cloned before this fix, it might have this dysfunctional ntp as well.
We will try to get internet access to c1teststand soon. Meanwhile, someone with more experience and knowledge should look into this situation and try to fix it. We need to test the c1teststand within a few weeks now.
Attachment 1: c1teststand_issues_summary.pdf
16367 | Thu Sep 30 14:09:37 2021 | Anchal | Summary | CDS | New way to ssh into c1teststand
Late elog, original time Wed Sep 29 14:09:59 2021
We opened a new port (22220) in the router to the martian subnetwork, which is forwarded to port 22 on c1teststand (192.168.113.245), allowing direct ssh access to the c1teststand computer from the outside world. The command itself is redacted here; check out this wiki page for the unredacted info.
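For reference, the general shape of such an access command (the gateway address itself stays redacted, and 'controls' as the username is an assumption):
  ssh -p 22220 controls@<redacted-gateway-address>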
16372 | Mon Oct 4 11:05:44 2021 | Anchal | Summary | CDS | c1teststand problems summary
[Anchal, Paco]
We tried to fix the ntp synchronization on c1teststand today by repeating the steps listed in 40m/16302. Even though the cloned fb1 now has the exact same package version, conf & service files, and status, the FE machines (c1bhd and c1sus2) fail to sync to the time; timedatectl shows the same status 'Idle'. We also dug a bit deeper into the error messages of daqd_dc on the cloned fb1 and mx_stream on the FE machines, and have some error messages to report here.
Attempt at fixing the ntp:
- We copied the ntp package version 1:4.2.6 deb file from /var/cache/apt/archives/ntp_1%3a4.2.6.p5+dfsg-7+deb8u3_amd64.deb on the martian fb1 to the cloned fb1 and ran:
controls@fb1:~ 0$ sudo dpkg -i ntp_1%3a4.2.6.p5+dfsg-7+deb8u3_amd64.deb
- We got error messages about missing dependencies on libopts25 and libssl1.1. We downloaded the oldoldstable jessie versions of these packages from here and here, and ensured that these versions are higher than the versions ntp requires. We installed them with:
controls@fb1:~ 0$ sudo dpkg -i libopts25_5.18.12-3_amd64.deb
controls@fb1:~ 0$ sudo dpkg -i libssl1.1_1.1.0l-1~deb9u4_amd64.deb
- Then we installed the ntp package as described above. It asked us if we wanted to keep the configuration file; we pressed Y.
- However, we decided to make the configuration and service files exactly the same as on the martian fb1. We copied /etc/ntp.conf and /etc/systemd/system/ntp.service from the martian fb1 to the same locations on the cloned fb1. Then we enabled ntp, reloaded the daemon, and restarted the ntp service:
controls@fb1:~ 0$ sudo systemctl enable ntp
controls@fb1:~ 0$ sudo systemctl daemon-reload
controls@fb1:~ 0$ sudo systemctl restart ntp
- But of course, since fb1 doesn't have internet access, we got some errors in the status of ntp.service:
controls@fb1:~ 0$ sudo systemctl status ntp
● ntp.service - NTP daemon (custom service)
Loaded: loaded (/etc/systemd/system/ntp.service; enabled)
Active: active (running) since Mon 2021-10-04 17:12:58 UTC; 1h 15min ago
Main PID: 26807 (code=exited, status=0/SUCCESS)
CGroup: /system.slice/ntp.service
├─30408 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:107
└─30525 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:107
Oct 04 17:48:42 fb1 ntpd_intres[30525]: host name not found: 2.debian.pool.ntp.org
Oct 04 17:48:52 fb1 ntpd_intres[30525]: host name not found: 3.debian.pool.ntp.org
Oct 04 18:05:05 fb1 ntpd_intres[30525]: host name not found: 0.debian.pool.ntp.org
Oct 04 18:05:15 fb1 ntpd_intres[30525]: host name not found: 1.debian.pool.ntp.org
Oct 04 18:05:25 fb1 ntpd_intres[30525]: host name not found: 2.debian.pool.ntp.org
Oct 04 18:05:35 fb1 ntpd_intres[30525]: host name not found: 3.debian.pool.ntp.org
Oct 04 18:21:48 fb1 ntpd_intres[30525]: host name not found: 0.debian.pool.ntp.org
Oct 04 18:21:58 fb1 ntpd_intres[30525]: host name not found: 1.debian.pool.ntp.org
Oct 04 18:22:08 fb1 ntpd_intres[30525]: host name not found: 2.debian.pool.ntp.org
Oct 04 18:22:18 fb1 ntpd_intres[30525]: host name not found: 3.debian.pool.ntp.org
- But the ntpq command gives the same output as the ntpq command on the martian fb1 (except for the source servers), i.e. the broadcasting is happening in the same manner:
controls@fb1:~ 0$ ntpq -p
remote refid st t when poll reach delay offset jitter
==============================================================================
192.168.123.255 .BCST. 16 u - 64 0 0.000 0.000 0.000
- On the FE machines' side, though, systemd-timesyncd is still unable to read the time signal from fb1 and shows its status as idle:
controls@c1bhd:~ 3$ timedatectl
Local time: Mon 2021-10-04 18:34:38 UTC
Universal time: Mon 2021-10-04 18:34:38 UTC
RTC time: Mon 2021-10-04 18:34:38
Time zone: Etc/UTC (UTC, +0000)
NTP enabled: yes
NTP synchronized: no
RTC in local TZ: no
DST active: n/a
controls@c1bhd:~ 0$ systemctl status systemd-timesyncd -l
● systemd-timesyncd.service - Network Time Synchronization
Loaded: loaded (/lib/systemd/system/systemd-timesyncd.service; enabled)
Active: active (running) since Mon 2021-10-04 17:21:29 UTC; 1h 13min ago
Docs: man:systemd-timesyncd.service(8)
Main PID: 244 (systemd-timesyn)
Status: "Idle."
CGroup: /system.slice/systemd-timesyncd.service
└─244 /lib/systemd/systemd-timesyncd
- So the time synchronization is still not working. We expected the FE machines to just synchronize to fb1, even though it doesn't have any upstream ntp server to synchronize to, but that didn't happen.
- I'm (Anchal) working on getting internet access to c1teststand computers.
Digging into mx_stream/daqd_dc errors:
- We changed the Restart field in /etc/systemd/system/daqd_dc.service on the cloned fb1 to 2. This allows the service to fail and stop restarting after two attempts, which lets us see the real error message instead of the systemd message that the service is restarting too often (a sketch of this kind of unit-file change follows the output below). We got the following:
controls@fb1:~ 3$ sudo systemctl status daqd_dc -l
● daqd_dc.service - Advanced LIGO RTS daqd data concentrator
Loaded: loaded (/etc/systemd/system/daqd_dc.service; enabled)
Active: failed (Result: exit-code) since Mon 2021-10-04 17:50:25 UTC; 22s ago
Process: 715 ExecStart=/usr/bin/daqd_dc_mx -c /opt/rtcds/caltech/c1/target/daqd/daqdrc.dc (code=exited, status=1/FAILURE)
Main PID: 715 (code=exited, status=1/FAILURE)
Oct 04 17:50:24 fb1 systemd[1]: Started Advanced LIGO RTS daqd data concentrator.
Oct 04 17:50:25 fb1 daqd_dc_mx[715]: [Mon Oct 4 17:50:25 2021] Unable to set to nice = -20 -error Unknown error -1
Oct 04 17:50:25 fb1 daqd_dc_mx[715]: Failed to do mx_get_info: MX not initialized.
Oct 04 17:50:25 fb1 daqd_dc_mx[715]: 263596
Oct 04 17:50:25 fb1 systemd[1]: daqd_dc.service: main process exited, code=exited, status=1/FAILURE
Oct 04 17:50:25 fb1 systemd[1]: Unit daqd_dc.service entered failed state.
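A sketch of the kind of unit-file change described above (the exact directives in our daqd_dc.service are not recorded here; in stock systemd the restart-attempt limit is set with StartLimitBurst, which in jessie-era systemd lives in the [Service] section):
  [Service]
  Restart=on-failure
  StartLimitInterval=60
  StartLimitBurst=2     # give up after two failed starts within the interval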
- It seemed like the only thing the daqd_dc process doesn't like is that the mx_stream services are in a failed state on the FE computers. So we did the same trick on the FE machines to get the real error messages:
controls@fb1:~ 0$ sudo chroot /diskless/root
fb1:/ 0#
fb1:/ 0# sudo nano /etc/systemd/system/mx_stream.service
fb1:/ 0#
fb1:/ 0# exit
- Then I ssh'ed into c1bhd to see the error message of the mx_stream service properly:
controls@c1bhd:~ 0$ sudo systemctl daemon-reload
controls@c1bhd:~ 0$ sudo systemctl restart mx_stream
controls@c1bhd:~ 0$ sudo systemctl status mx_stream -l
● mx_stream.service - Advanced LIGO RTS front end mx stream
Loaded: loaded (/etc/systemd/system/mx_stream.service; enabled)
Active: failed (Result: exit-code) since Mon 2021-10-04 17:57:20 UTC; 24s ago
Process: 11832 ExecStart=/etc/mx_stream_exec (code=exited, status=1/FAILURE)
Main PID: 11832 (code=exited, status=1/FAILURE)
Oct 04 17:57:20 c1bhd systemd[1]: Starting Advanced LIGO RTS front end mx stream...
Oct 04 17:57:20 c1bhd systemd[1]: Started Advanced LIGO RTS front end mx stream.
Oct 04 17:57:20 c1bhd mx_stream_exec[11832]: send len = 263596
Oct 04 17:57:20 c1bhd mx_stream_exec[11832]: OMX: Failed to find peer index of board 00:00:00:00:00:00 (Peer Not Found in the Table)
Oct 04 17:57:20 c1bhd mx_stream_exec[11832]: mx_connect failed Nic ID not Found in Peer Table
Oct 04 17:57:20 c1bhd mx_stream_exec[11832]: c1x06_daq mmapped address is 0x7f516a97a000
Oct 04 17:57:20 c1bhd mx_stream_exec[11832]: c1bhd_daq mmapped address is 0x7f516697a000
Oct 04 17:57:20 c1bhd systemd[1]: mx_stream.service: main process exited, code=exited, status=1/FAILURE
Oct 04 17:57:20 c1bhd systemd[1]: Unit mx_stream.service entered failed state.
- c1sus2 shows the same error. I'm not sure I understand these errors at all, but they seem to have nothing to do with timing issues!
As usual, some help would be helpful.
16376 | Mon Oct 4 18:00:16 2021 | Koji | Summary | CDS | c1teststand problems summary
I don't know anything about mx/open-mx, but you also need open-mx, don't you?
controls@c1ioo:~ 0$ systemctl status *mx*
● open-mx.service - LSB: starts Open-MX driver
Loaded: loaded (/etc/init.d/open-mx)
Active: active (running) since Wed 2021-09-22 11:54:39 PDT; 1 weeks 5 days ago
Process: 470 ExecStart=/etc/init.d/open-mx start (code=exited, status=0/SUCCESS)
CGroup: /system.slice/open-mx.service
└─620 /opt/3.2.88-csp/open-mx-1.5.4/bin/fma -d
● mx_stream.service - Advanced LIGO RTS front end mx stream
Loaded: loaded (/etc/systemd/system/mx_stream.service; enabled)
Active: active (running) since Wed 2021-09-22 12:08:00 PDT; 1 weeks 5 days ago
Main PID: 5785 (mx_stream)
CGroup: /system.slice/mx_stream.service
└─5785 /usr/bin/mx_stream -e 0 -r 0 -w 0 -W 0 -s c1x03 c1ioo c1als c1omc -d fb1:0
16381 | Tue Oct 5 17:58:52 2021 | Anchal | Summary | CDS | c1teststand problems summary
The open-mx service is running successfully on the fb1 (clone), c1bhd, and c1sus2.
Quote: "I don't know anything about mx/open-mx, but you also need open-mx, don't you?"
16382 | Tue Oct 5 18:00:53 2021 | Anchal | Summary | CDS | c1teststand time synchronization working now
Today I got a new router that I used to connect the c1teststand, fb1 and chiara. I was able to see internet access in c1teststand and fb1, but not in chiara. I'm not sure why that is the case.
The good news is that the ntp server on fb1 (clone) is working fine now, and both FE computers, c1bhd and c1sus2, are successfully synchronized to the fb1 (clone) NTP server. This resolves any possible timing issues in this DAQ network.
On running the IOP and user models, however, I see the same errors as mentioned in 40m/16372. Something to do with:
Oct 06 00:47:56 c1sus2 mx_stream_exec[21796]: OMX: Failed to find peer index of board 00:00:00:00:00:00 (Peer Not Found in the Table)
Oct 06 00:47:56 c1sus2 mx_stream_exec[21796]: mx_connect failed Nic ID not Found in Peer Table
Oct 06 00:47:56 c1sus2 mx_stream_exec[21796]: c1x07_daq mmapped address is 0x7fa4819cc000
Oct 06 00:47:56 c1sus2 mx_stream_exec[21796]: c1su2_daq mmapped address is 0x7fa47d9cc000
Update, Thu Oct 7 17:04:31 2021:
I fixed the issue of chiara not getting internet. Now c1teststand, fb1, and chiara all have internet connections. It was an issue with the default gateway, the interface, and finding the DNS; I have found the correct settings now.
16391 | Mon Oct 11 17:31:25 2021 | Anchal | Summary | CDS | Fixed mounting of mx devices in fb. daqd_dc is running now.
I compared the fb1 on the main network with the cloned fb1 and found a crucial difference. The main fb1, where CDS is running fine, has mx devices mounted in /dev/, like mx0, mx1 up to mx7, mxctl, mxctlp, and mxp0, mxp1 up to mxp7. The cloned fb1 does not have any of these mx devices mounted. I think this is where the issue was coming from.
However, lspci | grep 'Myri' shows the following output on both computers:
controls@fb1:/dev 0$ lspci | grep 'Myri'
02:00.0 Ethernet controller: MYRICOM Inc. Myri-10G Dual-Protocol NIC (rev 01)
This means that the computer detects the card in the PCIe slot.
I tried adding this to /etc/rc.local to run at every boot, but it did not work, so for now I'll just do this step manually every time (see the note after the listing below). Once the devices are loaded, we get:
controls@fb1:/etc 0$ ls /dev/*mx*
/dev/mx0 /dev/mx4 /dev/mxctl /dev/mxp2 /dev/mxp6 /dev/ptmx
/dev/mx1 /dev/mx5 /dev/mxctlp /dev/mxp3 /dev/mxp7
/dev/mx2 /dev/mx6 /dev/mxp0 /dev/mxp4 /dev/open-mx
/dev/mx3 /dev/mx7 /dev/mxp1 /dev/mxp5 /dev/open-mx-raw
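It is not recorded above exactly what the manual step is; judging from elog 16309, a plausible candidate (an assumption, not a transcript) is starting the mx service, which loads the driver that creates these device nodes:
  sudo systemctl start mx.service
  ls /dev/*mx*     # confirm the mx device nodes appeared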
Then, restarting all the daqd_ processes, I found that daqd_dc was now running successfully. Here is the status:
controls@fb1:/etc 0$ sudo systemctl status daqd_* -l
● daqd_dc.service - Advanced LIGO RTS daqd data concentrator
Loaded: loaded (/etc/systemd/system/daqd_dc.service; enabled)
Active: active (running) since Mon 2021-10-11 17:48:00 PDT; 23min ago
Main PID: 2308 (daqd_dc_mx)
CGroup: /daqd.slice/daqd_dc.service
├─2308 /usr/bin/daqd_dc_mx -c /opt/rtcds/caltech/c1/target/daqd/daqdrc.dc
└─2370 caRepeater
Oct 11 17:48:07 fb1 daqd_dc_mx[2308]: mx receiver 006 thread priority error Operation not permitted[Mon Oct 11 17:48:06 2021]
Oct 11 17:48:07 fb1 daqd_dc_mx[2308]: mx receiver 005 thread put on CPU 0
Oct 11 17:48:07 fb1 daqd_dc_mx[2308]: [Mon Oct 11 17:48:06 2021] [Mon Oct 11 17:48:06 2021] mx receiver 006 thread put on CPU 0
Oct 11 17:48:07 fb1 daqd_dc_mx[2308]: mx receiver 007 thread put on CPU 0
Oct 11 17:48:07 fb1 daqd_dc_mx[2308]: [Mon Oct 11 17:48:06 2021] mx receiver 003 thread - label dqmx003 pid=2362
Oct 11 17:48:07 fb1 daqd_dc_mx[2308]: [Mon Oct 11 17:48:06 2021] mx receiver 003 thread priority error Operation not permitted
Oct 11 17:48:07 fb1 daqd_dc_mx[2308]: [Mon Oct 11 17:48:06 2021] mx receiver 003 thread put on CPU 0
Oct 11 17:48:07 fb1 daqd_dc_mx[2308]: warning:regcache incompatible with malloc
Oct 11 17:48:07 fb1 daqd_dc_mx[2308]: [Mon Oct 11 17:48:06 2021] EDCU has 410 channels configured; first=0
Oct 11 17:49:06 fb1 daqd_dc_mx[2308]: [Mon Oct 11 17:49:06 2021] ->4: clear crc
● daqd_fw.service - Advanced LIGO RTS daqd frame writer
Loaded: loaded (/etc/systemd/system/daqd_fw.service; enabled)
Active: active (running) since Mon 2021-10-11 17:48:01 PDT; 23min ago
Main PID: 2318 (daqd_fw)
CGroup: /daqd.slice/daqd_fw.service
└─2318 /usr/bin/daqd_fw -c /opt/rtcds/caltech/c1/target/daqd/daqdrc.fw
Oct 11 17:48:09 fb1 daqd_fw[2318]: [Mon Oct 11 17:48:09 2021] [Mon Oct 11 17:48:09 2021] Producer thread - label dqproddbg pid=2440
Oct 11 17:48:09 fb1 daqd_fw[2318]: Producer crc thread priority error Operation not permitted
Oct 11 17:48:09 fb1 daqd_fw[2318]: [Mon Oct 11 17:48:09 2021] [Mon Oct 11 17:48:09 2021] Producer crc thread put on CPU 0
Oct 11 17:48:09 fb1 daqd_fw[2318]: Producer thread priority error Operation not permitted
Oct 11 17:48:09 fb1 daqd_fw[2318]: [Mon Oct 11 17:48:09 2021] Producer thread put on CPU 0
Oct 11 17:48:09 fb1 daqd_fw[2318]: [Mon Oct 11 17:48:09 2021] Producer thread - label dqprod pid=2434
Oct 11 17:48:09 fb1 daqd_fw[2318]: [Mon Oct 11 17:48:09 2021] Producer thread priority error Operation not permitted
Oct 11 17:48:09 fb1 daqd_fw[2318]: [Mon Oct 11 17:48:09 2021] Producer thread put on CPU 0
Oct 11 17:48:10 fb1 daqd_fw[2318]: [Mon Oct 11 17:48:10 2021] Minute trender made GPS time correction; gps=1318034906; gps%60=26
Oct 11 17:49:09 fb1 daqd_fw[2318]: [Mon Oct 11 17:49:09 2021] ->3: clear crc
● daqd_rcv.service - Advanced LIGO RTS daqd testpoint receiver
Loaded: loaded (/etc/systemd/system/daqd_rcv.service; enabled)
Active: active (running) since Mon 2021-10-11 17:48:00 PDT; 23min ago
Main PID: 2311 (daqd_rcv)
CGroup: /daqd.slice/daqd_rcv.service
└─2311 /usr/bin/daqd_rcv -c /opt/rtcds/caltech/c1/target/daqd/daqdrc.rcv
Oct 11 17:50:21 fb1 daqd_rcv[2311]: Creating C1:DAQ-NDS0_C1X07_CRC_SUM
Oct 11 17:50:21 fb1 daqd_rcv[2311]: Creating C1:DAQ-NDS0_C1BHD_STATUS
Oct 11 17:50:21 fb1 daqd_rcv[2311]: Creating C1:DAQ-NDS0_C1BHD_CRC_CPS
Oct 11 17:50:21 fb1 daqd_rcv[2311]: Creating C1:DAQ-NDS0_C1BHD_CRC_SUM
Oct 11 17:50:21 fb1 daqd_rcv[2311]: Creating C1:DAQ-NDS0_C1SU2_STATUS
Oct 11 17:50:21 fb1 daqd_rcv[2311]: Creating C1:DAQ-NDS0_C1SU2_CRC_CPS
Oct 11 17:50:21 fb1 daqd_rcv[2311]: Creating C1:DAQ-NDS0_C1SU2_CRC_SUM
Oct 11 17:50:21 fb1 daqd_rcv[2311]: Creating C1:DAQ-NDS0_C1OM[Mon Oct 11 17:50:21 2021] Epics server started
Oct 11 17:50:24 fb1 daqd_rcv[2311]: [Mon Oct 11 17:50:24 2021] Minute trender made GPS time correction; gps=1318035040; gps%120=40
Oct 11 17:51:21 fb1 daqd_rcv[2311]: [Mon Oct 11 17:51:21 2021] ->3: clear crc
Now, even before starting the FE models, I see the DC status as 0x2bad on the CDS screens of the IOP and user models. The mx_stream service remains in a failed state on the FE machines, and stays that way even after restarting the service.
controls@c1sus2:~ 0$ sudo systemctl status mx_stream -l
● mx_stream.service - Advanced LIGO RTS front end mx stream
Loaded: loaded (/etc/systemd/system/mx_stream.service; enabled)
Active: failed (Result: exit-code) since Mon 2021-10-11 17:50:26 PDT; 15min ago
Process: 382 ExecStart=/etc/mx_stream_exec (code=exited, status=1/FAILURE)
Main PID: 382 (code=exited, status=1/FAILURE)
Oct 11 17:50:25 c1sus2 systemd[1]: Starting Advanced LIGO RTS front end mx stream...
Oct 11 17:50:25 c1sus2 systemd[1]: Started Advanced LIGO RTS front end mx stream.
Oct 11 17:50:25 c1sus2 mx_stream_exec[382]: Failed to open endpoint Not initialized
Oct 11 17:50:26 c1sus2 systemd[1]: mx_stream.service: main process exited, code=exited, status=1/FAILURE
Oct 11 17:50:26 c1sus2 systemd[1]: Unit mx_stream.service entered failed state.
But if I restart the mx_stream service before starting the rtcds models, the mx_stream service starts successfully:
controls@c1sus2:~ 0$ sudo systemctl restart mx_stream
controls@c1sus2:~ 0$ sudo systemctl status mx_stream -l
● mx_stream.service - Advanced LIGO RTS front end mx stream
Loaded: loaded (/etc/systemd/system/mx_stream.service; enabled)
Active: active (running) since Mon 2021-10-11 18:14:13 PDT; 25s ago
Main PID: 1337 (mx_stream)
CGroup: /system.slice/mx_stream.service
└─1337 /usr/bin/mx_stream -e 0 -r 0 -w 0 -W 0 -s c1x07 c1su2 -d fb1:0
Oct 11 18:14:13 c1sus2 systemd[1]: Starting Advanced LIGO RTS front end mx stream...
Oct 11 18:14:13 c1sus2 systemd[1]: Started Advanced LIGO RTS front end mx stream.
Oct 11 18:14:13 c1sus2 mx_stream_exec[1337]: send len = 263596
Oct 11 18:14:13 c1sus2 mx_stream_exec[1337]: Connection Made
However, the DC status on the CDS screens still shows 0x2bad. As soon as I start the rtcds model c1x07 (the IOP model for c1sus2), the mx_stream service fails:
controls@c1sus2:~ 0$ sudo systemctl status mx_stream -l
● mx_stream.service - Advanced LIGO RTS front end mx stream
Loaded: loaded (/etc/systemd/system/mx_stream.service; enabled)
Active: failed (Result: exit-code) since Mon 2021-10-11 18:18:03 PDT; 27s ago
Process: 1337 ExecStart=/etc/mx_stream_exec (code=exited, status=1/FAILURE)
Main PID: 1337 (code=exited, status=1/FAILURE)
Oct 11 18:14:13 c1sus2 systemd[1]: Starting Advanced LIGO RTS front end mx stream...
Oct 11 18:14:13 c1sus2 systemd[1]: Started Advanced LIGO RTS front end mx stream.
Oct 11 18:14:13 c1sus2 mx_stream_exec[1337]: send len = 263596
Oct 11 18:14:13 c1sus2 mx_stream_exec[1337]: Connection Made
Oct 11 18:18:03 c1sus2 mx_stream_exec[1337]: isendxxx failed with status Remote Endpoint Unreachable
Oct 11 18:18:03 c1sus2 mx_stream_exec[1337]: disconnected from the sender
Oct 11 18:18:03 c1sus2 mx_stream_exec[1337]: c1x07_daq mmapped address is 0x7fe3620c3000
Oct 11 18:18:03 c1sus2 mx_stream_exec[1337]: c1su2_daq mmapped address is 0x7fe35e0c3000
Oct 11 18:18:03 c1sus2 systemd[1]: mx_stream.service: main process exited, code=exited, status=1/FAILURE
Oct 11 18:18:03 c1sus2 systemd[1]: Unit mx_stream.service entered failed state.
This shows that starting the rtcds model causes the failure in mx_stream, possibly due to an inability to find the endpoint on fb1. I've again reached the edge of my knowledge here. Maybe the fiber optic connection between fb and the network switch that connects to the FEs is bad, or the connection between the switch and the FEs is bad.
But we are just one step away from making this work.
16392 | Mon Oct 11 18:29:35 2021 | Anchal | Summary | CDS | Moving forward?
The teststand has some non-trivial issue with the Myrinet card (either software or hardware) which even the experts say they don't remember how to fix. CDS with mx was in use more than a decade ago, so it is hard to find support for issues with it now, and it will be the same in the future. We need to wrap up this test procedure one way or another now, so I see the following two options moving forward:
Direct integration with main CDS and testing
- We can just connect the c1sus2 and c1bhd FE computers to martian network directly.
- We'll have to connect c1sus2 and c1bhd to the optical fiber subnetwork as well.
- On booting, they would get booted through the existing fb1 boot server, which seems to work fine for the other 5 FE machines.
- We can update the DHCP in chiara and reload it so that we can ssh into these FEs with host names.
- Hopefully, the presence of these computers won't tank the existing CDS even if they themselves have any issues, as they have no shared memory with other models.
- If this works, we can do the loopback testing of the I/O chassis using the main DAQ network and move on with our upgrade.
- If this does not work and causes any harm to the existing CDS network, we can disconnect these computers and go back to the existing CDS. Recently, our confidence in rebooting the CDS has increased with its robust performance, as some legacy issues were fixed.
- We would, however, continue to use a CDS which is no longer supported by the current LIGO CDS group.
Testing CDS upgrade on teststand
- From what I could gather, most of the hardware in the I/O chassis is still used in the CDS at LLO and LHO, with their recent tests and documents using the same cards and PCBs.
- There might be some differences in the DAQ network setup that I need to confirm.
- I've summarised the current c1teststand hardware on this wiki page.
- If the latest CDS is backwards compatible with our hardware, we can test the new CDS in the c1teststand setup without disrupting our main CDS. We'll have ample help and support for this upgrade from the current LIGO CDS group.
- We can do the loopback testing of the I/O chassis as well.
- If the upgrade is successful in the teststand without many hardware changes, we can upgrade the main CDS of the 40m as well, as it has the same hardware as our teststand.
- The biggest plus would be that our CDS will be up to date and we will be able to get help from the CDS group if any trouble occurs.
So these are the two options we have. We should discuss which one to take in the mattermost chat or in an upcoming meeting.
16395 | Tue Oct 12 17:10:56 2021 | Anchal | Summary | CDS | Some more information
Chris pointed out some information-displaying scripts that show whether the DAQ network is working or not. I thought it would be nice to log this information here as well.
controls@fb1:/opt/mx/bin 0$ ./mx_info
MX Version: 1.2.16
MX Build: controls@fb1:/opt/src/mx-1.2.16 Mon Aug 14 11:06:09 PDT 2017
1 Myrinet board installed.
The MX driver is configured to support a maximum of:
8 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
===================================================================
Instance #0: 364.4 MHz LANai, PCI-E x8, 2 MB SRAM, on NUMA node 0
Status: Running, P0: Link Up
Network: Ethernet 10G
MAC Address: 00:60:dd:45:37:86
Product code: 10G-PCIE-8B-S
Part number: 09-04228
Serial number: 423340
Mapper: 00:60:dd:45:37:86, version = 0x00000000, configured
Mapped hosts: 3
ROUTE COUNT
INDEX MAC ADDRESS HOST NAME P0
----- ----------- --------- ---
0) 00:60:dd:45:37:86 fb1:0 1,0
1) 00:25:90:05:ab:47 c1bhd:0 1,0
2) 00:25:90:06:69:c3 c1sus2:0 1,0
controls@c1bhd:~ 1$ /opt/open-mx/bin/omx_info
Open-MX version 1.5.4
build: root@fb1:/opt/src/open-mx-1.5.4 Tue Aug 15 23:48:03 UTC 2017
Found 1 boards (32 max) supporting 32 endpoints each:
c1bhd:0 (board #0 name eth1 addr 00:25:90:05:ab:47)
managed by driver 'igb'
Peer table is ready, mapper is 00:60:dd:45:37:86
================================================
0) 00:25:90:05:ab:47 c1bhd:0
1) 00:60:dd:45:37:86 fb1:0
2) 00:25:90:06:69:c3 c1sus2:0
controls@c1sus2:~ 0$ /opt/open-mx/bin/omx_info
Open-MX version 1.5.4
build: root@fb1:/opt/src/open-mx-1.5.4 Tue Aug 15 23:48:03 UTC 2017
Found 1 boards (32 max) supporting 32 endpoints each:
c1sus2:0 (board #0 name eth1 addr 00:25:90:06:69:c3)
managed by driver 'igb'
Peer table is ready, mapper is 00:60:dd:45:37:86
================================================
0) 00:25:90:06:69:c3 c1sus2:0
1) 00:60:dd:45:37:86 fb1:0
2) 00:25:90:05:ab:47 c1bhd:0
These outputs prove that the framebuilder and the FEs are able to see each other on the DAQ network.
Further, the error that we see when the IOP model is started, which crashes the mx_stream service on the FE machines (see 40m/16391):
isendxxx failed with status Remote Endpoint Unreachable
This has been seen before, when Jamie was troubleshooting the current fb1 on the martian network in 40m/11655 in Oct 2015. Unfortunately, I could not find what Jamie did over the following year to fix this issue.
16396 | Tue Oct 12 17:20:12 2021 | Anchal | Summary | CDS | Connected c1sus2 to martian network
I connected c1sus2 to the martian network by splitting the c1sim connection with a 5-way switch. I also ran another ethernet cable from the second port of c1sus2 to the DAQ network switch on 1X7.
Then I logged into chiara and added the following in chiara:/etc/dhcp/dhcpd.conf:
host c1sus2 {
hardware ethernet 00:25:90:06:69:C2;
fixed-address 192.168.113.92;
}
And the following line in chiara:/var/lib/bind/martian.hosts:
c1sus2 A 192.168.113.92
Note that entries for c1bhd already exist in these files, probably from some earlier testing by Gautam or Jon. Then I ran the following to restart the dhcp server and the nameserver:
~> sudo service bind9 reload
[sudo] password for controls:
* Reloading domain name service... bind9 [ OK ]
~> sudo service isc-dhcp-server restart
isc-dhcp-server stop/waiting
isc-dhcp-server start/running, process 25764
Now, as I switched on c1sus2 from the front panel, it booted over the network from fb1 like the other FE machines, and I was able to log into it by first logging into fb1 and then sshing to c1sus2.
Next, I copied the simulink models and the medm screens of c1x06, c1x07, c1bhd, and c1su2 from the paths mentioned on this wiki page. I also copied the medm screens from chiara(clone):/opt/rtcds/caltech/c1/medm to the martian-network chiara in the appropriate places. I have placed the file /opt/rtcds/caltech/c1/medm/teststand_sitemap.adl, which can be used to open the sitemap for the c1bhd and c1sus2 IOP and user models.
Then I logged into c1sus2 (via fb1) and did the make, install, start procedure:
controls@c1sus2:~ 0$ rtcds make c1x07
buildd: /opt/rtcds/caltech/c1/rtbuild/release
### building c1x07...
Cleaning c1x07...
Done
Parsing the model c1x07...
Done
Building EPICS sequencers...
Done
Building front-end Linux kernel module c1x07...
Done
RCG source code directory:
/opt/rtcds/rtscore/branches/branch-3.4
The following files were used for this build:
/opt/rtcds/userapps/release/cds/c1/models/c1x07.mdl
Successfully compiled c1x07
***********************************************
Compile Warnings, found in c1x07_warnings.log:
***********************************************
***********************************************
controls@c1sus2:~ 0$ rtcds install c1x07
buildd: /opt/rtcds/caltech/c1/rtbuild/release
### installing c1x07...
Installing system=c1x07 site=caltech ifo=C1,c1
Installing /opt/rtcds/caltech/c1/chans/C1X07.txt
Installing /opt/rtcds/caltech/c1/target/c1x07/c1x07epics
Installing /opt/rtcds/caltech/c1/target/c1x07
Installing start and stop scripts
/opt/rtcds/caltech/c1/scripts/killc1x07
/opt/rtcds/caltech/c1/scripts/startc1x07
sudo: unable to resolve host c1sus2
Performing install-daq
Updating testpoint.par config file
/opt/rtcds/caltech/c1/target/gds/param/testpoint.par
/opt/rtcds/rtscore/branches/branch-3.4/src/epics/util/updateTestpointPar.pl -par_file=/opt/rtcds/caltech/c1/target/gds/param/archive/testpoint_211012_174226.par -gds_node=24 -site_letter=C -system=c1x07 -host=c1sus2
Installing GDS node 24 configuration file
/opt/rtcds/caltech/c1/target/gds/param/tpchn_c1x07.par
Installing auto-generated DAQ configuration file
/opt/rtcds/caltech/c1/chans/daq/C1X07.ini
Installing Epics MEDM screens
Running post-build script
safe.snap exists
controls@c1sus2:~ 0$ rtcds start c1x07
Cannot start/stop model 'c1x07' on host c1sus2.
controls@c1sus2:~ 4$ rtcds list
controls@c1sus2:~ 0$
One can see that even after making and installing, the model c1x07 is not listed among the available models in rtcds list. The same is the case for c1su2 as well, so I could not proceed with testing.
The good news is that nothing I did affected the current CDS functioning, so we can probably do this testing safely from the main CDS setup.
16397 | Tue Oct 12 23:42:56 2021 | Koji | Summary | CDS | Connected c1sus2 to martian network
Don't you need to add the new hosts to /diskless/root/etc/rtsystab on fb1? --> There seem to be many elogs talking about editing "rtsystab".
controls@fb1:/diskless/root/etc 0$ cat rtsystab
#
# host list of control systems to run, starting with IOP
#
c1iscex c1x01 c1scx c1asx
c1sus c1x02 c1sus c1mcs c1rfm c1pem
c1ioo c1x03 c1ioo c1als c1omc
c1lsc c1x04 c1lsc c1ass c1oaf c1cal c1dnn c1daf
c1iscey c1x05 c1scy c1asy
#c1test c1x10 c1tst2
16398 | Wed Oct 13 11:25:14 2021 | Anchal | Summary | CDS | Ran c1sus2 models in martian CDS. All good!
Three extra steps (when adding new models, new FE):
- Chris pointed out that the sudo command on c1sus2 was giving an error:
sudo: unable to resolve host c1sus2
This error comes up when the computer cannot resolve its own hostname. Since the FEs are network-booted off fb1, we need to update /etc/hosts in /diskless/root every time we add a new FE.
controls@fb1:~ 0$ sudo chroot /diskless/root
fb1:/ 0# sudo nano /etc/hosts
fb1:/ 0# exit
I added the following line in the /etc/hosts file above:
192.168.113.92 c1sus2 c1sus2.martian
This resolved the issue of sudo giving an error. Now the rtcds make and install steps report no errors in their outputs.
- Another thing that needs to be done, as Koji pointed out, is to add the host and models in /etc/rtsystab in /diskless/root of fb:
controls@fb1:~ 0$ sudo chroot /diskless/root
fb1:/ 0# sudo nano /etc/rtsystab
fb1:/ 0# exit
I added the following line in the /etc/rtsystab file above:
c1sus2 c1x07 c1su2
This told rtcds what models would be available on c1sus2. Now rtcds list is displaying the right models:
controls@c1sus2:~ 0$ rtcds list
c1x07
c1su2
- The above steps are still not sufficient for the daqd_ processes to know about the new models. This part is supposed to happen automatically, but apparently does not in our CDS. So every time there is a new model, we need to edit the file /opt/rtcds/caltech/c1/target/daqd/master and add the following lines to it:
# Fast Data Channel lists
# c1sus2
/opt/rtcds/caltech/c1/chans/daq/C1X07.ini
/opt/rtcds/caltech/c1/chans/daq/C1SU2.ini
# test point lists
# c1sus2
/opt/rtcds/caltech/c1/target/gds/param/tpchn_c1x07.par
/opt/rtcds/caltech/c1/target/gds/param/tpchn_c1su2.par
I needed to restart the daqd_ processes in fb1 for them to notice these changes:
controls@fb1:~ 0$ sudo systemctl restart daqd_*
This finally lit up the DC status channels in C1X07_GDS_TP.adl and C1SU2_GDS_TP.adl. However, the channels C1:DAQ-DC0_C1X07_STATUS and C1:DAQ-DC0_C1SU2_STATUS both had the value 0x2bad, and this persisted on restarting the models. I then simply restarted mx_stream on c1sus2 and boom, it worked! (See the attached all-green screen, never seen before!)
So now Ian can work on testing the I/O chassis, and we should be good to move the c1sus2 FE and I/O chassis to 1Y3 after that. I've also made the following extra changes:
- Updated CDS_FE_STATUS medm screen to show the new c1sus2 host.
- Updated global diag rest script to act on c1xo7 and c1su2 as well.
- Updated mxstream restart script to act on c1sus2 as well.
|
Attachment 1: CDS_screens_running.png
|
|
16414
|
Tue Oct 19 18:20:33 2021 |
Ian MacMillan | Summary | CDS | c1sus2 DAC to ADC test |
I ran a DAC-to-ADC test on the c1sus2 channels, hooking the outputs of the DAC up to the input channels of the ADC. We used different combinations of ADCs and DACs to make sure that no errors could cancel each other out in the end. I took a transfer function across these channel combinations to reproduce figure 1 in T2000188.
As seen in the two attached PDFs, the channels seem to be working properly: they have a flat response with a gain of 0.5 (-6 dB). This is the expected response; it is the result of the DAC signal being sent single-ended while the ADC receives it differentially, which should yield a recorded signal at 0.5 times the amplitude of the actual output signal.
The drop-off at the high-frequency end is the result of the anti-aliasing filter and the anti-imaging filter. Both are 8-pole elliptic filters (20 dB/decade per pole, i.e. 160 dB/decade each), so in series we expect a roll-off of 320 dB per decade. I measured the slope on the last few points of each response; the averaged value was around 347 dB per decade. This is slightly steeper than expected, but since the filter's job is to cut off higher frequencies, it shouldn't affect the operation of the system, and it is close to the expected value.
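The slope can be estimated from any two points on the roll-off; a minimal sketch of the arithmetic (the frequency/magnitude values below are hypothetical placeholders, not the measured data):
awk 'BEGIN { f1 = 9.0e3; db1 = -6.0;    # first point on the roll-off (hypothetical)
             f2 = 1.1e4; db2 = -36.0;   # second point (hypothetical)
             # slope [dB/decade] = delta-dB / delta-log10(f)
             print (db2 - db1) / (log(f2/f1) / log(10)), "dB/decade" }'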
The ripples seen before the drop-off are also an effect of the elliptic filters and are visible in T2000188 as well.
Note: the transfer function that doesn't seem to match the others is the heartbeat timing signal. |
Attachment 1: data3_Plots.pdf
|
|
Attachment 2: data2_Plots.pdf
|
|
16415
|
Tue Oct 19 23:43:09 2021 |
Koji | Summary | CDS | c1sus2 DAC to ADC test |
(For a totally unrelated reason) I was checking the electronics units for the upgrade, and I realized that the electronics units at the test stand have not been properly powered.
I found that the AA/AI stack at the test stand (Attachment 1) has an unusual powering configuration (Attachment 2).
- Only the positive power supply was used / - The supply voltage is only +15V / - The GND reference is not connected anywhere.
For confirmation, I checked the voltage across the DC power strip (Attachments 3/4). The positive rail was at +5.3V and the negative at -9.4V; note the difference is ~14.7V, consistent with the single floating +15V supply. These readings are subject to change depending on the earth potential.
This is not a good condition at all. The asymmetric powering of the circuit may cause damage to the op-amps, so I turned off the switches of the units.
The power configuration should be immediately corrected.
- Use both the positive and negative supplies (two power supply channels) to produce the positive and negative voltage potentials. Connect the reference potential to the earth post of the power supply.
https://www.youtube.com/watch?v=9_6ecyf6K40 [Dual Power Supply Connection / Serial plus minus electronics laboratory PS with center tap]
- These units have DC power regulators which produce +/-15V out of +/-18V. So the DC power supplies should be set to 18V.
|
Attachment 1: P_20211019_224433.jpg
|
|
Attachment 2: P_20211019_224122.jpg
|
|
Attachment 3: P_20211019_224400.jpg
|
|
Attachment 4: P_20211019_224411.jpg
|
|
16417
|
Wed Oct 20 11:48:27 2021 |
Anchal | Summary | CDS | Power supple configured correctly. |
This was horrible! That's my bad; I should have checked the configuration before assuming it was right.
I fixed the power supply configuration. Now the strip has two rails at +/-18V and the GND is referenced to the power supply earth terminal.
Ian should redo the tests. |
16430
|
Tue Oct 26 18:24:00 2021 |
Ian MacMillan | Summary | CDS | c1sus2 DAC to ADC test |
[Ian, Anchal, Paco]
After Koji found the problem with the power source, Anchal and I fixed the power and then reran the measurement. The only change this time around is that I increased the excitation amplitude to 100. In the first run the excitation amplitude was 1, which seemed to come out noise-free but was too low to give a reliable value.
link to previous results
The new plots are attached. |
Attachment 1: data2_Plots.pdf
|
|
Attachment 2: data3_Plots.pdf
|
|
16495
|
Thu Dec 9 00:32:56 2021 |
Tega | Update | CDS | New SUS medm screen update |
The new SUS screen can be reached via sitemap -> IFO SUS button -> NEW ETMX dropdown menu link. Please use it and provide feedback. I am not sure whether we need/want the display screens after the IOP model on the right of the medm screen. I have not been able to locate the corresponding channels, but did not want to remove them until I was sure that we don't plan to add these features to our screens. When all bugs have been ironed out, we can use appropriate macro substitution for the other optics.
The next feature to add is the BLRMS to the coil and PD channels. I plan to combine the PEM BLRMS medm implementation with the sus_single_BLRMS model block (located in /opt/rtcds/userapps/release/cds/c1/models). This way we use the latest BLRMS block in "/opt/rtcds/userapps/release/cds/common/models/BLRMS.mdl" whilst also leveraging the previous work done on the sus_single_BLRMS model, which neatly fits into our current SUS model. |
Attachment 1: Screen_Shot_2021-12-09_at_12.29.30_AM.png
|
|
Attachment 2: Screen_Shot_2021-12-09_at_12.42.35_AM.png
|
|
16496
|
Thu Dec 9 18:22:36 2021 |
Tega | Update | CDS | New SUS medm screen update |
Work on the medm screen for SUS RMS monitor is ongoing. The next step would be to incorporate this into the SUS medm screen, add the BLRMS model to the SUS controller model, recompile, check that the channels are being correctly addressed, then load the appropriate bandpass and lowpass filters. |
Attachment 1: Screen_Shot_2021-12-09_at_6.21.09_PM.png
|
|
16500
|
Fri Dec 10 18:55:58 2021 |
Tega | Update | CDS | New SUS medm screen update |
It turns out the BLRMS monitoring channels for MC1, MC2, MC3, ITMY and SRM already exist in c1pem, so I modified the new SUS screen to display the BLRMS info for those optics. The next step is to add the BLRMS monitors for PRM, ITMX, ETMX and ETMY. This requires extending the number of inputs of the "SUS" block in c1pem to accommodate the additional inputs from the remaining optics. |
Attachment 1: BLRMS_ITMY_screenshot.png
|
|
16533
|
Wed Dec 22 17:40:22 2021 |
Anchal | Summary | CDS | c1su2 model updated with SUS damping blocks for 7 SOSs |
[Anchal, Koji]
I've updated the c1su2 model today with the suspension damping blocks for the 7 new SOSs (LO1, LO2, AS1, AS4, SR2, PR2 and PR3). The model is running properly now, but we had some difficulty getting it to run.
Initially, we were getting a 0x2000 error on the c1su2 model CDS screen. The issue was probably the high data rate required for all 7 SOSs in this model. Koji dug up a script /opt/rtcds/caltech/c1/userapps/trunk/cds/c1/scripts/activateDQ.py that has historically been used for updating the data rate of some of the DQ channels in the suspension block. However, this script was not working properly for Koji, so he created a new script at /opt/rtcds/caltech/c1/chans/daq/activateSUS2DQ.py.
[Ed by KA: I could not make this modified script run such that it replaces the input file (i.e. C1SU2.ini) in place. So the output file is named C1SU2.ini.NEW and has to be manually swapped in for the original file.]
With this, Koji was able to reduce the acquisition rate of SUSPOS_IN1_DQ, SUSPIT_IN1_DQ, SUSYAW_IN1_DQ, SUSSIDE_IN1_DQ, SENSOR_UL, SENSOR_UR, SENSOR_LL, SENSOR_LR, SENSOR_SIDE, OPLEV_PERROR, OPLEV_YERROR, and OPLEV_SUM to 2048 Sa/s. The script modifies /opt/rtcds/caltech/c1/chans/daq/C1SU2.ini, which would get re-written if the c1su2 model were remade and reinstalled. After this modification, the 0x2000 error stopped appearing and the model is running fine.
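A minimal shell sketch of what the rate reduction does (this is not the actual activateSUS2DQ.py; the "datarate" key name is assumed from the usual daqd .ini layout). Per KA's note above, it writes C1SU2.ini.NEW for inspection and manual replacement:
ini=/opt/rtcds/caltech/c1/chans/daq/C1SU2.ini
awk '/^\[/ { slow = ($0 ~ /SUSPOS_IN1|SUSPIT_IN1|SUSYAW_IN1|SUSSIDE_IN1|SENSOR_|OPLEV_/) }
     slow && /^datarate=/ { print "datarate=2048"; next }
     { print }' "$ini" > "${ini}.NEW"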
Should we change the library model part for sus_single_control.mdl
We notice that all our suspension models need to go through this weird python script that modifies auto-generated .ini files to reduce the data rate. Ideally, there is a simpler solution: simply add the datarate 2048 in the '#DAQ Channels' block in the model library part /cvs/cds/rtcds/userapps/trunk/sus/c1/models/lib/sus_single_control.mdl, which is the root model in all the suspensions. With this change, the .ini files would automatically be written with the correct datarate and there would be no need for the activateDQ script. But we couldn't find out why this simple solution was not implemented in the past, so we want to know if there is more going on here than we know. Changing the library model would obviously change every suspension model, and we don't want a broken CDS system on our heads at the beginning of the holidays, so we'll leave this delicate task for the near future. |
16537
|
Wed Dec 29 20:09:40 2021 |
rana | Summary | CDS | c1su2 model updated with SUS damping blocks for 7 SOSs |
We want to maintain the 16 kHz sample rate for the COIL DAQ channels, but nothing wrong with reducing the others.
I would suggest setting the DQ sample rates to 256 Hz for the SUS DAMP channels and 1024 Hz for the OPLEV channels (for noise diagnostics).
Maybe you can put these numbers into a new library part and we can have the best of all worlds?
Quote: |
Should we change the library model part for sus_single_control.mdl
We notice that all our suspension models need to go through this weird python script that modifies auto-generated .ini files to reduce the data rate. Ideally, there is a simpler solution: simply add the datarate 2048 in the '#DAQ Channels' block in the model library part /cvs/cds/rtcds/userapps/trunk/sus/c1/models/lib/sus_single_control.mdl, which is the root model in all the suspensions. With this change, the .ini files would automatically be written with the correct datarate and there would be no need for the activateDQ script. But we couldn't find out why this simple solution was not implemented in the past, so we want to know if there is more going on here than we know. Changing the library model would obviously change every suspension model, and we don't want a broken CDS system on our heads at the beginning of the holidays, so we'll leave this delicate task for the near future.
|
|
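If we do make such a library part, a hedged sketch of what its '#DAQ Channels' list might look like with these rates (the per-line "NAME rate" syntax and the automatic _DQ suffixing are assumptions based on the usual RCG convention; the COIL channels are simply left at the full model rate):
#DAQ Channels
SUSPOS_IN1 256
SUSPIT_IN1 256
SUSYAW_IN1 256
SUSSIDE_IN1 256
OPLEV_PERROR 1024
OPLEV_YERROR 1024
OPLEV_SUM 1024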
16546
|
Thu Jan 6 12:52:49 2022 |
Anchal | Update | CDS | Yearly DAQD fix 2022! |
Just as predicted, all the realtime models reported a "0x4000" error. Read the parent post for more details. I fixed this by following the instructions there: I added the following lines to the file /opt/rtcds/rtscore/release/src/include/drv/spectracomGPS.c on fb1:
/* 2020 had 366 days and no leap second */
pHardware->gpsOffset += 31622400;
/* 2021 had no leap seconds or leap days, so adjust for that */
pHardware->gpsOffset += 31536000;
Then I rebuilt the package and reloaded it after stopping the daqd services. This brought back all the fast models except the C1SUS2 ones, which are in red due to some other reason that I'll investigate further.
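A hedged sanity check after such a rebuild (it is an assumption that the gpstime module still exposes /proc/gps the way the old symmetricom driver did, and that tconvert from the LIGO tools is installed):
cat /proc/gps     # kernel GPS seconds
tconvert now      # reference GPS time; the two should agree to within a second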
|
16547
|
Thu Jan 6 13:54:28 2022 |
Koji | Update | CDS | Yearly DAQD fix 2022! |
Just restarting all the c1sus2 models fixed the issue. (Attachment 1)
SUS2 ADC1 CH21 is saturated. I'm not yet sure whether this is an electronics issue or an ADC issue.
SUS2 ADC1 CH10 also has a large offset. This should also be investigated. |
Attachment 1: Screenshot_2022-01-06_13-57-40.png
|
|
16548
|
Thu Jan 6 14:08:14 2022 |
Koji | Update | CDS | More BHD SUS screens added to sitemap |
More BHD SUS screens added to sitemap (Attachment 1) |
Attachment 1: Screenshot_2022-01-06_14-06-15.png
|
|
16553
|
Thu Jan 6 22:18:47 2022 |
Koji | Update | CDS | SUS screen debugging |
Indicated by the red arrow:
Even when the side damping servo is off, the number appears at the input of the output matrix
Indicated by the green arrows:
The face magnets and the side magnets use different ADCs. How about opening a custom ADC panel that accommodates all ADCs at once? Same for the DAC.
Indicated by the blue arrows:
This button opens a custom FM window. When the pitch gain was modified with a ramping time, the pitch and yaw gains grew at the same time, even though only the pitch gain was modified.
Indicated by the orange circle:
The numbers are not indicated here, but they are input-related numbers (for watchdogging) rather than output-related numbers. It is confusing to place them here. |
Attachment 1: Screen_Shot_2022-01-06_at_18.03.24.png
|
|
16570
|
Tue Jan 11 10:46:07 2022 |
Tega | Update | CDS | SUS screen debugging |
Seen. Thanks.
Red Arrow: The channel was labeled incorrectly as INMON instead of OUTPUT
Green Arrow: OK, I will create a custom medm screen for this.
Blue arrow: Hmm, OK I will look into this. Doing this work remotely is a pain as the medm response is quite slow for poking around.
Orange circle: OK, I'll move this to the left side of the line.
Note to self: I also noticed another error on the side (LPYS blue box just before the sum). The channel points to YAW instead of SIDE, so it needs to be fixed as well.
Quote: |
Indicated by the red arrow:
Even when the side damping servo is off, the number appears at the input of the output matrix
Indicated by the green arrows:
The face magnets and the side magnets use different ADCs. How about opening a custom ADC panel that accommodates all ADCs at once? Same for the DAC.
Indicated by the blue arrows:
This button opens a custom FM window. When the pitch gain was modified with a ramping time, the pitch and yaw gains grew at the same time, even though only the pitch gain was modified.
Indicated by the orange circle:
The numbers are not indicated here, but they are input-related numbers (for watchdogging) rather than output-related numbers. It is confusing to place them here.
|
|
16611
|
Fri Jan 21 12:46:31 2022 |
Tega | Update | CDS | SUS screen debugging |
All done (almost)! I still have not sorted out the issue of the pitch and yaw gains growing together when modified with a ramping time. An image of the custom ADC and DAC panels is attached.
Quote: |
Seen. Thanks.
Quote: |
Indicated by the red arrow:
Even when the side damping servo is off, the number appears at the input of the output matrix
Indicated by the green arrows:
The face magnets and the side magnets use different ADCs. How about opening a custom ADC panel that accommodates all ADCs at once? Same for the DAC.
Indicated by the blue arrows:
This button opens a custom FM window. When the pitch gain was modified with a ramping time, the pitch and yaw gains grew at the same time, even though only the pitch gain was modified.
Indicated by the orange circle:
The numbers are not indicated here, but they are input-related numbers (for watchdogging) rather than output-related numbers. It is confusing to place them here.
|
|
|
Attachment 1: Custom_ADC_DAC_monitors.png
|
|
16662
|
Thu Feb 10 21:16:27 2022 |
Koji | Summary | CDS | chiara resolv.conf weirdo |
During the videomux debugging, I noticed that hostname resolution on chiara didn't behave well. Basically I could not log in to anything from chiara using hostnames.
I found that there was no /etc/resolv.conf. Instead, there is /etc/resolvconf directory.
According to my research, the live resolv.conf is placed at /run/resolvconf/resolv.conf:
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
# DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 192.168.113.20
nameserver 131.215.125.1
nameserver 8.8.8.8
This 113.20 points at the old "linux1" machine, which is long obsolete. If I modify this file as follows,
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
# DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 192.168.113.104
nameserver 131.215.125.1
nameserver 8.8.8.8
search martian
then the name resolution becomes reasonable. However, during rebooting / service restarts / etc., the resolvconf -u command is executed and /run/resolvconf/resolv.conf is overwritten, as indicated in the file.
I have modified /etc/resolvconf/resolv.conf.d/base to include 192.168.113.104 and search martian. The latter was included, but the former did not show up.
Finally I figured out that, after resolv.conf is constructed from the base and head files in /etc/resolvconf/resolv.conf.d/, NetworkManager overrides the nameserver addresses.
The configuration was found in /etc/NetworkManager/system-connections/Wired\ connection\ 1 .
Here is the modified setting (dns entry was modified)
>sudo cat /etc/NetworkManager/system-connections/Wired\ connection\ 1
[sudo] password for controls:
[802-3-ethernet]
duplex=full
mac-address=68:05:CA:36:4E:B4
[connection]
id=Wired connection 1
uuid=ed177e70-d10e-42be-8165-3bf59f8f199d
type=802-3-ethernet
timestamp=1438810765
[ipv6]
method=auto
[ipv4]
method=manual
dns=192.168.113.104;131.215.125.1;8.8.8.8;
addresses1=192.168.113.104;24;192.168.113.2;
And
>cat /etc/resolvconf/resolv.conf.d/base
search martian
# See Also /etc/NetworkManager/system-connections/Wired\ connection\ 1
So complicated...
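A quick check that the override survives a resolvconf regeneration (fb1 used as the example host):
sudo resolvconf -u                  # regenerate the live resolv.conf
cat /run/resolvconf/resolv.conf     # 192.168.113.104 should now be listed first
getent hosts fb1                    # should resolve via the martian DNS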
|
16663
|
Thu Feb 10 21:51:02 2022 |
Koji | Update | CDS | [Solved] Huge random numbers flowing into ETMX/ETMY ASC PIT/YAW |
Huge random numbers were flowing into ETMX/ETMY ASC PIT/YAW. Because of this, I could not damp the ETMX/ETMY suspensions at the start of the recovery from rebooting (Attachment 1).
After turning off the outputs of the ASC filters, the mirrors were successfully damped.
Looking at the FE model view of the end RTSs, there were two possibilities: (Attachment 2)
- They are coming from the RFM connection
- They are coming from ASX/ASY
ASX/ASY are not active and I could not see anything producing these numbers. Burtrestore didn't help.
So the remaining possibility was something on the other side of the RFM, or corruption of the RFM signal.
- Looking at the RFM model (Attachment 3), the ASC signals come from ASS and IOO. The ASS path has the filter module (C1:RFM-ETMX_PIT etc.). This FM is quiet and not guilty.
- Why do we have RFM signals from IOO? I went to IOO and found that the new ASC (WFS) model is there. I hadn't realized this model was present. In fact, the ASC screen showed that these random numbers were flowing into the end SUSs.
So I did a burtrestore of c1iooepics. Alas! They are gone.
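For the record, the restore amounts to something like this (burtwb is the standard BURT restore tool; the snapshot path here is illustrative, not the exact file used):
burtwb -f /opt/rtcds/caltech/c1/burt/autoburt/latest/c1iooepics.snap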
Now I can go home. |
Attachment 1: Screenshot_2022-02-10_21-46-02.png
|
|
Attachment 2: Screen_Shot_2022-02-10_at_21.54.21.png
|
|
Attachment 3: Screen_Shot_2022-02-10_at_22.14.23.png
|
|
16664
|
Fri Feb 11 10:56:38 2022 |
Anchal | Update | CDS | [Solved] Huge random numbers flowing into ETMX/ETMY ASC PIT/YAW |
Yeah, this is actually a known issue. We go to the ASC screen and manually switch off all the outputs after every reboot. We haven't been able to find a way to set a default so that when the model comes online these outputs remain switched off. We should find a way to do this.
|
16666
|
Fri Feb 11 12:22:19 2022 |
rana | Update | CDS | [Solved] Huge random numbers flowing into ETMX/ETMY ASC PIT/YAW |
You can hand-edit the autoBurt file which the FE uses to set the values after boot-up. Just make a python script that amends all of the OFF or ZERO values that are needed to make things safe. This would be the autoBurt snap used at boot-up only, not the hourly snaps.
|
Yeah, this is actually a known issue. We go to the ASC screen and manually switch off all the outputs after every reboot. We haven't been able to find a way to set a default so that when the model comes online these outputs remain switched off. We should find a way to do this.
|
|
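A minimal sketch of the idea, in shell rather than python (the boot-time snap location and the "<channel> <count> <value>" line layout of BURT snapshots are assumptions here):
snap=/opt/rtcds/caltech/c1/target/c1ioo/c1iooepics/burt/safe.snap
for chan in C1:ASC-ETMX_PIT_GAIN C1:ASC-ETMX_YAW_GAIN C1:ASC-ETMY_PIT_GAIN C1:ASC-ETMY_YAW_GAIN; do
  sudo sed -i -E "s/^($chan [0-9]+) .*/\1 0.0/" "$snap"   # force the gain to zero at boot
done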
16668
|
Fri Feb 11 17:07:19 2022 |
Anchal | Update | CDS | [Solved] Huge random numbers flowing into ETMX/ETMY ASC PIT/YAW |
The autoBurt file for the FE already has C1:ASC-ETMX_PIT_SW2 present (and the other channels for ETMY, ITMX, ITMY, BS and for YAW), and I checked the last snapshot file from Feb 7th, 2022, which has 0 for these channels. So I'm not sure why the FE does not follow the switch configuration when it boots up. For safety, I changed all the gains of these filter modules, named like C1:ASC-XXXX_YYY_GAIN (where XXXX is ETMX, ETMY, ITMX, ITMY, or BS, and YYY is PIT or YAW), to 0.0. Now, even if the FE loads with the switches in the ON configuration, nothing should happen. In the future, if we use this model for anything, we can change the gain values; that won't be hard to track down as the reason no signal moves forward. Note: the BS connections from this model to the BS suspension model do not work.
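For reference, the change amounts to the following (the channel-name pattern is from above; the loop is just shorthand):
for opt in ETMX ETMY ITMX ITMY BS; do
  for dof in PIT YAW; do
    caput C1:ASC-${opt}_${dof}_GAIN 0.0
  done
done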
Quote: |
You can hand-edit the autoBurt file which the FE uses to set the values after boot-up. Just make a python script that amends all of the OFF or ZERO values that are needed to make things safe. This would be the autoBurt snap used at boot-up only, not the hourly snaps.
|
Yeah, this is actually a known issue. We go to the ASC screen and manually switch off all the outputs after every reboot. We haven't been able to find a way to set a default so that when the model comes online these outputs remain switched off. We should find a way to do this.
|
|
|
16697
|
Thu Mar 3 15:37:40 2022 |
Anchal | Summary | CDS | c1teststand restructured |
c1teststand has been restructured. There is no gateway computer called 'c1teststand' anymore. When you ssh into the c1teststand network using ssh c1teststand from inside martian, or from an outside network using the method mentioned in this wiki page, you will land on the chiara (clone) computer, from which you can navigate to any computer on the teststand network.
I'll now be repurposing the 1U c1teststand computer into the new c1susaux2 slow machine. All files from the home directory and the /etc directory of the former c1teststand have been zipped and stored in /home/controls of chiara (clone). Just as an aside, the network configuration of the teststand can be done from inside the teststand network by opening a browser on either fb1 (clone) or chiara (clone) and going to the address 10.0.1.1. The login and password are the same as our usual workstation username and password. |
16700
|
Fri Mar 4 11:04:34 2022 |
Anchal | Summary | CDS | c1susaux2 system setup and running |
I took the c1teststand computer from the teststand and converted it into c1susaux2. To do so, I installed a fresh copy of Debian 10 on it and followed the steps on this wiki page, with some parts done slightly differently. The directory /cvs/cds/caltech/c1susaux2 is a repository and also contains the service unit file modbusIOC.service. A symbolic link is created at /etc/systemd/system so that this file defines the modbusIOC service. All db files are generated by parsing the acromag chassis wiring file using this python script.
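A sketch of that service hookup (the paths and unit name are from this entry; the exact systemctl sequence is an assumption):
sudo ln -s /cvs/cds/caltech/c1susaux2/modbusIOC.service /etc/systemd/system/modbusIOC.service
sudo systemctl daemon-reload
sudo systemctl enable --now modbusIOC.service
systemctl status modbusIOC.service   # should report active (running) with no errors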
The service file is running without any errors now and all channels are available. The leftmost bench in the EE shop at the 40m is now ready for LO1 slow-controls and monitor testing. If someone gets time today, they can hook up an unused coil driver to the chassis and verify ENABLE switching and monitoring through the optical isolators. We can also drive some voltage on the PD monitors and verify the functioning of our ADCs. Once this test passes, it is straightforward to finish wiring the remaining 6 SOSs and we will be good to install the chassis.
Attaching the wiring diagram of the c1susaux2 Acromag chassis. Any comments or modification suggestions should come soon, as we'll go ahead and wire it shortly.
Note: While accessing channels using caget on c1susaux2, you might get the warning "Identical process variable names on multiple servers". You can safely ignore it. It just means that the channel is accessible on that computer via two different network interfaces (martian network eno1 and Acromag subnetwork eno2) and it will just pick one of them. |
Attachment 1: 40mBHD_C1SUSAUX2_Acromag_Chassis.pdf
|
|
16702
|
Sat Mar 5 01:58:49 2022 |
Koji | Summary | CDS | paola rescue |
The ETMY end ThinkPad "paola" could not reboot due to a "Fan Error"; it seemed to be a failure of the CPU fan. Since I really needed a functional laptop at the end station for the electronics work, I decided to open the chassis. After removing the marked screws in the bottom lid, the keyboard could be lifted. I found that the CPU fan was stuck because of accumulated dust. Once the fan was cleaned, the laptop started up as before. |
Attachment 1: PXL_20220305_035255834.jpg
|
|
Attachment 2: PXL_20220305_034649120.MP.jpg
|
|
16712
|
Mon Mar 7 19:38:47 2022 |
Anchal | Summary | CDS | c1susaux2 slow controls issues |
I tried to perform a simple coil-enabling test using the c1susaux2 modbus channels, but failed. I am able to enable the coils using the Windows GUI of the Acromag card, but I cannot do it when the cards are connected to the computer subnetwork. The issue is two-fold:
- The enable channels such as C1:SUS-LO1_UL_ENABLE do not change value when their DOL changes value. In this case, I created a calc channel C1:SUS-LO1_ALL_CALC which takes the AND of all the coils' individual CALC channels (the ones normally used as the DOL for the ENABLE channels). But even though changes are reflected properly in C1:SUS-LO1_ALL_CALC, they do not affect C1:SUS-LO1_UL_ENABLE. See the db files here for more info.
- I tried to directly change the value of C1:SUS-LO1_UL_ENABLE using caput; even though the soft value of the channel changes, the change does not propagate to the output of the Acromag card. So my suspicion is that something is off in the settings of the Acromag card or the c1susaux2.cmd file. I followed the wiki page instructions, but if anyone can find an error, that would be useful.
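The failure mode, condensed into a minimal check sequence (channel names as above):
caget C1:SUS-LO1_ALL_CALC        # the DOL source updates as expected
caget C1:SUS-LO1_UL_ENABLE       # ...but the ENABLE record does not follow
caput C1:SUS-LO1_UL_ENABLE 1     # forcing it changes the soft value...
caget C1:SUS-LO1_UL_ENABLEMon    # ...but the Acromag output never toggles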
There's also an issue in reading back the ENABLE_MON channels. Here we suspect that one of the optical isolator boxes we have been using might have a short on one of its output channels. I'll investigate this more tomorrow. Again, the issue is two-fold: the EPICS channel values do not really change, so there is clearly some issue communicating with the Acromag cards. |
16724
|
Mon Mar 14 12:20:05 2022 |
Anchal | Summary | CDS | c1susaux2 slow controls acromag chassis installed |
[Anchal, Yehonathan, Ian]
We installed the c1susaux2 Acromag chassis in 1Y0 along with the c1susaux2 computer. We connected the PD monitors, binary inputs, binary outputs, and Run/Acquire RTS signals for 6 of the 7 suspensions; we ran out of DB9 cables to connect PR3. Of the ones that were connected, LO2, AS1, AS4, SR2, and PR2 show no issues in functionality from the chassis. For LO1, everything is working except for the UR EnableMon channel: the enable monitor does not show an ON state for the coil even though the coil driver chassis shows, via its LED lights, that it is ON. A possible reason is that a wire got disconnected when we closed the chassis (there are a lot of wires pushing against each other). Another possibility is that optical isolator ISO10 developed a bad output on channel 2. The circuit was tested before closing the chassis, so it is not clear what went wrong after closing it.
PR2 is showing an issue unrelated to the Acromag chassis. As soon as we close the loop by enabling the coils, the watchdog trips because the loop is unstable. It is not clear what has changed for PR2, but someone should take a look at it.
For the issue with LO1, I suggest we keep a note that the C1:SUS-LO1_UR_ENABLEMon channel is faulty and not take its value seriously. We should diagnose and fix this issue once we have more reasons to disconnect the chassis and open it.
|
Attachment 1: BHD_WatchDogs.png
|
|
Attachment 2: 40mBHD_C1SUSAUX2_Acromag_Chassis.pdf
|
|