40m QIL Cryo_Lab CTN SUS_Lab TCS_Lab OMC_Lab CRIME_Lab FEA ENG_Labs OptContFac Mariner WBEEShop
  40m Log, Page 61 of 344  Not logged in ELOG logo
ID Date Author Type Categoryup Subject
  16225   Fri Jun 25 14:06:10 2021 JonUpdateCDSFront-End Assembly and Testing

Summary

Here is the final summary (from me) of where things stand with the new front-end systems. With Anchal and Ian's recent scripted loopback testing [16224], all the testing that can be performed in isolation with the hardware on hand has been completed. We currently have no indication of any problem with the new hardware. However, the high-frequency signal integrity and noise testing remains to be done.

I detail those tests and link some DTT templates for performing them below. We have not yet received the Myricom 10G network card being sent from LHO, which is required to complete the standalone DAQ network. Thus we do not have a working NDS server in the test stand, so cannot yet run any of the usual CDS tools such as Diaggui. Another option would be to just connect the new front-ends to the 40m Martian/DAQ networks and test them there.

Final Hardware Configuration

Due to the unavailablity of the 18-bit DACs that were expected from the sites, we elected to convert all the new 18-bit AO channels to 16-bit. I was able to locate four unused 16-bit DACs around the 40m [16185], with three of the four found to be working. I was also able to obtain three spare 16-bit DAC adapter boards from Todd Etzel. With the addition of the three working DACs, we ended up with just enough hardware to complete both systems.

The final configuration of each I/O chassis is as follows. The full setup is pictured in Attachment 1.

  C1BHD C1SUS2
Component Qty Installed Qty Installed
16-bit ADC 1 2
16-bit ADC adapter 1 2
16-bit DAC 1 3
16-bit DAC adapter 1 3
16-channel BIO 1 1
32-channel BO 0 6

This hardware provides the following breakdown of channels available to user models:

  C1BHD C1SUS2
Channel Type Channel Count Channel Count
16-bit AI* 31 63
16-bit AO 16 48
BO 0 192

*The last channel of the first ADC is reserved for timing diagnostics.

The chassis have been closed up and their permanent signal cabling installed. They do not need to be reopened, unless future testing finds a problem.

RCG Model Configuration

An IOP model has been created for each system reflecting its final hardware configuration. The IOP models are permanent and system-specific. When ready to install the new systems, the IOP models should be copied to the 40m network drive and installed following the RCG-compilation procedure in [15979]. Each system also has one temporary user model which was set up for testing purposes. These user models will be replaced with the actual SUS, OMC, and BHD models when the new systems are installed.

The current RCG models and the action to take with each one are listed below:

Model Name Host CPU DCUID Path (all paths local to chiara clone machine) Action
c1x06 c1bhd 1 23 /opt/rtcds/userapps/release/cds/c1/models/c1x06.mdl Copy to same location on 40m network drive; compile and install
c1x07 c1sus2 1 24 /opt/rtcds/userapps/release/cds/c1/models/c1x07.mdl Copy to same location on 40m network drive; compile and install
c1bhd c1bhd 2 25 /opt/rtcds/userapps/release/isc/c1/models/c1bhd.mdl Do not copy; replace with permanent OMC/BHD model(s)
c1su2 c1su2 2 26 /opt/rtcds/userapps/release/sus/c1/models/c1su2.mdl Do not copy; replace with permanent SUS model(s)

Each front-end can support up to four user models.

Future Signal-Integrity Testing

Recently, the CDS group has released a well-documented procedure for testing General Standards ADC and DACs: T2000188. They've also automated the tests using a related set of shell scripts (T2000203). Unfortnately I don't believe these scripts will work at the 40m, as they require the latest v4.x RCG.

However, there is an accompanying set of DTT templates that could be very useful for accelerating the testing. They are available from the LIGO SVN (log in with username: "first.last@LIGO.ORG"). I believe these can be used almost directly, with only minor updates to channel names, etc. There are two classes of DTT-templated tests:

  1. DAC -> ADC loopback transfer functions
  2. Voltage noise floor PSD measurements of individual cards

The T2000188 document contains images of normal/passing DTT measurements, as well as known abnormalities and failure modes. More sophisticated tests could also be configured, using these templates as a guiding example.

Hardware Reordering

Due to the unexpected change from 18- to 16-bit AO, we are now short on several pieces of hardware:

  • 16-bit AI chassis. We originally ordered five of these chassis, and all are obligated as replacements within the existing system. Four of them are now (temporarily) in use in the front-end test stand. Thus four of the new 18-bit AI chassis will need to be retrofitted with 16-bit hardware.
  • 16-bit DACs. We currently have exactly enough DACs. I have requested a quote from General Standards for two additional units to have as spares.
  • 16-bit DAC adapters. I have asked Todd Etzel for two additional adapter boards to also have as spares. If no more are available, a few more should be fabricated.
  16230   Wed Jun 30 14:09:26 2021 Ian MacMillanUpdateCDSSUS simPlant model

I have looked at my code from the previous plot of the transfer function and realized that there is a slight error that must be fixed before we can analyze the difference between the theoretical transfer function and the measured transfer function.

The theoretical transfer function, which was generated from Photon has approximately 1000 data points while the measured one has about 120. There are no points between the two datasets that have the same frequency values, so they are not directly comparable. In order to compare them I must infer the data between the points. In the previous post [16195] I expanded the measured dataset. In other words: I filled in the space between points linearly so that I could compare the two data sets. Using this code:

#make values for the comparison
tck_mag = splrep(tst_f, tst_mag) # get bspline representation given (x,y) values
gen_mag = splev(sim_f, tck_mag) # generate intermediate values
dif_mag=[]
for x in range(len(gen_mag)):
    dif_mag.append(gen_mag[x]-sim_mag[x]) # measured minus predicted

tck_ph = splrep(tst_f, tst_ph) # get bspline representation given (x,y) values
gen_ph = splev(sim_f, tck_ph) # generate intermediate values
dif_ph=[]
for x in range(len(gen_ph)):
    dif_ph.append(gen_ph[x]-sim_ph[x])

At points like a sharp peak where the measured data set was sparse compared to the peak, the difference would see the difference between the intermediate “measured” values and the theoretical ones, which would make the difference much higher than it really was.

To fix this I changed the code to generate the intermediate values for the theoretical data set. Using the code here:

tck_mag = splrep(sim_f, sim_mag) # get bspline representation given (x,y) values
gen_mag = splev(tst_f, tck_mag) # generate intermediate values
dif_mag=[]
for x in range(len(tst_mag)):
    dif_mag.append(tst_mag[x]-gen_mag[x])#measured minus predicted

tck_ph = splrep(sim_f, sim_ph) # get bspline representation given (x,y) values
gen_ph = splev(tst_f, tck_ph) # generate intermediate values
dif_ph=[]
for x in range(len(tst_ph)):
    dif_ph.append(tst_ph[x]-gen_ph[x])

Because this dataset has far more values (about 10 times more) the previous problem is not such an issue. In addition, there is never an inferred measured value used. That makes it more representative of the true accuracy of the real transfer function.

This is an update to a previous plot, so I am still using the same data just changing the way it is coded. This plot/data does not have a Q of 1000. That plot will be in a later post along with the error estimation that we talked about in this week’s meeting.

The new plot is shown below in attachment 1. Data and code are contained in attachment 2

  16243   Fri Jul 9 18:35:32 2021 YehonathanUpdateCDSOpto-isolator for c1auxey

Following Koji's channel list review, we made changes to the wiring spreadsheet.

Today, I made the changes real in the Acromag chassis. I went through the channel list one by one and made sure it is wired correctly. Additionally, since we now need all the channels the existing isolators have, I replaced the isolator with the defective channel with a new one.

The things to do next:

1. Create entries for the spare coil driver and satellite box channels in the EPICs DB.

2. Test the spare channels.

  16244   Mon Jul 12 18:06:25 2021 YehonathanUpdateCDSOpto-isolator for c1auxey

I edited /cvs/cds/caltech/target/c1auxey1/ETMYaux.db (after creating a backup) and added the spare coil driver channels.

I tested those channels using caget while fixing wiring issues. The tests were all succesful. The digital output channel were tested using the Windows machine since they are locked by some EPICs mechanism I don't yet understand.

One worrying point is I found that the differential analog inputs to be unstable unless I connected a reference to some stable voltage source unlike previous tests showed. It was unstable (but less) even when I connected the ref to the ground connectors on the power supplies on the workbench. This is really puzzling.

When I say unstable I mean that most of the time the voltage reading shows the right value, but occasionly there is a transient sharp volage drop of the order of 0.5V. I will do a more quantitative analysis tomorrow.

 

  16263   Wed Jul 28 12:47:52 2021 YehonathanUpdateCDSOpto-isolator for c1auxey

To simulate a differential output I used two power supplies connected in series. The outer connectors were used as the outputs and the common connector was connected to the ground and used as a reference. I hooked these outputs to one of the differential analog channels and measured it over time using Striptool. The setup is shown in attachment 3.

I tested two cases: With reference disconnected (attachment 1), and connected (attachment 2). Clearly, the non-referred case is way too noisy.

  16276   Wed Aug 11 12:06:40 2021 YehonathanUpdateCDSOpto-isolator for c1auxey

I redid the differential input experiment using the DS360 function generator we recently got. I generated a low frequency (0.1Hz) sine wave signal with an amplitude 0.5V and connected the + and - output to a differential input on the new c1auxcey Acromag chassis. I recorded a time series of the corresponding EPICS channel with and without the common on the DS360 connected to the Ref connector on the Acromag unit. The common connector on the DS360 is not normally grounded (there is a few tens of kohms between the ground and common connectors). The attachment shows that, indeed, the analog input readout is extremely noisy with the Ref being disconnected. The point where the Ref was connected to common is marked in the picture.

Conclusion: Ref connector on the analog input Acromag units must be connected to some stable voltage source for normal operation.

  16280   Mon Aug 16 23:30:34 2021 PacoUpdateCDSAS WFS commissioning; restarting models

[koji, ian, tega, paco]

With the remote/local assistance of Tega/Ian last friday I made changes on the c1sus model by connecting the C1:ASC model outputs (found within a block in c1ioo) to the BS and PRM suspension inputs (pitch and yaw). Then, Koji reviewed these changes today and made me notice that no changes are actually needed since the blocks were already in place, connected in the right ports, but the model probably just wasn't rebuilt...

So, today we ran "rtcds make", "rtcds install" on the c1ioo and c1sus models (in that order) but the whole system crashed. We spent a great deal of time restarting the machines and their processes but we struggled quite a lot with setting up the right dates to match the GPS times. What seemed to work in the end was to follow the format of the date in the fb1 machine and try to match the timing to the sub-second level. This is especially tricky when performed by a human action so the whole task is tedious. We anyways completed the reboot for almost all the models except the c1oaf (which tends to make things crashy) since we won't need it right away for the tasks ahead. One potential annoying issue we found was in manually rebooting c1iscey because one of its network ports is loose (the ethernet cable won't click in place) and it appears to use this link to boot (!!) so for a while this machine just wasn't coming back up.

Finally, as we restored the suspension controls and reopened the shutters, we noticed a great deal of misalignment to the point no reflected beam was coming back to the RFPD table. So we spent some time verifying the PRM alignment and TT1 and TT2 (tip tilts) and it turned out to be mostly the latter pair that were responsible for it. We used the green beams to help optimize the XARM and YARM transmissions and were able to relock the arms. We ran ASS on them, and then aligned the PRM OpLevs which also seemed off. This was done by giving a pitch offset to the input PRM oplev beam path and then correcting for it downstream (before the qpd). We also adjusted the BS OpLev in the end.


Summary; the ASC BS and PRM outputs are now built into the SUS models. Let the AS WFS loops be closed soon!


Addenda by KA
- Upon the RTS restarting,

  • Date/Time adjustment
    sudo date --set='xxxxxx'
  • If the time on the CDS status medm screen for each IOP match with the FB local time, we ran
    rtcds start c1x01
    (or c1x02, etc)
  • Every time we restart the IOPs, fb was restarted by
    telnet fb1 8083
    > shutdown

    and restarted mx_stream from the CDS screen because these actions change the "DC" status.

- Today we once succeeded to restart the vertex machines. However, the RFM signal transmission did fail. So the end two machines were power cycled as well as c1rfm, but this made all the machines in RED again. Hell...

- We checked the PRM oplev. The spot was around the center but was clipped. This made us so confused. Our conclusion was that the oplev was like that before the RTS reboot.

  16283   Thu Aug 19 03:23:00 2021 AnchalUpdateCDSTime synchornization not running

I tried to read a bit and understand the NTP synchronization implementation in FE computers. I'm quite sure that NTP synchronization should be 'yes' if timesyncd are running correctly in the output of timedatectl in these computers. As Koji reported in 15791, this is not the case. I logged into c1lsc, c1sus and c1ioo and saw that RTC has drifted from the software clocks too which does not happen if NTP synchronization was active. This would mean that almost certainly, if the computers are rebooted, the synchronization will be lost and the models will fail to come online.

My current findings are the following (this should be documented in wiki once we setup everything):

  • nodus is running a NTP server using chronyd. One can check the configuration of this NTP serer in /etc/chornyd.conf
  • fb1 is running an NTP server using ntpd that follows nodus and an IP address 131.215.239.14. This can be seen in /etc/ntp.conf.
  • There are no comments to describe what this other server (131.215.239.14) is. Does the GC network have an NTP server too?
  • c1lsc, c1sus and c1ioo all have systemd-timesyncd.service running with configuration file in /etc/systemd/timesyncd.conf.
  • The configuration file set Servers=ntpserver but echo $ntpserver produces nothing (blank) on these computers and I've been unable to find anyplace where ntpserver is defined.
  • In chiara (our name server), the name server file /etc/hosts does not have any entry for ntpserver either.
  • I think the problem might be that these computers are unable to find the ntpserver as it is not defined anywhere.

The solution to this issue could be as simple as just defining ntpserver in the name server list. But I'm not sure if my understanding of this issue is correct. Comments/suggestions are welcome for future steps.

 

  16284   Thu Aug 19 14:14:49 2021 KojiUpdateCDSTime synchornization not running

131.215.239.14 looks like Caltech's NTP server (ntp-02.caltech.edu)
https://webmagellan.com/explore/caltech.edu/28415b58-837f-4b46-a134-54f4b81bee53

I can't say it is correct or not as I did not make the survey at your level. I think you need a few tests of reconfiguring and restarting the NTP clients to see if time synchronization starts. Because the local time is not regulated right now anyway, this operation is safe I think.

 

  16285   Fri Aug 20 00:28:55 2021 AnchalUpdateCDSTime synchornization not running

I added ntpserver as a known host name for address 192.168.113.201 (fb1's address where ntp server is running) in the martian host list in the following files in Chiara:

/var/lib/bind/martian.hosts
/var/lib/bind/rev.113.168.192.in-addr.arpa

Note: a host name called ntp was already defined at 192.168.113.11 but I don't know what computer this is.

Then, I restarted the DNS on chiara by doing:

sudo service bind9 restart

Then I logged into c1lsc and c1ioo and ran following:

controls@c1ioo:~ 0$ sudo systemctl restart systemd-timesyncd.service

controls@c1ioo:~ 0$ sudo systemctl status systemd-timesyncd.service -l
● systemd-timesyncd.service - Network Time Synchronization
   Loaded: loaded (/lib/systemd/system/systemd-timesyncd.service; enabled)
   Active: active (running) since Fri 2021-08-20 07:24:03 UTC; 53s ago
     Docs: man:systemd-timesyncd.service(8)
 Main PID: 23965 (systemd-timesyn)
   Status: "Idle."
   CGroup: /system.slice/systemd-timesyncd.service
           └─23965 /lib/systemd/systemd-timesyncd

Aug 20 07:24:03 c1ioo systemd[1]: Starting Network Time Synchronization...
Aug 20 07:24:03 c1ioo systemd[1]: Started Network Time Synchronization.
Aug 20 07:24:03 c1ioo systemd-timesyncd[23965]: Using NTP server 192.168.113.201:123 (ntpserver).
Aug 20 07:24:35 c1ioo systemd-timesyncd[23965]: Using NTP server 192.168.113.201:123 (ntpserver).
controls@c1ioo:~ 0$ timedatectl
      Local time: Fri 2021-08-20 07:25:28 UTC
  Universal time: Fri 2021-08-20 07:25:28 UTC
        RTC time: Fri 2021-08-20 07:25:31
       Time zone: Etc/UTC (UTC, +0000)
     NTP enabled: yes
NTP synchronized: no
 RTC in local TZ: no
      DST active: n/a

The same output is shown in c1lsc too. The NTP synchronized flag in output of timedatectl command did not change to yes and the RTC is still 3 seconds ahead of the local clock.

Then I went to c1sus to see what was the status output before rstarting the timesyncd service. I got folloing output:

controls@c1sus:~ 0$ sudo systemctl status systemd-timesyncd.service -l
● systemd-timesyncd.service - Network Time Synchronization
   Loaded: loaded (/lib/systemd/system/systemd-timesyncd.service; enabled)
   Active: active (running) since Tue 2021-08-17 04:38:03 UTC; 3 days ago
     Docs: man:systemd-timesyncd.service(8)
 Main PID: 243 (systemd-timesyn)
   Status: "Idle."
   CGroup: /system.slice/systemd-timesyncd.service
           └─243 /lib/systemd/systemd-timesyncd

Aug 20 02:02:18 c1sus systemd-timesyncd[243]: Using NTP server 192.168.113.201:123 (ntpserver).
Aug 20 02:36:27 c1sus systemd-timesyncd[243]: Using NTP server 192.168.113.201:123 (ntpserver).
Aug 20 03:10:35 c1sus systemd-timesyncd[243]: Using NTP server 192.168.113.201:123 (ntpserver).
Aug 20 03:44:43 c1sus systemd-timesyncd[243]: Using NTP server 192.168.113.201:123 (ntpserver).
Aug 20 04:18:51 c1sus systemd-timesyncd[243]: Using NTP server 192.168.113.201:123 (ntpserver).
Aug 20 04:53:00 c1sus systemd-timesyncd[243]: Using NTP server 192.168.113.201:123 (ntpserver).
Aug 20 05:27:08 c1sus systemd-timesyncd[243]: Using NTP server 192.168.113.201:123 (ntpserver).
Aug 20 06:01:16 c1sus systemd-timesyncd[243]: Using NTP server 192.168.113.201:123 (ntpserver).
Aug 20 06:35:24 c1sus systemd-timesyncd[243]: Using NTP server 192.168.113.201:123 (ntpserver).
Aug 20 07:09:33 c1sus systemd-timesyncd[243]: Using NTP server 192.168.113.201:123 (ntpserver).

This actually shows that the service was able to find ntpserver correctly at 192.168.113.201 even before I changed the name server file in chiara. So I'm retracting the changes made to name server. They are probably not required.

The configuration files for timesynd.conf are read only even with sudo. I tried changing permissions but that did not work either. Maybe these files are not correctly configured. The man page of timesyncd  says to use field 'NTP' to give the ntp servers. Our files are using field 'Servers'. But since we are not getting any error message, I don't think this is the issue here.

I'll look more into this problem.

  16286   Fri Aug 20 06:24:18 2021 AnchalUpdateCDSTime synchornization not running

I read on some stack exchange that 'NTP synchornized' indicator turns 'yes' in the output of command timedatectl only when RTC clock has been adjusted at some point. I also read that timesyncd does not do the change if the time difference is too much, roughly more than 3 seconds.

So I logged into all FE machines and ran sudo hwclock -w to synchronize them all to the system clocks and then waited if the timesyncd does any correction on RTC. It did not. A few hours later, I found the RTC clocks drifitng again from the system clocks. So even if the timesynd service is running as it should, it si not performing time correction for whatever reason.

Maybe we should try to use some other service?

Quote:
 

The NTP synchronized flag in output of timedatectl command did not change to yes and the RTC is still 3 seconds ahead of the local clock.

 

  16289   Mon Aug 23 15:25:59 2021 Ian MacMillanUpdateCDSSUS simPlant model

I am adding a State-space block to the SimPlant cds model using the example Chris gave. I made a new folder in controls called SimPlantStateSpace. wI used the code below to make a state-space LTI model with a 1D pendulum then I converted it to a discrete system using c2d matlab function. Then I used these in the rtss.m file to create the state space code I need in the SimPlantStateSpace_1D_model.h file.

%sys_model.m

Q = 1000;
phi = 1/Q;
g = 9.806;
m = 0.24; % mass of pendulum
l = 0.248; %length of pendulum
w_0 = sqrt(g/l);

f=16000 %this is the frequency of the channel that will be used

A = [0 1; -w_0^2*(1+1/Q*1i) -w_0/Q]
B = [0; 1/m];
C = [1 0];
D = [0];
sys_dc = ss(A,B,C,D)

sys=c2d(sys_dc, 1/f)

This code outputs the discrete state space that is added to the header file attached.

  16294   Tue Aug 24 18:44:03 2021 KojiUpdateCDSFB is writing the frames with a year old date

Dan Kozak pointed out that the new frame files of the 40m has not been written in 2021 GPS time but 2020 GPS time.

Current GPS time is 1313890914 (or something like that), but the new files are written as C-R-1282268576-16.gwf

I don't know how this can happen but this may explain why we can't have the agreement between the FB gps time and the RTS gps time.

(dataviewer seems dependent on the FB GPS time and it indicates 2020 date. DTT/diaggui does not.)


This is the way to check the gpstime on fb1. It's apparently a year off.

controls@fb1:~ 0$ cat /proc/gps
1282269402.89

  16297   Wed Aug 25 11:48:48 2021 YehonathanUpdateCDSc1auxey assembly

After confirming that, indeed, leaving the RTN connection floating can cause reliability issues we decided to make these connections in the c1auxex analog input units.

According to Johannes' wiring scheme (excluding the anti-image and OPLEV since they are decommissioned), Acromag unit 1221b accepts analog inputs from two modules. All of these channels are single-ended according to their schematics.

One option is to use the Acromag ground and connect it to the RTNs of both 1221b and 1221c. Another is to connect the minus wire of one module, which is tied to the module's ground, to the RTN. We shouldn't tie the grounds of the different modules together by connecting them to the same RTN point.

We should take some OSEM spectra of the X end arm before and after this work to confirm we didn't produce more noise by doing so. Right now, it is impossible due to issues caused by the recent power surge.

Quote:

{Yehonathan, Jon}

We poked (looked in situ with a flashlight, not disturbing any connections) around c1auxex chassis to understand better what is the wiring scheme.

To our surprise, we found that nothing was connected to the RTNs of the analog input Acromag modules. From previous experience and the Acromag manual, there can't be any meaningful voltage measurement without it.

 

  16298   Wed Aug 25 17:31:30 2021 PacoUpdateCDSFB is writing the frames with a year old date

[paco, tega, koji]

After invaluable assistance from Jamie in fixing this yearly offset in the gps time reported by cat /proc/gps, we managed to restart the real time system correctly (while still manually synchronizing the front end machine times). After this, we recovered the mode cleaner and were able to lock the arms with not much fuss.

Nevertheless, tega and I noticed some weird noise in the C1:LSC-TRX_OUT which was not present in the YARM transmission, and that is present even in the absence of light (we unlocked the arms and just saw it on the ndscope as shown in Attachment #1). It seems to affect the XARM and in general the lock acquisition...

We took some quick spectrum with diaggui (Attachment #2) but it doesn't look normal; there seems to be broadband excess noise with a remarkable 1 kHz component. We will probably look into it in more detail.

  16299   Wed Aug 25 18:20:21 2021 JamieUpdateCDSGPS time on fb1 fixed, dadq writing correct frames again

I have no idea what happened to the GPS timing on fb1, but it seems like the issue was coincident with the power glitch on Monday.

As was noted by Koji above, the GPS time kernel interface was off by a year, which was causing the frame builder to write out files with the wrong names.  fb1 was using DAQD components from the advligorts 3.3 release, which used the old "symmetricom" kernel module for the GPS time.  This old module was also known to have issues with time offsets.  This issue is remniscent of previous timing issues with the DAQ on fb1.

I noted that a newer version of the advligorts, version 3.4, was available on debian jessie, the system running on fb1.  advligorts 3.4 includes a newer version of the GPS time module, renamed gpstime.  I checked with Jonathan Hanks that the interfaces did not change between 3.3 and 3.4, and 3.4 was mostly a bug fix and packaging release, so I decided to upgrade the DAQ to get the new components.  I therefore did the following

  • updated the archive info in /etc/apt/sources.list.d/cdssoft.list, and added the "jessie-restricted" archive which includes the mx packages: https://git.ligo.org/cds-packaging/docs/-/wikis/home

  • removed the symmetricom module from the kernel

    sudo rmmod symmetricom

  • upgraded the advligorts-daqd components (NOTE I did not upgrade the rest of the system, although there are outstanding security upgrades needed):

    sudo apt install advligorts-daqd advligorts-daqd-dc-mx

  • loaded the new gpstime module and checked that the GPS time was correct:

    sudo modprobe gpstime

  • restarted all the daqd processes

    sudo systemctl restart daqd_*

Everything came up fine at that point, and I checked that the correct frames were being written out.

  16300   Thu Aug 26 10:10:44 2021 PacoUpdateCDSFB is writing the frames with a year old date

[paco, ]

We went over the X end to check what was going on with the TRX signal. We spotted the ground terminal coming from the QPD is loosely touching the handle of one of the computers on the rack. When we detached it completely from the rack the noise was gone (attachment 1).

We taped this terminal so it doesn't touch anything accidently. We don't know if this is the best solution since it is probably needs a stable voltage reference. In the Y end those ground terminals are connected to the same point on the rack. The other ground terminals in the X end are just cut.

We also took the PSD of these channels (attachment 2). The noise seem to be gone but TRX is still a bit noisier than TRY. Maybe we should setup a proper ground for the X arm QPD?


We saw that the X end station ALS laser was off. We turned it on and also the crystal oven and reenabled the temperature controller. Green light immidiately appeared. We are now working to restore the ALS lock. After running XARM ASS we were unable to lock the green laser so we went to the XEND and moved the piezo X ALS alignment mirrors until we maximized the transmission in the right mode. We then locked the ALS beams on both arms successfully. It very well could be that the PZT offsets were reset by the power glitch. The XARM ALS still needs some tweaking, its level is ~ 25% of what it was before the power glitch.

  16302   Thu Aug 26 10:30:14 2021 JamieConfigurationCDSfront end time synchronization fixed?

I've been looking at why the front end NTP time synchronization did not seem to be working.  I think it might not have been working because the NTP server the front ends were point to, fb1, was not actually responding to synchronization requests.

I cleaned up some things on fb1 and the front ends, which I think unstuck things.

On fb1:

  • stopped/disabled the default client (systemd-timesyncd), and properly installed the full NTP server (ntp)
  • the ntp server package for debian jessie is old-style sysVinit, not systemd.  In order to make it more integrated I copied the auto-generated service file to /etc/systemd/system/ntp.service, and added and "[install]" section that specifies that it should be available during the default "multi-user.target".
  • "enabled" the new service to auto-start at boot ("sudo systemctl enable ntp.service") 
  • made sure ntp was configured to serve the front end network ('broadcast 192.168.123.255') and then restarted the server ("sudo systemctl restart ntp.service")

For the front ends:

  • on fb1 I chroot'd into the front-end diskless root (/diskless/root) and manually specifed that systemd-timesyncd should start on boot by creating a symlink to the timesyncd service in the multi-user.target directory:
$ sudo chroot /diskless/root
$ cd /etc/systemd/system/multi-user.target.wants
$ ln -s /lib/systemd/system/systemd-timesyncd.service
  • on the front end itself (c1iscex as a test) I did a "systemctl daemon-reload" to force it to reload the systemd config, and then restarted the client ("systemctl restart systemd-timesyncd")
  • checked the NTP synchronization with timedatectl:
controls@c1iscex:~ 0$ timedatectl 
      Local time: Thu 2021-08-26 11:35:10 PDT
  Universal time: Thu 2021-08-26 18:35:10 UTC
        RTC time: Thu 2021-08-26 18:35:10
       Time zone: America/Los_Angeles (PDT, -0700)
     NTP enabled: yes
NTP synchronized: yes
 RTC in local TZ: no
      DST active: yes
 Last DST change: DST began at
                  Sun 2021-03-14 01:59:59 PST
                  Sun 2021-03-14 03:00:00 PDT
 Next DST change: DST ends (the clock jumps one hour backwards) at
                  Sun 2021-11-07 01:59:59 PDT
                  Sun 2021-11-07 01:00:00 PST
controls@c1iscex:~ 0$ 

Note that it is now reporting "NTP enabled: yes" (the service is enabled to start at boot) and "NTP synchronized: yes" (synchronization is happening), neither of which it was reporting previously.  I also note that the systemd-timesyncd client service is now loaded and enabled, is no longer reporting that it is in an "Idle" state and is in fact reporting that it synchronized to the proper server, and it is logging updates:

controls@c1iscex:~ 0$ sudo systemctl status systemd-timesyncd
â— systemd-timesyncd.service - Network Time Synchronization
   Loaded: loaded (/lib/systemd/system/systemd-timesyncd.service; enabled)
   Active: active (running) since Thu 2021-08-26 10:20:11 PDT; 1h 22min ago
     Docs: man:systemd-timesyncd.service(8)
 Main PID: 2918 (systemd-timesyn)
   Status: "Using Time Server 192.168.113.201:123 (ntpserver)."
   CGroup: /system.slice/systemd-timesyncd.service
           â””─2918 /lib/systemd/systemd-timesyncd

Aug 26 10:20:11 c1iscex systemd[1]: Started Network Time Synchronization.
Aug 26 10:20:11 c1iscex systemd-timesyncd[2918]: Using NTP server 192.168.113.201:123 (ntpserver).
Aug 26 10:20:11 c1iscex systemd-timesyncd[2918]: interval/delta/delay/jitter/drift 64s/+0.000s/0.000s/0.000s/+26ppm
Aug 26 10:21:15 c1iscex systemd-timesyncd[2918]: interval/delta/delay/jitter/drift 128s/-0.000s/0.000s/0.000s/+25ppm
Aug 26 10:23:23 c1iscex systemd-timesyncd[2918]: interval/delta/delay/jitter/drift 256s/+0.001s/0.000s/0.000s/+26ppm
Aug 26 10:27:40 c1iscex systemd-timesyncd[2918]: interval/delta/delay/jitter/drift 512s/+0.003s/0.000s/0.001s/+29ppm
Aug 26 10:36:12 c1iscex systemd-timesyncd[2918]: interval/delta/delay/jitter/drift 1024s/+0.008s/0.000s/0.003s/+33ppm
Aug 26 10:53:16 c1iscex systemd-timesyncd[2918]: interval/delta/delay/jitter/drift 2048s/-0.026s/0.000s/0.010s/+27ppm
Aug 26 11:27:24 c1iscex systemd-timesyncd[2918]: interval/delta/delay/jitter/drift 2048s/+0.009s/0.000s/0.011s/+29ppm
controls@c1iscex:~ 0$ 

So I think this means everything is working.

I then went ahead and reloaded and restarted the timesyncd services on the rest of the front ends.

We still need to confirm that everything comes up properly the next time we have an opportunity to reboot fb1 and the front ends (or the opportunity is forced upon us).

There was speculation that the NTP clients on the front ends (systemd-timesyncd) would not work on a read-only filesystem, but this doesn't seem to be true.  You can't trust everything you read on the internet.

  16309   Thu Sep 2 19:47:38 2021 KojiUpdateCDSThis week's FB1 GPS Timing Issue Solved

After the reboot daqd_dc was not working, but manual starting of open-mx / mx services solved the issue.

sudo systemctl start open-mx.service
sudo systemctl start mx.service
sudo systemctl start daqd_*

 

  16310   Thu Sep 2 20:44:18 2021 KojiUpdateCDSChiara DHCP restarted

We had the issue of the RT machines rebooting. Once we hooked up the display on c1iscex, it turned out that the IP was not given at it's booting-up.

I went to chiara and confirmed that the DHCP service was not running

~>sudo service isc-dhcp-server status
[sudo] password for controls:
isc-dhcp-server stop/waiting

So the DHCP service was manually restarted

~>sudo service isc-dhcp-server start
isc-dhcp-server start/running, process 24502
~>sudo service isc-dhcp-server status
isc-dhcp-server start/running, process 24502

 

 

  16311   Thu Sep 2 20:47:19 2021 KojiUpdateCDSChiara DHCP restarted

[Paco, Tega, Koji]

Once chiara's DHCP is back, things got much more straight forward.
c1iscex and c1iscey were rebooted and the IOPs were launched without any hesitation.

Paco ran rebootC1LSC.sh and for the first time in this year we had the launch of the processes without any issue.

  16321   Mon Sep 13 14:32:25 2021 YehonathanUpdateCDSc1auxey assembly

So we agreed that the RTNs points on the c1auxex Acromag chassis should just be grounded to the local Acromag ground as it just needs a stable reference. Normally, the RTNs are not connected to any ground so there is should be no danger of forming ground loops by doing that. It is probably best to use the common wire from the 15V power supplies since it also powers the VME crate. I took the spectra of the ETMX OSEMs (attachment) for reference and proceeding with the grounding work.

 

  16325   Tue Sep 14 15:57:05 2021 jamieFrogsCDSfb1 /var full after reboot, caused all sorts of problems

/var on fb1 filled up today, which caused all sorts of CDS issues.  I found out about the problem by reading the logs of the services that were having trouble running, in which they complained about not being able to write to disk.  I looked at the filesystem status with 'df' and noticed that /var was full, which is where applications write temporary data, and will always cause problems if it's full.

I tracked the issue down to multiple multi-gigabyte log files: /var/log/messages and /var/log/messages.1.  They were full of lines like this one:

Aug 29 06:25:21 fb1 kernel: l called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl called cmd = 1gpstime iotcl ca

Seems like something related to the gpstime kernel module?

Anyway, I deleted the log files for now, which cleared up the space on /var.  Things should be back to normal now, until the logs fill up again...

  16327   Tue Sep 14 16:44:54 2021 jamieFrogsCDSfb1 /var full after reboot, caused all sorts of problems

Jonathan Hanks pointed me to this fix to the gpstime kernel module that was unfortunately put in after the 3.4 release that we're currently using:

https://git.ligo.org/cds/advligorts/-/commit/6f6d6e2eb1d3355d0cbfe9fe31ea3b59af1e7348

I hacked the source in place (/usr/src/gpstime-3.4/drv/gpstime/gpstime.c) to get the fix, and then rebuilt the kernel module with dkms :

sudo dkms uninstall gpstime/3.4
sudo dkms install gpstime/3.4

I then stopped daqd_dc, unloaded gpstime, reloaded it, restarted daqd_dc.  The messages are no longer showing up in /var/log/messages, so I think we're ok for the moment.

NOTE: the fix will be undone if we for some reason reinstall the advligorts-gpstime-dkms package.  There shouldn't be a need to do that, but we should be aware.  I'm discussing with Jonathan if we want to try to push out a new debian package to fix this issue...

  16330   Tue Sep 14 17:22:21 2021 AnchalUpdateCDSAdded temp sensor channels to DAQ list

[Tega, Paco, Anchal]

We attempted to reboot fb1 daqd today to get the new temperature sensor channels recording. However, the FE models got stuck, apparantely due to reasons explaine din 40m/16325. Jamie cleared the /var/logs in fb1 so that FE can reboot. We were able to reboot the FE machines after this work successfully and get the models running too. During the day, the FE machines were shut down manually and brought back on manually, a couple of times on the c1iscex machine. Only change in fb1 is in the /opt/rtcds/caltech/c1/chans/daq/C0EDCU.ini where the new channels were added, and some hacking was done by Jamie in gpstime module (See 40m/16327).

  16332   Wed Sep 15 11:27:50 2021 YehonathanUpdateCDSc1auxey assembly

{Yehonathan, Paco}

We turned off the ETMX watchdogs and OpLevs. We went to the X end and shut down the Acromag chassi. We labeled the chassi feedthroughs and disconnected all the cables from it.

We took it out and tied the common wire of the power supplies (the commons of the 20V and 15V power supplies were shorted so there is no difference which we connect) to the RTNs of the analog inputs.

The chassi was put back in place. All the cables were reconnected. Power turn on.

We rebooted c1auxex and the channels went back online. We turned on the watchdogs and watched the ETMX motion get damped. We turned on the OpLev. We waited until the beam position got centered on the ETMX.

Attachment shows a comparison between the OSEM spectra before and after the grounding work. Seems like there is no change.

We were able to lock the arms with no issues.

 

  16351   Tue Sep 21 11:09:34 2021 AnchalSummaryCDSXARM YARM UGF Servo and Oscillators added

I've updated the c1LSC simulink model to add the so-called UGF servos in the XARM and YARM single arm loops as well. These were earlier present in DARM, CARM, MICH and PRCL loops only. The UGF servo themselves serves a larger purpose but we won't be using that. What we have access to now is to add an oscillator in the single arm and get realtime demodulated signal before and after the addition of the oscillator. This would allow us to get the open loop transfer function and its uncertaintiy at particular frequencies (set by the oscillator) and would allow us to create a noise budget on the calibration error of these transfer functions.

 

The new model has been committed locally in the 40m/RTCDSmodels git repo. I do not have rights to push to the remote in git.ligo. The model builds, installs and starts correctly.

  16354   Wed Sep 22 12:40:04 2021 AnchalSummaryCDSXARM YARM UGF Servo and Oscillators shifted to OAF

To reduce burden on c1lsc, I've shifted the added UGF block to to c1oaf model. c1lsc had to be modified to allow addition of an oscillator in the XARm and YARM control loops and take out test points before and after the addition to c1oaf through shared memory IPC to do realtime demodulation in c1oaf model.

The new models built and installed successfully and I've been able to recover both single arm locks after restarting the computers.

 

  16365   Wed Sep 29 17:10:09 2021 AnchalSummaryCDSc1teststand problems summary

[anchal, ian]

We went and collected some information for the overlords to fix the c1teststand DAQ network issue.


  • from c1teststand, c1bhd and c1sus2 computers were not accessible through ssh. (No route to host). So we restarted both the computers (the I/O chassis were ON).
  • After the computers restarted, we were able to ssh into c1bhd and c1sus, ad we ran rtcds start c1x06 and rtcds start c1x07.
  • The first page in attachment shows the screenshot of GDS_TP screens of the IOP models after this step.
  • Then we started teh user models by running rtcds start c1bhd and rtcds start c1su2.
  • The second page shows the screenshot of GDS_TP screens. You can notice that DAQ status is red in all the screens and the DC statuses are blank.
  • So we checked if daqd_ services are running in the fb computer. They were not. So we started them all by sudo systemctl start daqd_*.
  • Third page shows the status of all services after this step. the daqd_dc.service remained at failed state.
  • open-mx_stream.service was not even loaded in fb. We started it by running sudo systemctl start open-mx_stream.service.
  • The fourth page shows the status of this service. It started without any errors.
  • However, when we went to check the status of mx_stream.service in c1bhd and c1sus2, they were not loaded and we we tried to start them, they showed failed state and kept trying to start every 3 seconds without success. (See page 5 and 6).
  • Finally, we also took a screenshot of timedatectl command output on the three computers fb, c1bhd, and c1sus2 to show that their times were not synced at all.
  • The ntp service is running on fb but it probably does not have access to any of the servers it is following.
  • The timesyncd on c1bhd and c1sus2 (FE machines) is also running but showing status 'Idle' which suggested they are unable to find the ntp signal from fb.
  • I believe this issue is similar to what jamie ficed in the fb1 on martian network in 40m/16302. Since the fb on c1teststand network was cloned before this fix, it might have this dysfunctional ntp as well.

We would try to get internet access to c1teststand soon. Meanwhile, someone with more experience and knowledge should look into this situation and try to fix it. We need to test the c1teststand within few weeks now.

  16367   Thu Sep 30 14:09:37 2021 AnchalSummaryCDSNew way to ssh into c1teststand

Late elog, original time Wed Sep 29 14:09:59 2021

We opened a new port (22220) in the router to the martian subnetwork which is forwarded to port 22 on c1teststand (192.168.113.245) allowing direct ssh access to c1teststand computer from the outside world using:

                                                                       

                                                                                    
 

Checkout this wiki page for unredadcted info.

  16372   Mon Oct 4 11:05:44 2021 AnchalSummaryCDSc1teststand problems summary

[Anchal, Paco]

We tried to fix the ntp synchronization in c1teststand today by repeating the steps listed in 40m/16302. Even though teh cloned fb1 now has the exact same package version, conf & service files, and status, the FE machines (c1bhd and c1sus2) fail to sync to the time. the timedatectl shows the same stauts 'Idle'. We also, dug bit deeper into the error messages of daq_dc on cloned fb1 and mx_stream on FE machines and have some error messages to report here.


Attempt on fixing the ntp

  • We copied the ntp package version 1:4.2.6 deb file from /var/cache/apt/archives/ntp_1%3a4.2.6.p5+dfsg-7+deb8u3_amd64.deb on the martian fb1 to the cloned fb1 and ran.
    controls@fb1:~ 0$ sudo dbpg -i ntp_1%3a4.2.6.p5+dfsg-7+deb8u3_amd64.deb
  • We got error messages about missing dependencies of libopts25 and libssl1.1. We downloaded oldoldstable jessie versions of these packages from here and here. We ensured that these versions are higher than the required versions for ntp. We installed them with:
    controls@fb1:~ 0$ sudo dbpg -i libopts25_5.18.12-3_amd64.deb 
    controls@fb1:~ 0$ sudo dbpg -i libssl1.1_1.1.0l-1~deb9u4_amd64.deb
  • Then we installed the ntp package as described above. It asked us if we want to keep the configuration file, we pressed Y.
  • However, we decided to make the configuration and service files exactly same as martian fb1 to make it same in cloned fb1. We copied /etc/ntp.conf and /etc/systemd/system/ntp.service files from martian fb1 to cloned fb1 in the same positions. Then we enabled ntp, reloaded the daemon, and restarted ntp service:
    controls@fb1:~ 0$ sudo systemctl enable ntp
    controls@fb1:~ 0$ sudo systemctl daemon-reload
    controls@fb1:~ 0$ sudo systemctl restart ntp
  • But ofcourse, since fb1 doesn't have internet access, we got some errors in status of the ntp.service:
    controls@fb1:~ 0$ sudo systemctl status ntp
    ● ntp.service - NTP daemon (custom service)
       Loaded: loaded (/etc/systemd/system/ntp.service; enabled)
       Active: active (running) since Mon 2021-10-04 17:12:58 UTC; 1h 15min ago
     Main PID: 26807 (code=exited, status=0/SUCCESS)
       CGroup: /system.slice/ntp.service
               ├─30408 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:107
               └─30525 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:107
    
    Oct 04 17:48:42 fb1 ntpd_intres[30525]: host name not found: 2.debian.pool.ntp.org
    Oct 04 17:48:52 fb1 ntpd_intres[30525]: host name not found: 3.debian.pool.ntp.org
    Oct 04 18:05:05 fb1 ntpd_intres[30525]: host name not found: 0.debian.pool.ntp.org
    Oct 04 18:05:15 fb1 ntpd_intres[30525]: host name not found: 1.debian.pool.ntp.org
    Oct 04 18:05:25 fb1 ntpd_intres[30525]: host name not found: 2.debian.pool.ntp.org
    Oct 04 18:05:35 fb1 ntpd_intres[30525]: host name not found: 3.debian.pool.ntp.org
    Oct 04 18:21:48 fb1 ntpd_intres[30525]: host name not found: 0.debian.pool.ntp.org
    Oct 04 18:21:58 fb1 ntpd_intres[30525]: host name not found: 1.debian.pool.ntp.org
    Oct 04 18:22:08 fb1 ntpd_intres[30525]: host name not found: 2.debian.pool.ntp.org
    Oct 04 18:22:18 fb1 ntpd_intres[30525]: host name not found: 3.debian.pool.ntp.org
  • But the ntpq command is giving the saem output as given by ntpq comman in martian fb1 (except for the source servers), that the broadcasting is happening in the same manner:
    controls@fb1:~ 0$ ntpq -p
         remote           refid      st t when poll reach   delay   offset  jitter
    ==============================================================================
     192.168.123.255 .BCST.          16 u    -   64    0    0.000    0.000   0.000
    
  • On the FE machines side though, the systemd-timesyncd are still unable to read the time signal from fb1 and show the status as idle:
    controls@c1bhd:~ 3$ timedatectl
          Local time: Mon 2021-10-04 18:34:38 UTC
      Universal time: Mon 2021-10-04 18:34:38 UTC
            RTC time: Mon 2021-10-04 18:34:38
           Time zone: Etc/UTC (UTC, +0000)
         NTP enabled: yes
    NTP synchronized: no
     RTC in local TZ: no
          DST active: n/a
    controls@c1bhd:~ 0$ systemctl status systemd-timesyncd -l
    ● systemd-timesyncd.service - Network Time Synchronization
       Loaded: loaded (/lib/systemd/system/systemd-timesyncd.service; enabled)
       Active: active (running) since Mon 2021-10-04 17:21:29 UTC; 1h 13min ago
         Docs: man:systemd-timesyncd.service(8)
     Main PID: 244 (systemd-timesyn)
       Status: "Idle."
       CGroup: /system.slice/systemd-timesyncd.service
               └─244 /lib/systemd/systemd-timesyncd
  • So the time synchronization is still not working. We expected the FE machined to just synchronize to fb1 even though it doesn't have any upstream ntp server to synchronize to. But that didn't happen.
  • I'm (Anchal) working on getting internet access to c1teststand computers.

Digging into mx_stream/daqd_dc errors:

  • We went and changed the Restart fileld in /etc/systemd/system/daqd_dc.service on cloned fb1 to 2. This allows the service to fail and stop restarting after two attempts. This allows us to see the real error message instead of the systemd error message that the service is restarting too often. We got following:
    controls@fb1:~ 3$ sudo systemctl status daqd_dc -l
    ● daqd_dc.service - Advanced LIGO RTS daqd data concentrator
       Loaded: loaded (/etc/systemd/system/daqd_dc.service; enabled)
       Active: failed (Result: exit-code) since Mon 2021-10-04 17:50:25 UTC; 22s ago
      Process: 715 ExecStart=/usr/bin/daqd_dc_mx -c /opt/rtcds/caltech/c1/target/daqd/daqdrc.dc (code=exited, status=1/FAILURE)
     Main PID: 715 (code=exited, status=1/FAILURE)
    
    Oct 04 17:50:24 fb1 systemd[1]: Started Advanced LIGO RTS daqd data concentrator.
    Oct 04 17:50:25 fb1 daqd_dc_mx[715]: [Mon Oct  4 17:50:25 2021] Unable to set to nice = -20 -error Unknown error -1
    Oct 04 17:50:25 fb1 daqd_dc_mx[715]: Failed to do mx_get_info: MX not initialized.
    Oct 04 17:50:25 fb1 daqd_dc_mx[715]: 263596
    Oct 04 17:50:25 fb1 systemd[1]: daqd_dc.service: main process exited, code=exited, status=1/FAILURE
    Oct 04 17:50:25 fb1 systemd[1]: Unit daqd_dc.service entered failed state.
    
  • It seemed like the only thing daqd_dc process doesn't like is that mx_stream services are in failed state in teh FE computers. So we did the same process on FE machines to get the real error messages:
    controls@fb1:~ 0$ sudo chroot /diskless/root
    fb1:/ 0#
    fb1:/ 0# sudo nano /etc/systemd/system/mx_stream.service
    fb1:/ 0#
    fb1:/ 0# exit
  • Then I ssh'ed into c1bhd to see the error message on mx_stream service properly.
    controls@c1bhd:~ 0$ sudo systemctl daemon-reload
    controls@c1bhd:~ 0$ sudo systemctl restart mx_stream
    controls@c1bhd:~ 0$ sudo systemctl status mx_stream -l
    ● mx_stream.service - Advanced LIGO RTS front end mx stream
       Loaded: loaded (/etc/systemd/system/mx_stream.service; enabled)
       Active: failed (Result: exit-code) since Mon 2021-10-04 17:57:20 UTC; 24s ago
      Process: 11832 ExecStart=/etc/mx_stream_exec (code=exited, status=1/FAILURE)
     Main PID: 11832 (code=exited, status=1/FAILURE)
    
    Oct 04 17:57:20 c1bhd systemd[1]: Starting Advanced LIGO RTS front end mx stream...
    Oct 04 17:57:20 c1bhd systemd[1]: Started Advanced LIGO RTS front end mx stream.
    Oct 04 17:57:20 c1bhd mx_stream_exec[11832]: send len = 263596
    Oct 04 17:57:20 c1bhd mx_stream_exec[11832]: OMX: Failed to find peer index of board 00:00:00:00:00:00 (Peer Not Found in the Table)
    Oct 04 17:57:20 c1bhd mx_stream_exec[11832]: mx_connect failed Nic ID not Found in Peer Table
    Oct 04 17:57:20 c1bhd mx_stream_exec[11832]: c1x06_daq mmapped address is 0x7f516a97a000
    Oct 04 17:57:20 c1bhd mx_stream_exec[11832]: c1bhd_daq mmapped address is 0x7f516697a000
    Oct 04 17:57:20 c1bhd systemd[1]: mx_stream.service: main process exited, code=exited, status=1/FAILURE
    Oct 04 17:57:20 c1bhd systemd[1]: Unit mx_stream.service entered failed state.
    
  • c1sus2 shows the same error. I'm not sure I understand these errors at all. But they seem to have nothing to do with timing issuessurprise!

As usual, some help would be helpful

  16376   Mon Oct 4 18:00:16 2021 KojiSummaryCDSc1teststand problems summary

I don't know anything about mx/open-mx, but you also need open-mx,don't you?


controls@c1ioo:~ 0$ systemctl status *mx*
● open-mx.service - LSB: starts Open-MX driver
   Loaded: loaded (/etc/init.d/open-mx)
   Active: active (running) since Wed 2021-09-22 11:54:39 PDT; 1 weeks 5 days ago
  Process: 470 ExecStart=/etc/init.d/open-mx start (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/open-mx.service
           └─620 /opt/3.2.88-csp/open-mx-1.5.4/bin/fma -d

● mx_stream.service - Advanced LIGO RTS front end mx stream
   Loaded: loaded (/etc/systemd/system/mx_stream.service; enabled)
   Active: active (running) since Wed 2021-09-22 12:08:00 PDT; 1 weeks 5 days ago
 Main PID: 5785 (mx_stream)
   CGroup: /system.slice/mx_stream.service
           └─5785 /usr/bin/mx_stream -e 0 -r 0 -w 0 -W 0 -s c1x03 c1ioo c1als c1omc -d fb1:0

 

  16381   Tue Oct 5 17:58:52 2021 AnchalSummaryCDSc1teststand problems summary

open-mx service is running successfully on the fb1(clone), c1bhd and c1sus.

Quote:

I don't know anything about mx/open-mx, but you also need open-mx,don't you?


  16382   Tue Oct 5 18:00:53 2021 AnchalSummaryCDSc1teststand time synchronization working now

Today I got a new router that I used to connect the c1teststand, fb1 and chiara. I was able to see internet access in c1teststand and fb1, but not in chiara. I'm not sure why that is the case.

The good news is that the ntp server on fb1(clone) is working fine now and both FE computers, c1bhd and c1sus2 are succesfully synchronized to the fb1(clone) ntpserver. This resolves any possible timing issues in this DAQ network.

On running the IOP and user models however, I see the same errors are mentioned in 40m/16372. Something to do with:

Oct 06 00:47:56 c1sus2 mx_stream_exec[21796]: OMX: Failed to find peer index of board 00:00:00:00:00:00 (Peer Not Found in the Table)
Oct 06 00:47:56 c1sus2 mx_stream_exec[21796]: mx_connect failed Nic ID not Found in Peer Table
Oct 06 00:47:56 c1sus2 mx_stream_exec[21796]: c1x07_daq mmapped address is 0x7fa4819cc000
Oct 06 00:47:56 c1sus2 mx_stream_exec[21796]: c1su2_daq mmapped address is 0x7fa47d9cc000


Thu Oct 7 17:04:31 2021

I fixed the issue of chiara not getting internet. Now c1teststand, fb1 and chiara, all have internet connections. It was the issue of default gateway and interface and findiing the DNS. I have found the correct settings now.

  16391   Mon Oct 11 17:31:25 2021 AnchalSummaryCDSFixed mounting of mx devices in fb. daqd_dc is running now.
 
 

However, lspci | grep 'Myri' shows following output on both computers:

controls@fb1:/dev 0$ lspci | grep 'Myri'
02:00.0 Ethernet controller: MYRICOM Inc. Myri-10G Dual-Protocol NIC (rev 01)

Which means that the computer detects the card on PCie slot.

 

I tried to add this to /etc/rc.local to run this script at every boot, but it did not work. So for now, I'll just manually do this step everytime. Once the devices are loaded, we get:

controls@fb1:/etc 0$ ls /dev/*mx*
/dev/mx0  /dev/mx4  /dev/mxctl   /dev/mxp2  /dev/mxp6         /dev/ptmx
/dev/mx1  /dev/mx5  /dev/mxctlp  /dev/mxp3  /dev/mxp7
/dev/mx2  /dev/mx6  /dev/mxp0    /dev/mxp4  /dev/open-mx
/dev/mx3  /dev/mx7  /dev/mxp1    /dev/mxp5  /dev/open-mx-raw

The, restarting all daqd_ processes, I found that daqd_dc was running succesfully now. Here is the status:

controls@fb1:/etc 0$ sudo systemctl status daqd_* -l
● daqd_dc.service - Advanced LIGO RTS daqd data concentrator
   Loaded: loaded (/etc/systemd/system/daqd_dc.service; enabled)
   Active: active (running) since Mon 2021-10-11 17:48:00 PDT; 23min ago
 Main PID: 2308 (daqd_dc_mx)
   CGroup: /daqd.slice/daqd_dc.service
           ├─2308 /usr/bin/daqd_dc_mx -c /opt/rtcds/caltech/c1/target/daqd/daqdrc.dc
           └─2370 caRepeater

Oct 11 17:48:07 fb1 daqd_dc_mx[2308]: mx receiver 006 thread priority error Operation not permitted[Mon Oct 11 17:48:06 2021]
Oct 11 17:48:07 fb1 daqd_dc_mx[2308]: mx receiver 005 thread put on CPU 0
Oct 11 17:48:07 fb1 daqd_dc_mx[2308]: [Mon Oct 11 17:48:06 2021] [Mon Oct 11 17:48:06 2021] mx receiver 006 thread put on CPU 0
Oct 11 17:48:07 fb1 daqd_dc_mx[2308]: mx receiver 007 thread put on CPU 0
Oct 11 17:48:07 fb1 daqd_dc_mx[2308]: [Mon Oct 11 17:48:06 2021] mx receiver 003 thread - label dqmx003 pid=2362
Oct 11 17:48:07 fb1 daqd_dc_mx[2308]: [Mon Oct 11 17:48:06 2021] mx receiver 003 thread priority error Operation not permitted
Oct 11 17:48:07 fb1 daqd_dc_mx[2308]: [Mon Oct 11 17:48:06 2021] mx receiver 003 thread put on CPU 0
Oct 11 17:48:07 fb1 daqd_dc_mx[2308]: warning:regcache incompatible with malloc
Oct 11 17:48:07 fb1 daqd_dc_mx[2308]: [Mon Oct 11 17:48:06 2021] EDCU has 410 channels configured; first=0
Oct 11 17:49:06 fb1 daqd_dc_mx[2308]: [Mon Oct 11 17:49:06 2021] ->4: clear crc

● daqd_fw.service - Advanced LIGO RTS daqd frame writer
   Loaded: loaded (/etc/systemd/system/daqd_fw.service; enabled)
   Active: active (running) since Mon 2021-10-11 17:48:01 PDT; 23min ago
 Main PID: 2318 (daqd_fw)
   CGroup: /daqd.slice/daqd_fw.service
           └─2318 /usr/bin/daqd_fw -c /opt/rtcds/caltech/c1/target/daqd/daqdrc.fw

Oct 11 17:48:09 fb1 daqd_fw[2318]: [Mon Oct 11 17:48:09 2021] [Mon Oct 11 17:48:09 2021] Producer thread - label dqproddbg pid=2440
Oct 11 17:48:09 fb1 daqd_fw[2318]: Producer crc thread priority error Operation not permitted
Oct 11 17:48:09 fb1 daqd_fw[2318]: [Mon Oct 11 17:48:09 2021] [Mon Oct 11 17:48:09 2021] Producer crc thread put on CPU 0
Oct 11 17:48:09 fb1 daqd_fw[2318]: Producer thread priority error Operation not permitted
Oct 11 17:48:09 fb1 daqd_fw[2318]: [Mon Oct 11 17:48:09 2021] Producer thread put on CPU 0
Oct 11 17:48:09 fb1 daqd_fw[2318]: [Mon Oct 11 17:48:09 2021] Producer thread - label dqprod pid=2434
Oct 11 17:48:09 fb1 daqd_fw[2318]: [Mon Oct 11 17:48:09 2021] Producer thread priority error Operation not permitted
Oct 11 17:48:09 fb1 daqd_fw[2318]: [Mon Oct 11 17:48:09 2021] Producer thread put on CPU 0
Oct 11 17:48:10 fb1 daqd_fw[2318]: [Mon Oct 11 17:48:10 2021] Minute trender made GPS time correction; gps=1318034906; gps%60=26
Oct 11 17:49:09 fb1 daqd_fw[2318]: [Mon Oct 11 17:49:09 2021] ->3: clear crc

● daqd_rcv.service - Advanced LIGO RTS daqd testpoint receiver
   Loaded: loaded (/etc/systemd/system/daqd_rcv.service; enabled)
   Active: active (running) since Mon 2021-10-11 17:48:00 PDT; 23min ago
 Main PID: 2311 (daqd_rcv)
   CGroup: /daqd.slice/daqd_rcv.service
           └─2311 /usr/bin/daqd_rcv -c /opt/rtcds/caltech/c1/target/daqd/daqdrc.rcv

Oct 11 17:50:21 fb1 daqd_rcv[2311]: Creating C1:DAQ-NDS0_C1X07_CRC_SUM
Oct 11 17:50:21 fb1 daqd_rcv[2311]: Creating C1:DAQ-NDS0_C1BHD_STATUS
Oct 11 17:50:21 fb1 daqd_rcv[2311]: Creating C1:DAQ-NDS0_C1BHD_CRC_CPS
Oct 11 17:50:21 fb1 daqd_rcv[2311]: Creating C1:DAQ-NDS0_C1BHD_CRC_SUM
Oct 11 17:50:21 fb1 daqd_rcv[2311]: Creating C1:DAQ-NDS0_C1SU2_STATUS
Oct 11 17:50:21 fb1 daqd_rcv[2311]: Creating C1:DAQ-NDS0_C1SU2_CRC_CPS
Oct 11 17:50:21 fb1 daqd_rcv[2311]: Creating C1:DAQ-NDS0_C1SU2_CRC_SUM
Oct 11 17:50:21 fb1 daqd_rcv[2311]: Creating C1:DAQ-NDS0_C1OM[Mon Oct 11 17:50:21 2021] Epics server started
Oct 11 17:50:24 fb1 daqd_rcv[2311]: [Mon Oct 11 17:50:24 2021] Minute trender made GPS time correction; gps=1318035040; gps%120=40
Oct 11 17:51:21 fb1 daqd_rcv[2311]: [Mon Oct 11 17:51:21 2021] ->3: clear crc

Now, even before starting teh FE models, I see DC status as ox2bad in the CDS screens of the IOP and user models. The mx_stream service remains in a failed state at teh FE machines and remain the same even after restarting the service.

controls@c1sus2:~ 0$ sudo systemctl status mx_stream -l
● mx_stream.service - Advanced LIGO RTS front end mx stream
   Loaded: loaded (/etc/systemd/system/mx_stream.service; enabled)
   Active: failed (Result: exit-code) since Mon 2021-10-11 17:50:26 PDT; 15min ago
  Process: 382 ExecStart=/etc/mx_stream_exec (code=exited, status=1/FAILURE)
 Main PID: 382 (code=exited, status=1/FAILURE)

Oct 11 17:50:25 c1sus2 systemd[1]: Starting Advanced LIGO RTS front end mx stream...
Oct 11 17:50:25 c1sus2 systemd[1]: Started Advanced LIGO RTS front end mx stream.
Oct 11 17:50:25 c1sus2 mx_stream_exec[382]: Failed to open endpoint Not initialized
Oct 11 17:50:26 c1sus2 systemd[1]: mx_stream.service: main process exited, code=exited, status=1/FAILURE
Oct 11 17:50:26 c1sus2 systemd[1]: Unit mx_stream.service entered failed state.

But  if I restart the mx_stream service before starting the rtcds models, the mx-stream service starts succesfully:

controls@c1sus2:~ 0$ sudo systemctl restart mx_stream
controls@c1sus2:~ 0$ sudo systemctl status mx_stream -l
● mx_stream.service - Advanced LIGO RTS front end mx stream
   Loaded: loaded (/etc/systemd/system/mx_stream.service; enabled)
   Active: active (running) since Mon 2021-10-11 18:14:13 PDT; 25s ago
 Main PID: 1337 (mx_stream)
   CGroup: /system.slice/mx_stream.service
           └─1337 /usr/bin/mx_stream -e 0 -r 0 -w 0 -W 0 -s c1x07 c1su2 -d fb1:0

Oct 11 18:14:13 c1sus2 systemd[1]: Starting Advanced LIGO RTS front end mx stream...
Oct 11 18:14:13 c1sus2 systemd[1]: Started Advanced LIGO RTS front end mx stream.
Oct 11 18:14:13 c1sus2 mx_stream_exec[1337]: send len = 263596
Oct 11 18:14:13 c1sus2 mx_stream_exec[1337]: Connection Made

However, the DC status on CDS screens still show 0x2bad. As soon as I start the rtcds model c1x07 (the IOP model for c1sus2), the mx_stream service fails:

controls@c1sus2:~ 0$ sudo systemctl status mx_stream -l
● mx_stream.service - Advanced LIGO RTS front end mx stream
   Loaded: loaded (/etc/systemd/system/mx_stream.service; enabled)
   Active: failed (Result: exit-code) since Mon 2021-10-11 18:18:03 PDT; 27s ago
  Process: 1337 ExecStart=/etc/mx_stream_exec (code=exited, status=1/FAILURE)
 Main PID: 1337 (code=exited, status=1/FAILURE)

Oct 11 18:14:13 c1sus2 systemd[1]: Starting Advanced LIGO RTS front end mx stream...
Oct 11 18:14:13 c1sus2 systemd[1]: Started Advanced LIGO RTS front end mx stream.
Oct 11 18:14:13 c1sus2 mx_stream_exec[1337]: send len = 263596
Oct 11 18:14:13 c1sus2 mx_stream_exec[1337]: Connection Made
Oct 11 18:18:03 c1sus2 mx_stream_exec[1337]: isendxxx failed with status Remote Endpoint Unreachable
Oct 11 18:18:03 c1sus2 mx_stream_exec[1337]: disconnected from the sender
Oct 11 18:18:03 c1sus2 mx_stream_exec[1337]: c1x07_daq mmapped address is 0x7fe3620c3000
Oct 11 18:18:03 c1sus2 mx_stream_exec[1337]: c1su2_daq mmapped address is 0x7fe35e0c3000
Oct 11 18:18:03 c1sus2 systemd[1]: mx_stream.service: main process exited, code=exited, status=1/FAILURE
Oct 11 18:18:03 c1sus2 systemd[1]: Unit mx_stream.service entered failed state.

This shows that the start of rtcds model, causes the fail in mx_stream, possibly due to inability of finding the endpoint on fb1. I've again reached to the edge of my knowledge here. Maybe the fiber optic connection between fb and the network switch that connects to FE is bad, or the connection between switch and FEs is bad.

But we are just one step away from making this work.

 

 

  16392   Mon Oct 11 18:29:35 2021 AnchalSummaryCDSMoving forward?

The teststand has some non-trivial issue with Myrinet card (either software or hardware) which even teh experts are saying they don't remember how to fix it. CDS with mx was iin use more than a decade ago, so it is hard to find support for issues with it now and will be the same in future. We need to wrap up this test procedure one way or another now, so I have following two options moving forward:


Direct integration with main CDS and testing

  • We can just connect the c1sus2 and c1bhd FE computers to martian network directly.
  • We'll have to connect c1sus2 and c1bhd to the optical fiber subnetwork as well.
  • On booting, they would get booted through the exisitng fb1 boot server which seems to work fine for the other 5 FE machines.
  • We can update teh DHCP in chiara and reload it so that we can ssh into these FEs with host names.
  • Hopefully, presence of these computers won't tank the existing CDS even if they  themselves have any issues, as they have no shared memory with other models.
  • If this works, we can do the loop back testing of I/O chassis using the main DAQ network and move on with our upgrade.
  • If this does not work and causes any harm to exisitng CDS network, we can disconnect these computers and go back to existing CDS. Recently, our confidence on rebooting the CDS has increased with the robust performance as some legacy issues were fixed.
  • We'll however, continue to use a CDS which is no more supported by the current LIGO CDS group.

Testing CDS upgrade on teststand

  • From what I could gather, most of the hardware in I/O chassis that I could find, is still used in CDS of LLO and LHO, with their recent tests and documents using the same cards and PCBs.
  • There might be some difference in the DAQ network setup that I need to confirm.
  • I've summarised the current c1teststand hardware on this wiki page.
  • If the latest CDS is backwards compatible with our hardware, we can test the new CDS in teh c1teststand setup without disrupting our main CDS. We'll have ample help and support for this upgrade from the current LIGO CDS group.
  • We can do the loop back testing of the I/O chassis as well.
  • If the upgrade is succesfull in the teststand without many hardware changes, we can upgrade the main CDS of 40m as well, as it has the same hardware as our teststand.
  • Biggest plus point would be that out CDS will be up-to-date and we will be able to take help from CDS group if any trouble occurs.

So these are the two options we have. We should discuss which one to take in the mattermost chat or in upcoming meeting.

  16395   Tue Oct 12 17:10:56 2021 AnchalSummaryCDSSome more information

Chris pointed out some information displaying scripts, that show if the DAQ network is working or not. I thought it would be nice to log this information here as well.

controls@fb1:/opt/mx/bin 0$ ./mx_info
MX Version: 1.2.16
MX Build: controls@fb1:/opt/src/mx-1.2.16 Mon Aug 14 11:06:09 PDT 2017
1 Myrinet board installed.
The MX driver is configured to support a maximum of:
    8 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
===================================================================
Instance #0:  364.4 MHz LANai, PCI-E x8, 2 MB SRAM, on NUMA node 0
    Status:        Running, P0: Link Up
    Network:    Ethernet 10G

    MAC Address:    00:60:dd:45:37:86
    Product code:    10G-PCIE-8B-S
    Part number:    09-04228
    Serial number:    423340
    Mapper:        00:60:dd:45:37:86, version = 0x00000000, configured
    Mapped hosts:    3

                                                        ROUTE COUNT
INDEX    MAC ADDRESS     HOST NAME                        P0
-----    -----------     ---------                        ---
   0) 00:60:dd:45:37:86 fb1:0                             1,0
   1) 00:25:90:05:ab:47 c1bhd:0                           1,0
   2) 00:25:90:06:69:c3 c1sus2:0                          1,0

 

controls@c1bhd:~ 1$ /opt/open-mx/bin/omx_info
Open-MX version 1.5.4
 build: root@fb1:/opt/src/open-mx-1.5.4 Tue Aug 15 23:48:03 UTC 2017

Found 1 boards (32 max) supporting 32 endpoints each:
 c1bhd:0 (board #0 name eth1 addr 00:25:90:05:ab:47)
   managed by driver 'igb'

Peer table is ready, mapper is 00:60:dd:45:37:86
================================================
  0) 00:25:90:05:ab:47 c1bhd:0
  1) 00:60:dd:45:37:86 fb1:0
  2) 00:25:90:06:69:c3 c1sus2:0

 

controls@c1sus2:~ 0$ /opt/open-mx/bin/omx_info
Open-MX version 1.5.4
 build: root@fb1:/opt/src/open-mx-1.5.4 Tue Aug 15 23:48:03 UTC 2017

Found 1 boards (32 max) supporting 32 endpoints each:
 c1sus2:0 (board #0 name eth1 addr 00:25:90:06:69:c3)
   managed by driver 'igb'

Peer table is ready, mapper is 00:60:dd:45:37:86
================================================
  0) 00:25:90:06:69:c3 c1sus2:0
  1) 00:60:dd:45:37:86 fb1:0
  2) 00:25:90:05:ab:47 c1bhd:0

These outputs prove that the framebuilder and the FEs are able to see each other in teh DAQ network.


Further, the error that we see when IOP model is started which crashes the mx_stream service on the FE machines (see 40m/16391) :

isendxxx failed with status Remote Endpoint Unreachable

This has been seen earlier when Jamie was troubleshooting the current fb1 in martian network in 40m/11655 in Oct, 2015. Unfortunately, I could not find what Jamie did over a year to fix this issue.

  16396   Tue Oct 12 17:20:12 2021 AnchalSummaryCDSConnected c1sus2 to martian network

I connected c1sus2 to the martian network by splitting the c1sim connection with a 5-way switch. I also ran another ethernet cable from the second port of c1sus2 to the DAQ network switch on 1X7.

Then I logged into chiara and added the following in chiara:/etc/dhcp/dhcpd.conf :

host c1sus2 {
  hardware ethernet 00:25:90:06:69:C2;
  fixed-address 192.168.113.92;
}

And following line in chiara:/var/lib/bind/martian.hosts :

c1sus2          A    192.168.113.92

Note that entires c1bhd is already added in these files, probably during some earlier testing by Gautam or Jon. Then I ran following to restart the dhcp server and nameserver:

~> sudo service bind9 reload
[sudo] password for controls:
 * Reloading domain name service... bind9                                                 [ OK ]
~> sudo service isc-dhcp-server restart
isc-dhcp-server stop/waiting
isc-dhcp-server start/running, process 25764

Now, As I switched on c1sus2 from front panel, it booted over network from fb1 like other FE machines and I was able to login to it by first logging to fb1 and then sshing to c1sus2.

Next, I copied the simulink models and the medm screens of c1x06, xc1x07, c1bhd, c1sus2 from the paths mentioned on this wiki page. I also copied the medm screens from chiara(clone):/opt/rtcds/caltech/c1/medm to martian network chiara in the appropriate places. I have placed the file /opt/rtcds/caltech/c1/medm/teststand_sitemap.adl which can be used to open sitemap for c1bhd and c1sus2 IOP and user models.

Then I logged into c1sus2 (via fb1) and did make, install, start procedure:

controls@c1sus2:~ 0$ rtcds make c1x07
buildd: /opt/rtcds/caltech/c1/rtbuild/release
### building c1x07...
Cleaning c1x07...
Done
Parsing the model c1x07...
Done
Building EPICS sequencers...
Done
Building front-end Linux kernel module c1x07...
Done
RCG source code directory:
/opt/rtcds/rtscore/branches/branch-3.4
The following files were used for this build:
/opt/rtcds/userapps/release/cds/c1/models/c1x07.mdl

Successfully compiled c1x07
***********************************************
Compile Warnings, found in c1x07_warnings.log:
***********************************************
***********************************************
controls@c1sus2:~ 0$ rtcds install c1x07
buildd: /opt/rtcds/caltech/c1/rtbuild/release
### installing c1x07...
Installing system=c1x07 site=caltech ifo=C1,c1
Installing /opt/rtcds/caltech/c1/chans/C1X07.txt
Installing /opt/rtcds/caltech/c1/target/c1x07/c1x07epics
Installing /opt/rtcds/caltech/c1/target/c1x07
Installing start and stop scripts
/opt/rtcds/caltech/c1/scripts/killc1x07
/opt/rtcds/caltech/c1/scripts/startc1x07
sudo: unable to resolve host c1sus2
Performing install-daq
Updating testpoint.par config file
/opt/rtcds/caltech/c1/target/gds/param/testpoint.par
/opt/rtcds/rtscore/branches/branch-3.4/src/epics/util/updateTestpointPar.pl -par_file=/opt/rtcds/caltech/c1/target/gds/param/archive/testpoint_211012_174226.par -gds_node=24 -site_letter=C -system=c1x07 -host=c1sus2
Installing GDS node 24 configuration file
/opt/rtcds/caltech/c1/target/gds/param/tpchn_c1x07.par
Installing auto-generated DAQ configuration file
/opt/rtcds/caltech/c1/chans/daq/C1X07.ini
Installing Epics MEDM screens
Running post-build script

safe.snap exists 
controls@c1sus2:~ 0$ rtcds start c1x07
Cannot start/stop model 'c1x07' on host c1sus2.
controls@c1sus2:~ 4$ rtcds list

controls@c1sus2:~ 0$ 

One can see that even after making and installing, the model c1x07 is not listed as available models in rtcds list. Same is the case for c1sus2 as well. So I could not proceed with testing.

Good news is that nothing that I did affect the current CDS functioning. So we can probably do this testing safely from the main CDS setup.

  16397   Tue Oct 12 23:42:56 2021 KojiSummaryCDSConnected c1sus2 to martian network

Don't you need to add the new hosts to /diskless/root/etc/rtsystab at fb1? --> There looks many elogs talking about editing "rtsystab".

controls@fb1:/diskless/root/etc 0$ cat rtsystab
#
# host    list of control systems to run, starting with IOP
#
c1iscex  c1x01  c1scx c1asx
c1sus     c1x02  c1sus c1mcs c1rfm c1pem
c1ioo     c1x03  c1ioo c1als c1omc
c1lsc    c1x04  c1lsc c1ass c1oaf c1cal c1dnn c1daf
c1iscey  c1x05 c1scy c1asy
#c1test   c1x10  c1tst2

 

  16398   Wed Oct 13 11:25:14 2021 AnchalSummaryCDSRan c1sus2 models in martian CDS. All good!

Three extra steps (when adding new models, new FE):

  • Chris pointed out that the sudo command in c1sus2 is giving error
    sudo: unable to resolve host c1sus2
    
    This error comes in when the computer could not figure out it's own hostname. Since FEs are network booted off the fb1, we need to update the /etc/hosts in /diskless/root everytime we add a new FE.
    controls@fb1:~ 0$ sudo chroot /diskless/root
    fb1:/ 0# sudo nano /etc/hosts
    fb1:/ 0# exit
    
    I added the following line in /etc/hosts file above:
    192.168.113.92  c1sus2 c1sus2.martian
    
    This resolved the issue of sudo giving error. Now, the rtcds make and install steps had no errors mentioned in their outputs.
  • Another thing that needs to be done, as Koji pointed out, is to add the host and models in /etc/rtsystab in /diskless/root of fb:
    controls@fb1:~ 0$ sudo chroot /diskless/root
    fb1:/ 0# sudo nano /etc/rtsystab
    fb1:/ 0# exit
    
    I added the following lines in /etc/rtsystab file above:
    c1sus2   c1x07  c1su2
    
    This told rtcds what models would be available on c1sus2. Now rtcds list is displaying the right models:
    controls@c1sus2:~ 0$ rtcds list
    c1x07
    c1su2
  • The above steps are still not sufficient for the daqd_ processes to know about the new models. This part is supossed to happen automatically, but does not happen in our CDS apparently. So everytime there is a new model, we need to edit the file /opt/rtcds/caltech/c1/target/daqd/master and add following lines to it:
    # Fast Data Channel lists
    # c1sus2
    /opt/rtcds/caltech/c1/chans/daq/C1X07.ini
    /opt/rtcds/caltech/c1/chans/daq/C1SU2.ini
    
    # test point lists
    # c1sus2
    /opt/rtcds/caltech/c1/target/gds/param/tpchn_c1x07.par
    /opt/rtcds/caltech/c1/target/gds/param/tpchn_c1su2.par
    
    I needed to restart the daqd_ processes in  fb1 for them to notice these changes:
    controls@fb1:~ 0$ sudo systemctl restart daqd_*
    
    This finally lit up the status channels of DC in C1X07_GDS_TP.adl and C1SU2_GDS_TP.adl . However the channels C1:DAQ-DC0_C1X07_STATUS and C1:DAQ-DC0_C1SU2_STATUS both have values 0x2bad. This persists on restarting the models. I then just simply restarted teh mx_stream on c1sus2 and boom, it worked! (see attached all green screen, never seen before!)

So now Ian can work on testing the I/O chassis and we would be good to move c1sus2 FE and I/O chassis to 1Y3 after that. I've also done following extra changes:

  • Updated CDS_FE_STATUS medm screen to show the new c1sus2 host.
  • Updated global diag rest script to act on c1xo7 and c1su2 as well.
  • Updated mxstream restart script to act on c1sus2 as well.
  16414   Tue Oct 19 18:20:33 2021 Ian MacMillanSummaryCDSc1sus2 DAC to ADC test

I ran a DAC to ADC test on c1sus2 channels where I hooked up the outputs on the DAC to the input channels on the ADC. We used different combinations of ADCs and DACs to make sure that there were no errors that cancel each other out in the end. I took a transfer function across these channel combinations to reproduce figure 1 in T2000188.

As seen in the two attached PDFs the channels seem to be working properly they have a flat response with a gain of 0.5 (-6 dB). This is the response that is expected and is the result of the DAC signal being sent as a single ended signal and the ADC receiving as a differential input signal. This should result in a recorded signal of 0.5 the amplitude of the actual output signal.

The drop off on the high frequency end is the result of the anti-aliasing filter and the anti-imaging filter. Both of these are 8-pole elliptical filters so when combined we should get a drop off of 320dB per decade. I measured the slope on the last few points of each filter and the averaged value was around 347dB per decade. This is slightly steeper than expected but since it is to cut off higher frequencies it shouldn't have an effect on the operation of the system. Also it is very close to the expected value.

The ripples seen before the drop off are also an effect of the elliptical filters and are seen in T2000188.

Note: the transfer function that doesn't seem to match the others is the heartbeat timing signal.

  16415   Tue Oct 19 23:43:09 2021 KojiSummaryCDSc1sus2 DAC to ADC test

(Because of a totally unrelated reason) I was checking the electronics units for the upgrade. And I realized that the electronics units at the test stand have not been properly powered.

I found that the AA/AI stack at the test stand (Attachment 1) has an unusual powering configuration (Attachment 2).
- Only the positive power supply was used / - The supply voltage is only +15V / - The GND reference is not connected to anywhere.

For confirmation, I checked the voltage across the DC power strip (Attachments 3/4). The positive was +5.3V and the negative was -9.4V. This is subject to change depending on the earth potential.

This is not a good condition at all. The asymmetric powering of the circuit may cause damages to the opamps. So I turned off the switches of the units.

The power configuration should be immediately corrected.

  1. Use both positive and negative supply (2 power supply channels) to produce the positive and the negative voltage potentials. Connect the reference potential to the earth post of the power supply.
    https://www.youtube.com/watch?v=9_6ecyf6K40   [Dual Power Supply Connection / Serial plus minus electronics laboratory PS with center tap]
  2. These units have DC power regulator which produces +/-15V out of +/-18V. So the DC power supplies are supposed to be set at +18V.

 

  16417   Wed Oct 20 11:48:27 2021 AnchalSummaryCDSPower supple configured correctly.

This was horrible! That's my bad, I should have checked the configuration before assuming that it is right.

I fixed the power supply configuration. Now the strip has two rails of +/- 18V and the GND is referenced to power supply earth GND.

Ian should redo the tests.

  16430   Tue Oct 26 18:24:00 2021 Ian MacMillanSummaryCDSc1sus2 DAC to ADC test

[Ian, Anchal, Paco]

After the Koji found that there was a problem with the power source Anchal and I fixed the power then reran the measurment. The only change this time around is that I increased the excitation amplitude to 100. In the first run the excitation amplitude was 1 which seemed to come out noise free but is too low to give a reliable value.

link to previous results

The new plots are attached.

  16495   Thu Dec 9 00:32:56 2021 TegaUpdateCDSNew SUS medm screen update

The new SUS screen can be reached via sitemap -> IFO SUS button -> NEW ETMX dropdown menu link. Please use and provide feedback. Not sure exactly if we need/want the display screens after the IOP model on the right of the medm screen. I have not been able to locate the corresponding channels but did not want to remove them until I was sure that we don't plan to add these features to our screens. When all bugs have been ironed out, we can use appropriate macro substitution for the other optics.

The next feature to add is the BLRMS to the coil and PD channels. I plan to combine the PEM BLRMS medm implementation with the sus_single_BLRMS model block (located in  /opt/rtcds/userapps/release/cds/c1/models). This way we use the latest BLRMS block in "/opt/rtcds/userapps/release/cds/common/models/BLRMS.mdl" whilst also leveraging the previous work done on the sus_single_BLRMS model, which neatly fits into our current SUS model.

  16496   Thu Dec 9 18:22:36 2021 TegaUpdateCDSNew SUS medm screen update

Work on the medm screen for SUS RMS monitor is ongoing. The next step would be to incorporate this into the SUS medm screen, add the BLRMS model to the SUS controller model, recompile, check that the channels are being correctly addressed, then load the appropriate bandpass and lowpass filters.  

  16500   Fri Dec 10 18:55:58 2021 TegaUpdateCDSNew SUS medm screen update

Turns out the BLRMS monitoring channels for MC1, MC2, MC3, ITMY and SRM already exist in c1pem. So I modified the new SUS screen to display the BLRMS info for the aforementioned optics. Next step is to add the BLRMS monitor for PRM, ITMX, ETMX and ETMY. This would require extending the number of inputs for the "SUS" block in c1pem to accomodate the additional inputs from the remaining optics.

  16533   Wed Dec 22 17:40:22 2021 AnchalSummaryCDSc1su2 model updated with SUS damping blocks for 7 SOSs

[Anchal, Koji]

I've updated the c1su2 model today with model suspension blocks for the 7 new SOSs (LO1, LO2, AS1, AS4, SR2, PR2 and PR3). The model is running properly now but we had some difficulty in getting it to run.

Initially, we were getting 0x2000 error on the c1su2 model CDS screen. The issue probably was high data transmission required for all the 7 SOSs in this model. Koji dug up a script /opt/rtcds/caltech/c1/userapps/trunk/cds/c1/scripts/activateDQ.py that has been used historically for updating the data rate on some of theDQ channels in the suspension block. However, this script was not working properly for Koji, so he create a new script at /opt/rtcds/caltech/c1/chans/daq/activateSUS2DQ.py.

[Ed by KA: I could not make this modified script run so that I replaces the input file (i.e. C1SU2.ini). So the output file is named C1SU2.ini.NEW and need to manually replace the original file.]

With this, Koji was able to reduce acquisition rate of SUSPOS_IN1_DQ, SUSPIT_IN1_DQ, SUSYAW_IN1_DQ, SUSSIDE_IN1_DQ, SENSOR_UL, SENSOR_UR, SENSOR_LL,SENSOR_LR, SENSOR_SIDE, OPLEV_PERROR, OPLEV_YERROR, and OPLEV_SUM to 2048 Sa/s. The script modifies the /opt/rtcds/caltech/c1/chans/daq/C1SU2.ini file which would get re-written if c1su2 model is remade and reinstalled. After this modification, the 0x2000 error stopped appearing and the model is running fine.


Should we change the library model part for sus_single_control.mdl

We notice that all our suspension models need to go through this weird python script modifying auto-generated .ini files to reduce the data rate. Ideally, there is a simpler solution to this by simply adding the datarate 2048 in the '#DAQ Channels' block in the model library part /cvs/cds/rtcds/userapps/trunk/sus/c1/models/lib/sus_single_control.mdl which is the root model in all the suspensions. With this change, the .ini files will automatically be written with correct datarate and there will be no need for using the activateDQ script. But we couldn't find why this simple solution was not implemented in the past, so we want to know if there is more stuff going on here then we know. Changing the library model would obviously change every suspension model and we don't want a broken CDS system on our head at the begining of holidays, so we'll leave this delicate task for the near future.

  16537   Wed Dec 29 20:09:40 2021 ranaSummaryCDSc1su2 model updated with SUS damping blocks for 7 SOSs

We want to maintain the 16 kHz sample rate for the COIL DAQ channels, but nothing wrong with reducing the others.

I would suggest setting the DQ sample rates to 256 Hz for the SUS DAMP channels and 1024 Hz for the OPLEV channels (for noise diagnostics).

Maybe you can put these numbers into a new library part and we can have the best of all worlds?

Quote:
 

Should we change the library model part for sus_single_control.mdl

We notice that all our suspension models need to go through this weird python script modifying auto-generated .ini files to reduce the data rate. Ideally, there is a simpler solution to this by simply adding the datarate 2048 in the '#DAQ Channels' block in the model library part /cvs/cds/rtcds/userapps/trunk/sus/c1/models/lib/sus_single_control.mdl which is the root model in all the suspensions. With this change, the .ini files will automatically be written with correct datarate and there will be no need for using the activateDQ script. But we couldn't find why this simple solution was not implemented in the past, so we want to know if there is more stuff going on here then we know. Changing the library model would obviously change every suspension model and we don't want a broken CDS system on our head at the begining of holidays, so we'll leave this delicate task for the near future.

 

  16546   Thu Jan 6 12:52:49 2022 AnchalUpdateCDSYearly DAQD fix 2022!

Just as predicted, all realtime models reported "0x4000" error. Read the parent post for more details. I fixed this by following the instructions. I add folowing lines to the file /opt/rtcds/rtscore/release/src/include/drv/spectracomGPS.c in fb1:

/* 2020 had 366 days and no leap second */
       pHardware->gpsOffset += 31622400;
/* 2021 had no leap seconds or leap days, so adjust for that */
       pHardware->gpsOffset += 31536000;

Then is made the package and reloaded it after stoping the daqd services. This brought back all the fast models except C1SUS2 models which are in red due to some other reason that I'll investigate further.

 

ELOG V3.1.3-