ID | Date | Author | Type | Category | Subject
15518 | Wed Aug 12 20:14:06 2020 | Koji | Update | CDS | Timing distribution slot availability
I believe we will use two new chassis at most. We'll replace c1ioo from Sun to Supermicro, but we recycle the existing timing system. |
15521 | Thu Aug 13 11:30:19 2020 | gautam | Update | CDS | Timing distribution slot availability
That's great. I wonder if we can also get away with not adding new Dolphin infrastructure. I'd really like to avoid changing any IPC drivers.
Quote: |
I believe we will use two new chassis at most. We'll replace c1ioo from Sun to Supermicro, but we recycle the existing timing system.
|
|
15522 | Thu Aug 13 13:35:13 2020 | Koji | Update | CDS | Timing distribution slot availability
The new dolphin eventually helps us. But the installation is an invasive change to the existing system and should be done at the installation stage of the 40m BHD. |
15524 | Fri Aug 14 00:01:55 2020 | gautam | Update | CDS | BHD / OMC model channels now added to autoburt
I added the EPICS channels for the c1omc model (gains, matrix elements etc.) to the autoburt so that we have a record of these, since we expect these models to be running somewhat regularly now, and I also expect many CDS crashes. |
15525 | Fri Aug 14 10:03:37 2020 | Jon | Update | CDS | Timing distribution slot availability
That's great news that we won't have to worry about a new timing fanout for the two new machines, c1bhd and c1sus2. And there's no plan to change the Dolphin IPC drivers: the plan is only to install the same (older) version of the driver on the two new machines and plug into free slots in the existing switch.
Quote: |
The new dolphin eventually helps us. But the installation is an invasive change to the existing system and should be done at the installation stage of the 40m BHD.
|
|
15564 | Tue Sep 8 11:49:04 2020 | gautam | Update | CDS | Some path changes
I edited /diskless/root.jessie/home/controls/.bashrc so that I don't have to keep doing this every time I do a model recompile.
Quote: |
Where is this variable set and how can I add the new paths to it?
export RCG_LIB_PATH=/opt/rtcds/userapps/release/isc/c1/models/isc/:/opt/rtcds/userapps/release/isc/c1/models/cds/:/opt/rtcds/userapps/release/isc/c1/models/sus/:$RCG_LIB_PATH
|
|
15578 | Wed Sep 16 17:44:27 2020 | gautam | Update | CDS | All vertex FE models restarted
I had to make a CDS change to the c1lsc model in an effort to get a few more signals into the models. Rather than risk requiring hard reboots (typically my experience if I try to restart a model), I opted for the more deterministic scripted reboot, at the expense of spending ~20 mins to get everything back up and running.
Update 2230: this was more complicated than expected - a nuclear reboot was necessary but now everything is back online and functioning as expected. While all the CDS indicators were green when I wrote this up at ~1800, the c1sus model was having frequent CPU overflows (execution time > 60 us). Not sure why this happened, or why a hard power reboot of everything fixed it, but I'm not delving into this.
The point of all this was that I can now simultaneously digitize 4 channels - 2 DCPDs, and 2 demodulated quadratures of an RF signal. |
Attachment 1: CDS.png
|
|
15609 | Sat Oct 3 16:51:27 2020 | gautam | Update | CDS | RFM errors
Attachment #1 shows that the c1rfm model isn't able to receive any signals from the front end machines at EX and EY. Attachment #2 shows that the problem appears to have started at ~4:30 am this morning - I certainly wasn't doing anything with the IFO at that time.
I don't know what kind of error this is - what does it mean that the receiving model shows errors but the sender shows no errors? It is not a new kind of error, and the solution in the past has been a series of model reboots, but it'd be nice if we could fix such issues because it eats up a lot of time to reboot all the vertex machines. There is no diagnostic information available in all the places I looked. I'll ask the CDS group for help, but I'm not sure if they'll have anything useful since this RFM technology has been retired at the sites (?).
In the meantime, arm cavity locking in the usual way isn't possible since we don't have the trigger signals from the arm cavity transmission.
Update 1500 4 Oct: soft reboots of models didn't do the trick so I had to resort to hard reboots of all FEs/expansion chassis. Now the signals seem to be okay. |
Attachment 1: RFMstat.png
|
|
Attachment 2: RFMerrs.png
|
|
15646 | Wed Oct 28 09:35:00 2020 | Koji | Update | CDS | RFM errors
I'm starting the model restarts from remote. Then later I'll show up in the lab to do more hard resets.
==> It seems that the RFM errors are gone. Here are the steps.
- Shutdown all the watchdogs
- login to c1iscex. Shutdown all the realtime models: rtcds kill --all
- login to c1iscey. Shutdown all the realtime models: rtcds kill --all
- run scripts/cds/rebootC1LSC.sh on pianosa
- reboot c1iscex
- reboot c1iscey
- Wait until the script brings all the machines/models back up
- restart c1iscex models
- restart c1iscey models
- Some IPC errors are still visible on the CDS status screen. Launch c1daf and c1oaf.
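A consolidated shell sketch of the recovery sequence above (the full script path and the "start --all" form are assumptions mirroring the "kill --all" used above; the watchdog shutdown is done from the MEDM screens, so it is omitted here):
# after shutting down all watchdogs:
ssh c1iscex "rtcds kill --all"
ssh c1iscey "rtcds kill --all"
# on pianosa:
/opt/rtcds/caltech/c1/scripts/cds/rebootC1LSC.sh
# reboot the end machines, wait for the script to bring everything back, then:
ssh c1iscex "rtcds start --all"
ssh c1iscey "rtcds start --all"
# finally restart c1daf and c1oaf if IPC errors remain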
|
Attachment 1: Screen_Shot_2020-10-28_at_10.06.00.png
|
|
15659 | Wed Nov 4 17:14:49 2020 | gautam | Update | CDS | c1bhd setup
I am working on the setup of a CDS FE, so please do not attempt any remote login to the IPMI interface of c1bhd until I'm done.
|
15663 | Fri Nov 6 14:27:16 2020 | gautam | Update | CDS | c1bhd setup - diskless boot
I was able to boot one of the 3 new Supermicro machines, which I christened c1bhd, in a diskless way (with the boot image hosted on fb, as is the case for all the other realtime FEs in the lab). This is just a first test, but it is reassuring that we can get this custom linux kernel to boot on the new hardware. Some errors about dolphin drivers are thrown at startup but this is to be expected since the server isn't connected to the dolphin network yet. We have the Dolphin adaptor card in hand, but since we have to get another PCIe card (supposedly from LLO according to the BHD spreadsheet), I defer installing this in the server chassis until we have all the necessary hardware on hand.
I also have to figure out the correct BIOS settings for this to really run effectively as a FE (we have to disable all the "unnecessary" system-level services) - these machines have BIOS v3.2, as opposed to the older vintages for which there are instructions from K.T. et al.
There may yet be issues with drivers, but this is all the testing that can be done without getting an expansion chassis. After the vent and recovering the IFO, I may try experimenting with the c1ioo chassis, but I'd much prefer if we can do the testing offline on a subnet that doesn't mess with the regular IFO operation (until we need to test the IPC).
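For reference, two generic checks to confirm the diskless NFS root on such a machine (the root.jessie path is the one mentioned in elog 15564; nothing here is specific to this server):
cat /proc/cmdline      # should show an NFS root pointing at fb, e.g. nfsroot=...:/diskless/root.jessie
mount | grep 'on / '   # confirms / is an NFS mount served by fb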
Quote: |
I am working on the setup of a CDS FE, so please do not attempt any remote login to the IPMI interface of c1bhd until I'm done.
|
|
Attachment 1: Screenshot_2020-11-06_14-26-54.png
|
|
15695 | Wed Dec 2 17:54:03 2020 | gautam | Update | CDS | FE reboot
As discussed at the meeting, I commenced the recovery of the CDS status at 1750 local time.
- Started by attempting to just soft-restart the c1rfm model and see if that fixes the issue. It didn't, and what's more, it took down the c1sus machine.
- So hard reboots of the vertex machines were required. c1iscey also crashed. I was able to keep the EX machine up, but I soft-stopped all the RT models on it.
- All systems were recovered by 1815. For anyone checking, the DC light on the c1oaf model is red - this is a "known" issue and requires a model restart, but I don't want to get into that now and it doesn't disrupt normal operation.
Single arm POX/POY locking was checked, but not much more. Our IMC WFS are still out of service, so I hand-aligned the IMC a bit; IMC REFL DC went from ~0.3 to ~0.12, which is the usual nominal level. |
15719 | Wed Dec 9 15:37:48 2020 | gautam | Update | CDS | RFM switch IP addr reset
I suspect what happened here is that the IP didn't get updated when we went from the 131.215.113.xxx system to 192.168.113.xxx system. I fixed it now and can access the web interface. This system is now ready for remote debugging (from inside the martian network obviously). The IP is 192.168.113.90.
Managed to pull this operation off without crashing the RFM network, phew.
BTW, a Windows laptop that used to be in the VEA (I last remember it being on the table near MC2, which was cleared at some point to hold the spare suspensions) is missing. Anyone know where this is? |
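For remote debugging later, a minimal sanity check of the web interface from inside the martian network would be something like (IP as above; curl is just a suggestion):
ping -c 3 192.168.113.90
curl -sI http://192.168.113.90/    # should return an HTTP header from the switch's web server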
Attachment 1: Screenshot_2020-12-09_15-39-20.png
|
|
Attachment 2: Screenshot_2020-12-09_15-46-46.png
|
|
15737 | Fri Dec 18 10:52:17 2020 | gautam | Update | CDS | RFM errors
As I was working on the IFO re-alignment just now, the rfm errors popped up again. I don't see any useful diagnostics on the web interface.
Do we want to take this opportunity to configure jumpers and set up the rogue master as Rolf suggested? Of course there's no guarantee that will fix anything, and may possibly make it impossible to recover the current state... |
Attachment 1: RFMdiag.png
|
|
15738 | Fri Dec 18 22:59:12 2020 | Jon | Configuration | CDS | Updated CDS upgrade plan
Attached is the layout for the "intermediate" CDS upgrade option, as was discussed on Wednesday. Under this plan:
- Existing FEs stay where they are (they are not moved to a single rack)
- Dolphin IPC remains PCIe Gen 1
- RFM network is entirely replaced with Dolphin IPC
Please send me any omissions or corrections to the layout. |
Attachment 1: CDS_2020_Dec.pdf
|
|
Attachment 2: CDS_2020_Dec.graffle
|
15742 | Mon Dec 21 09:28:50 2020 | Jamie | Configuration | CDS | Updated CDS upgrade plan
Quote: |
Attached is the layout for the "intermediate" CDS upgrade option, as was discussed on Wednesday. Under this plan:
- Existing FEs stay where they are (they are not moved to a single rack)
- Dolphin IPC remains PCIe Gen 1
- RFM network is entirely replaced with Dolphin IPC
Please send me any omissions or corrections to the layout.
|
I just want to point out that if you move all the FEs to the same rack they can all be connected to the Dolphin switch via copper, and you would only have to string a single fiber to every IO rack, rather than the multiple runs needed now (for network, dolphin, timing, etc.). |
15743 | Mon Dec 21 18:18:03 2020 | gautam | Update | CDS | Many model changes
The CDS model changes required to get the AS WFS signals into the RTCDS system are rather invasive.
- We use VCS for these models. Linus Torvalds may question my taste, but I also made local backups of the models, just in case...
- Particularly, the ADC1 card on c1ioo is completely reconfigured.
- I also think it's more natural to do all the ASC computations on this machine rather than the c1lsc machine (it is currently done in the c1ass model). So there are many IPC changes as well.
- I have documented everything carefully, and the compile/install went smoothly.
- Taking down all the FE servers at 1830 local time.
- To propagate the model changes
- To make a hardware change in the c1rfm card in the c1ioo machine to configure it as "ROGUE MASTER 0"
- To clear the RFM errors we are currently suffering from (this requires a model reboot anyway).
- Recovery was completed by 1930 - the RFM errors are also cleared, and now we have a "ROGUE MASTER 👾" on the network. Pretty smooth, no major issues with the CDS part of the procedure to report.
- The main issue is that in the AA chassis I built, Ch14 (with the first channel as Ch1) has the output saturated to 28V (differential). I'm not sure what kind of overvoltage protection the ADC has - we frequently have the inputs exceed the spec'd +/-20 V (e.g. when the whitening filters are engaged and the cavity is fringing), but pending further investigation, I am removing the SCSI connection from the rear of the AA chassis.
In terms of computational load, the c1ioo model seems to be able to handle the extra load with no issues - ~35 us / 60 us per cycle. The RFM model shows no extra computational time.
After this work, the IMC locking and POX/POY locking, and dither alignment servos are working okay. So I have some confidence that my invasive work hasn't completely destroyed everything. There is some hardware around the rear of 1X2 that I will clear tomorrow. |
Attachment 1: CDSoverview.png
|
|
15744 | Tue Dec 22 22:11:37 2020 | gautam | Update | CDS | AA filt repaired and reinstalled
Koji fixed the problematic channel - the issue was a bad solder joint on the input resistors to the THS4131. The board was re-installed. I also made a custom 2x4-pin LEMO-->DB9 cable, so we are now recording the PMC and FSS ERR/CTRL channel diagnostics again (spectra tomorrow). Note that Ch32 is recording some sort of DuoTone signal and so is not usable. This is due to a misconfiguration - ADC0 CH31 is the one which is supposed to be reserved for this timing signal, and not ADC1 as we currently have. When we swap the c1ioo hosts, we should fix this issue.
I also did most of the work to make the MEDM screens for the revised ASC topology, tried to mirror the site screens where possible. The overview screen remains to be done. I also loaded the anti-whitening filters (z:p 150:15) at the demodulated WFS input signal entry points. We don't have remote whitening switching capability at this time, so I'll test the switching manually at some point.
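For reference, a single z:p = 150:15 stage corresponds to H(f) = (1 + i f / 150 Hz) / (1 + i f / 15 Hz): unity gain at DC, rolling off to a gain of 15/150 = 0.1 above ~150 Hz, i.e. the inverse of a 15:150 whitening stage (stated here only to make the convention explicit).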
Quote: |
The main issue is that in the AA chassis I built, Ch14 (with the first channel as Ch1) has the output saturated to 28V (differential). I'm not sure what kind of overvoltage protection the ADC has - we frequently have the inputs exceed the spec'd +/-20 V (e.g. when the whitening filters are engaged and the cavity is fringing), but pending further investigation, I am removing the SCSI connection from the rear of the AA chassis.
|
|
15745 | Wed Dec 23 10:13:08 2020 | gautam | Update | CDS | Near term upgrades
Summary:
- There appears to be an insufficient number of PCIe slots in the new Supermicro servers that were bought for the BHD upgrade.
- Modulo a "riser card", we have all the parts in hand to put one of the end machines on the Dolphin network. If the Rogue Master doesn't improve the situation, we should consider installing a Dolphin card in the c1iscex server and connecting it to the Dolphin network at the next opportunity.
Details:
Last night, I briefly spoke with Koji about the CDS upgrade plan. This is a follow up.
Each server needs a minimum of two peripheral devices added to the PCIe bus:
- A PCIe interface card that connects the server to the Expansion Chassis (copper or optical fiber depending on distance between the two).
- A Dolphin or RFM card that makes the IPC interconnects.
- I'm pretty certain the expansion chassis cannot support the Dolphin / RFM card (it's only meant to be for ADCs/DACs/BIO cards). At least, all the existing servers in the 40m have at least 2 PCIe cards installed, and I think we have enough to worry about without trying to engineer a hack.
- I attach some photos of new and old Supermicro servers to illustrate my point, see Attachment #1.
As for the second issue, the main question is whether we have an open PCIe slot on the c1iscex machine to install a Dolphin card. Looks like the 2 standard slots are taken (see Attachment #1), but a "low profile" slot is available. I can't find the exact models of the Supermicro servers installed back in 2010, but maybe it's this one? It's a good match visually anyway. The manual says a "riser card" is required. I don't know if such a riser is already installed.
Questions I have, Rolf is probably the best person to answer:
- Can we even use the specified host adaptor, HIB25-X4, which is "PCIe Gen2", with the "PCIe Gen3" slots on the new Supermicro servers we bought? Anyway, the fact that the new servers have only 1 PCIe slot probably makes this a moot point.
- Can we overcome this slot limitation by installing the Dolphin / RFM card in the expansion chassis?
- In the short run (i.e. if it is much faster than the full CDS shipment we are going to receive), can we get (from the CDS test stand in Downs or the site)
- A riser card so that we may install the Gen 1 Dolphin card (which we have in hand) in the c1iscex server?
- A compatible (not sure what PCIe generation we need) PCIe host to ECA kit so we can test out the replacement for the Sun Microsystems c1ioo server?
- A spare RFM card (VMIC 5565, also for the above purpose).
- What sort of a test should we run to test the new Dolphin connection? Make a "null channel" differencing the same signal (e.g. TRX) sent via RFM and Dolphin? Or is there some better checksum diagnostic?
|
Attachment 1: IMG_0020.pdf
|
|
15746 | Wed Dec 23 23:06:45 2020 | gautam | Configuration | CDS | Updated CDS upgrade plan
- The diagram should clearly show the host machines and the expansion chassis and the interconnects between them.
- We no longer have any Gentoo bootserver or diskless FEs.
- The "c1lsc" host is in 1X4 not 1Y3.
- The connection between c1lsc and Dolphin switch is copper not fiber. I don't know how many Gbps it is. But if the switch is 10 Gbps, are they really selling interface cables that have lower speed? The datasheet says 10 Gbps.
- The control room workstations - Debian 10 (rossa) is the way forward, I believe. It is true that pianosa remains SL7 (and we should continue to keep it so until all other machines have been upgraded and tested on Debian 10).
- There is no "IOO/OAF". The host is called "c1ioo".
- The interconnect between Dolphin switch and c1ioo host is via fiber not copper.
- It'd be good to have an accurate diagram of the current situation as well (with the RFM network).
- I'm not sure if the 1Y1 rack can accommodate 2 FEs and 2 expansion chassis. Maybe if we clear everything else there out...
- There are 2 "2GB/s" Copper traces. I think the legend should make clear what's going on - i.e. which cables are ethernet (Cat 6? Cat 5? What's the speed limitation? The cable? Or the switch?), which are PCIe cables etc etc.
I don't have omnigraffle - what about uploading the source doc in a format that the excellent (and free) draw.io can handle? I think we can do a much better job of making this diagram reflect reality. There should also be a corresponding diagram for the Acromag system (but that doesn't have to be tied to this task). Megatron (scripts machine) and nodus should be added to that diagram as well.
Quote: |
Please send me any omissions or corrections to the layout.
|
|
15760 | Tue Jan 12 08:21:47 2021 | anchal | HowTo | CDS | Acromag wiring investigation
I used an Acromag XT1221 in CTN to play around with different wiring and see what works. Following are my findings:
Referenced Single Ended Source (Attachment 1):
- If the source signal is referenced single ended, i.e. the signal is only on the positive output and the negative side is tied to GND on the source side AND this GND is also shared by the power supply powering the Acromag, then no additional wiring is required.
- The GND common to the power supply and the source is not required to be Earth GND, but if it is tied to Earth GND, that should be done at one point only.
- RTN terminal on Acromag can be left floating or tied to IN- terminal.
Floating Single Ended Source (Attachment 2):
- If the source signal is floating single-ended, i.e. the signal is only on the positive output and the negative output is a floating GND on the source, then the IN- should be connected to RTN.
- This is the case for handheld calibrators or battery powered devices.
- Note that there is no need to connect GND of power supply to the floating GND on the source.
Differential Source (Attachment 3):
- If the source is differential output i.e. the signal is on both the positive output and the negative output, then connect one of the RTN terminals on Acromag to Earth GND. It has to be Earth GND for this to work.
- Note that you can no longer tie the IN- of different signals to RTN as they are all carrying different negative output from the source.
- Earth GND at RTN gives the Acromag a stable voltage reference to measure the signals coming in on IN+ and IN- against. And the most stable voltage reference is of course Earth GND.
Conclusion:
- We might have a mix of these three types of signals coming to a single Acromag box. In that case, we have to make sure we are not connecting the different IN- to each other (maybe through RTN) as the differential negative inputs carry signal, not a constant voltage value.
- In this case, I think it would be fine to always use differential signal wiring diagram with the RTN connected to Earth GND.
- There's no difference in software configuration for the two types of inputs, differential or single-ended.
- For cases in which we install the Acromag box inside a rack-mount chassis with an associated board (example: CTN/2248), it is OK, and maybe best, to use the Attachment 1 wiring diagram.
Comments and suggestions are welcome.
Related elog posts:
40m/14841 40m/15134
Edit Tue Jan 26 12:44:19 2021 :
Note that the third wiring diagram mentioned actually does not work. It was an error in judgement. See 40m/15762 for what happens in that case. |
Attachment 1: SingleEndedNonFloatingWiring.pdf
|
|
Attachment 2: SingleEndedFloatingWiring.pdf
|
|
Attachment 3: DifferentialSignalWiring.pdf
|
|
15761 | Tue Jan 12 11:42:38 2021 | gautam | HowTo | CDS | Acromag wiring investigation
Thanks for the systematic effort.
- Can you please post some time domain plots (ndscope preferably, or StripTool) to clearly show the different failure modes?
- The majority of our AI channels are "Referenced Single Ended Source" in your terminology. At least on the c1psl Acromag crate, there are no channels that are truly differential drive (case #3 in your terminology). I think the point is that we saw noisy inputs when the IN- wasn't connected to RTN. e.g. the thorlabs photodiode has a BNC output with the shield connected to GND and the central conductor carrying a signal, and that channel was noisy when the RTN was unconnected. Is that consistent with your findings?
- What is the prescription when we have multiple power supplies (mixture of Sorensens in multiple racks, Thorlabs photodiodes and other devices powered by an AC/DC converter) involved?
- I'm still not entirely convinced of what the solution is, or that this is the whole picture. On 8 Jan, I disconnected (and then re-connected) the FSS RMTEMP sensor from the Acromag box, to check if the sensor output was noisy or if it was the Acromag. The problem surfaced on Dec 15, when I installed some new electronics in the rack (though none of them were connected to the Acromag directly; the only common point was the Sorensen DCPS). And between 8 Jan and today, the noise RMS has decreased back to the nominal level, without me doing anything to the grounding. How to reconcile this?
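For the time-domain check being asked for here, something like the following would do (channel name is an assumption based on the sensor named above):
ndscope C1:PSL-FSS_RMTEMP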
|
15762 | Wed Jan 13 16:09:29 2021 | Anchal | HowTo | CDS | Acromag wiring investigation
I'm working on a better wiring diagram that takes into account multiple power supplies, how their GND is passed forward to the circuits or sensors using those power supplies and what possible wiring configurations on Acromag would give low noise. I think I have two configurations in mind which I will test and update here with data and better diagrams.
I took some striptool images earlier yesterday. So I'm dumping them here for further comments or inferences. |
Attachment 1: SimpleTestsStriptoolImages.pdf
|
|
15763 | Thu Jan 14 11:46:20 2021 | gautam | Update | CDS | Expansion chassis from LHO
I picked the boxes up this morning. The inventory per Fil's email looks accurate. Some comments:
- They shipped the chassis and mounting parts (we should still get rails to mount these on, they're pretty heavy to just be supported on 4 rack nuts on the front). idk if we still need the two empty chassis that were requested from Rich.
- Regarding the fibers - one of the fibers is pre-2012. These are known to fail (according to Rolf). One of the two that LHO shipped is from 2012 (judging by S/N, I can't find an online lookup for the serial number), the other is 2011. IIRC, Rolf offered us some fibers so we may want to take him up on that. We may also be able to use copper cables if the distances b/w server and expansion chassis are short.
- The units are fitted with a +24V DC input power connector and not the AC power supplies that we have on all the rest of the chassis. This is probably just gonna be a matter of convenience, whether we want to stick to this scheme or revert to the AC input adaptor we have on all the other units. idk what the current draw will be from the Sorensen - I tested that the boards get power, and with no ADCs/DACs/BIOs, the chassis draws ~1A (read off from the DCPS display, not measured with a DMM). ~Half of this is for the cooling fans. It seems like KT @ LLO has offered to ship AC power supplies, so maybe we want to take them up on that offer.
- Without the host-side OSSI PCIe card, timing interface board, and Supermicro servers that actually have enough PCIe slots, we can't run any meaningful test yet. I ran just a basic diagnostic that the chassis can be powered on, and the indicator LEDs and cooling fans run.
- Some photos of the contents are here. The units are stored along the east arm pending installation.
> Koji,
>
> Barebones on this order.
>
> 1. Main PCIe board
> 2. Backplane (Interface board)
> 3. Power Board
> 4. Fiber (One Stop) Interface Card, chassis side only
> 5. Two One Stop Fibers
> 6. No Timing Interface
> 7. No Binary Cards.
> 8. No ADC or DAC cards
>
> Fil Clara
>
|
15764 | Thu Jan 14 12:19:43 2021 | Jon | Update | CDS | Expansion chassis from LHO
That's fine, we didn't actually request those. We bought and already have in hand new PCIe x4 cables for the chassis-host connection. They're 3 m copper cables, which was based on the assumption of the time that host and chassis would be installed in the same rack.
Quote: |
- Regarding the fibers - one of the fibers is pre-2012. These are known to fail (according to Rolf). One of the two that LHO shipped is from 2012 (judging by S/N, I can't find an online lookup for the serial number), the other is 2011. IIRC, Rolf offered us some fibers so we may want to take him up on that. We may also be able to use copper cables if the distances b/w server and expansion chassis are short.
|
|
15765 | Thu Jan 14 12:32:28 2021 | gautam | Update | CDS | Rogue master may be doing something good?
I think the "Rogue Master" setting on the RFM network may be doing some good. 5 mins, ago, all the CDS indicators were green, but I noticed an amber light on the c1rfm screen just now (amber = warning). Seems like at GPS time 1294691182, there was some kind of error on the RFM network. But the network hasn't gone down. I can clear the amber flag by running the global diag reset. Nevertheless, the upgrade of all RT systems to Dolphin should not be de-prioritized I think. |
Attachment 1: Screenshot_2021-01-14_12-35-52.png
|
|
15766 | Fri Jan 15 15:06:42 2021 | Jon | Update | CDS | Expansion chassis from LHO
Koji asked me to assemble a detailed breakdown of the parts received from LHO, which I have done based on the high-res photos that Gautam posted of the shipment.
Parts in hand:
Qty | Part | Note(s)
2 | Chassis body |
2 | Power board and cooling fans | As noted in 15763, these have the standard LIGO +24V input connector which we may want to change
2 | IO interface backplane |
2 | PCIe backplane |
2 | Chassis-side OSS PCIe x4 card |
2 | CX4 fiber cables | These were not requested and are not needed
Parts still needed:
Qty | Part | Note(s)
2 | Host-side OSS PCIe x4 card | These were requested but missing from the LHO shipment
2 | Timing slave | These were not originally requested, but we have recently learned they will be replaced at LHO soon
Issue with PCIe slots in new FEs
Also, I looked into the mix-up regarding the number of PCIe slots in the new Supermicro servers. The motherboard actually has six PCIe slots and is on the CDS list of boards known to be compatible. The mistake (mine) was in selecting a low-profile (1U) chassis that only exposes one of these slots. But at least it's not a fundamental limitation.
One option is to install an external PCIe expansion chassis that would be rack-mounted right above the FE. It is automatically configured by the system BIOS, so doesn't require any special drivers. It also supports hot-swapping of PCIe cards. There are also cheap ribbon-cable riser cards that would allow more cards to be connected for testing, although this is not as great for permanent mounting.
It may still be better to use the machines offered by Keith Thorne from LLO, as they're more powerful anyway. But if there is going to be an extended delay before those can be received, we should be able to use the machines we already have in conjunction with one of these PCIe expansion options. |
15767 | Fri Jan 15 16:54:57 2021 | gautam | Update | CDS | Expansion chassis from LHO
Can you please provide a link to this "list of boards"? The only document I can find is T1800302. In that, under "Basic Requirements" (before considering specific motherboards), it is specified that the processor should be clocked @ >3GHz. The 3 new supermicros we have are clocked at 1.7 GHz. X10SRi-F boards are used according to that doc, but the processor is clocked at 3.6 or 3.2 GHz.
Please also confirm that there are no conflicts w.r.t. the generation of PCIe slots, and the interfaces (Dolphin, OSSI) we are planning to use - the new machines we have are "PCIe 2.0" (though I have no idea if this is the same as Gen 2).
Quote: |
The motherboard actually has six PCIe slots and is on the CDS list of boards known to be compatible.
|
As for the CX4 cable - I still think it's good to have these on hand. Not good to be in a situation later where FE and expansion chassis have to be in different racks, and the copper cable can't be used. |
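A quick way to answer both questions directly on the new servers (generic Linux commands, nothing 40m-specific):
lscpu | grep -i 'model name\|mhz'           # processor model and clock
sudo lspci -vv | grep -iE 'LnkCap|LnkSta'   # per-device PCIe capability vs. negotiated link (2.5 GT/s = Gen 1, 5 GT/s = Gen 2, 8 GT/s = Gen 3)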
Attachment 1: Screenshot_2021-01-15_17-00-06.png
|
|
15770 | Tue Jan 19 13:19:24 2021 | Jon | Update | CDS | Expansion chassis from LHO
Indeed T1800302 is the document I was alluding to, but I completely missed the statement about >3 GHz speed. There is an option for 3.4 GHz processors on the X10SRi-F board, but back in 2019 I chose against it because it would double the cost of the systems. At the time I thought I had saved us $5k. Hopefully we can get the LLO machines in the near term---but if not, I wonder if it's worth testing one of these to see whether the performance is tolerable.
Quote: |
Can you please provide a link to this "list of boards"? The only document I can find is T1800302....
|
I confirm that PCIe 2.0 motherboards are backwards compatible with PCIe 1.x cards, so there's no hardware issue. My main concern is whether the obsolete Dolphin drivers (requiring linux kernel <=3.x) will work on a new system, albeit one running Debian 8. The OSS PCIe card is automatically configured by the BIOS, so no external drivers are required for that one.
Quote: |
Please also confirm that there are no conflicts w.r.t. the generation of PCIe slots, and the interfaces (Dolphin, OSSI) we are planning to use - the new machines we have are "PCIe 2.0" (though I have no idea if this is the same as Gen 2).
|
|
15771 | Tue Jan 19 14:05:25 2021 | Jon | Configuration | CDS | Updated CDS upgrade plan
I've produced updated diagrams of the CDS layout, taking the comments in 15476 into account. I've also converted the 40m's diagrams from Omnigraffle ($150/license) to the free, cloud-based platform draw.io. I had never heard of draw.io, but I found that it has almost all of the same functionality. It also integrates nicely with Google Drive.
Attachment 1: The planned CDS upgrade (2 new FEs, fully replace RFM network with Gen 1 Dolphin IPC)
Attachment 2: The current 40m CDS topology
The most up-to-date diagrams are hosted at the following links:
Please send me any further corrections or omissions. Anyone logged in with LIGO.ORG credentials can also directly edit the diagrams. |
Attachment 1: 40m_CDS_Network_-_Planned.pdf
|
|
Attachment 2: 40m_CDS_Network_-_Current.pdf
|
|
15772 | Tue Jan 19 15:43:24 2021 | gautam | Configuration | CDS | Updated CDS upgrade plan
Not sure if 1Y1 can accommodate both c1sus2 and c1bhd as well as the various electronics chassis that will have to be installed. There may need to be some distribution between 1Y1 and 1Y3. Does Koji's new wiring also specify which racks hold which chassis?
Some minor improvements to the diagram:
- The GPS receiver in 1X7 should be added. All the timing in the lab is synced to the 1pps from this.
- We should add hyperlinks to the various parts datasheets (e.g. Dolphin switch, RFM switch, etc etc) so that the diagram will be truly informative and self-contained.
- Megatron and nodus, but especially chiara (NFS server), should be added to the diagram.
|
15775 | Wed Jan 20 19:12:16 2021 | gautam | Update | CDS | SLwebview fixed
After fixing multiple issues, the model webviews are updating; they should be done by tomorrow. It should be obvious from the timestamps on the index page which ones are new. These new screens are better than the old ones and offer more info/details. Please look at them, and let me know if there are broken links etc. Once we are happy with this new webview, we can archive the old files and clean up the top directory a bit. I don't think this adds anything to the channel accounting effort, but it's a nice thing to have up-to-date webviews - I found the LLO ones really useful in setting up the AS WFS model.
BTW, the crontab on megatron is set to run every day at 0844. The process of updating the models is pretty heavy because of the MATLAB overhead. Do we really need to have this run every day? I modified the crontab to run every other Saturday, and we can manually run the update when we modify a model. Considering this hasn't been working for ~3 years, I think this is fine, but if anyone has a strong preference you can edit the crontab.
If somebody can volunteer to fix the MEDM screenshot, that would be useful. |
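As a concrete sketch of the new schedule (the script path is hypothetical - use whatever the existing crontab entry on megatron actually points to; the epoch-week parity test is just one way to get "every other Saturday", and the % signs must be escaped inside a crontab):
# old: 44 8 * * * /path/to/SLwebview_update.sh
44 8 * * 6  [ $(( $(date +\%s) / 604800 \% 2 )) -eq 0 ] && /path/to/SLwebview_update.sh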
15778 | Tue Jan 26 12:59:51 2021 | Anchal | HowTo | CDS | Acromag wiring investigation
Taking inspiration from how the SR785 reads differential signals, I figured that the Acromag also always needs a way to return current through the RTN ports. That must be why everything goes haywire when RTN is not connected to IN-. For single-ended signals, we can always short RTN to IN- and keep the same GND, but then we need to be careful to avoid ground loops. I'm going to post a wiring diagram in the next post to show how, if two signal sources are connected to each other separately, a GND loop can be formed when we tie each IN- port to RTN on an Acromag.
Coming to the issue of reading a differential signal, what the SR785 does is connect a 50 Ohm resistance between Earth GND and the differential signal shields (which are supposed to be signal GND). In its floating-GND setting, the SR785 instead connects a 1 MOhm resistor between the input shield and Earth GND. This allows a differential signal to be read through a single BNC cable, since the shield can take arbitrary voltages thanks to the 1 MOhm resistor.
We can do the same with the Acromag. Instead of shorting RTN to the IN- ports, we can connect them through a large resistor, which lets IN- float but still gives the current a return path through the RTN ports. Attached are a few scenarios where I connected IN- to RTN through a wire, 820 Ohms, 10 kOhms, and 1 MOhm, in two sub-cases where RTN was either left open or shorted to Earth GND. In all cases, the signal was produced by a 9V battery outputting roughly 8.16 V. It seems that a 10 kOhm resistor between RTN and IN-, with RTN connected to Earth GND, is the best scenario noise-wise. I'll post more results and a wiring diagram soon. |
Attachment 1: TestingDifferentialSignalWithFloatingRTNwrtIN-.pdf
|
|
15779 | Tue Jan 26 15:37:25 2021 | Anchal | HowTo | CDS | Acromag wiring investigation
Here I present a few wiring diagrams for using the Acromag that avoid noisy behavior and ground loops.
Case 1: Only single-ended sources
- Attachment 1 gives a functioning wiring diagram when all sources are single ended.
- One should always short RTN to the IN- pin if the particular GND carried by that signal has not already been shorted to RTN for some other signal.
- So care is required to mark the different GNDs of the different power supplies separately and track where they inadvertently get shorted, for example when a photodiode output is connected to the FSS box.
- Acromag should serve as the node of all GNDs concerned and all these grounds must not be connected to Earth GND at power supply ends or in any of the signal sources.
- I think this is a bit complicated to do.
Case 2: Some single and some differential sources
- Connect all single ended sources same as above keeping care of not building any ground loops.
- The differential source can be connected to IN+ and IN- pins, but there should be a high resistance path between IN- and RTN. See Attachment 2.
- I'm not sure why this is the case, since I could not get access to the internal wiring of the Acromag (no response from them), but I have found it empirically.
- This lets IN- float with respect to RTN according to the negative signal value. I've found that a 10 kOhm resistance works well. See 40m/15778.
- Shorting RTN to a nearby Earth GND (assuming none of the other power supply GNDs have been shorted to Earth GND) shows a reduction in noise for the differential input, so this is recommended.
- This wiring diagram carries all the complexity of the previous case, along with the fact that RTN and anything connected to it is now at Earth GND.
Case 3: Signal agnostic wiring
- Attachment 3 gives a wiring diagram which mimics the high resistance shorting of RTN to IN- in all ports regardless of the kind of signal it is used for reading.
- In this case, instead of being the node of the star configuration for GND, acromag gets detached from any ground loops.
- All differences in various GNDs would be kept by the power supplies driving small amounts of current through the 10 kOhm resistors.
- This is a much simpler wiring diagram as it avoids shorting various signal sources or their GNDs with each other, avoiding some of the ground loops.
Edit Wed Jan 27 13:38:19 2021 :
This solution is not acceptable either. Even if it succeeds in reading the value, connecting a resistor between IN- and RTN does not break the ground loops, so that issue persists. Further, connecting IN- to RTN breaks the symmetry between IN- and IN+, and hence reduces the common-mode rejection that is the whole point of a differential signal anyway. I'll work more on this to find a way to read differential signals without connecting IN- and RTN. My first guess is that it would need the GND on the source end to be connected to Earth GND, and the RTN on the Acromag end to be connected to Earth GND as well.
|
Attachment 1: GeneralLabWiring.pdf
|
|
Attachment 2: GeneralLabWiring2.pdf
|
|
Attachment 3: GeneralLabWiring3.pdf
|
|
15785 | Fri Jan 29 17:57:17 2021 | Anchal | HowTo | CDS | Acromag wiring investigation
I found a white paper from Acromag which discusses how to read differential signals using Acromag units. The document categorically says that differential signals are always supposed to be transmitted over three wires. It provides two options: either use the RTN to connect to the signal ground (as done in Attachment 3), or locally place 10k-100k resistors between the return and both IN+ and IN- (Attachment 2).
I have provided possible scenarios for these.
Using two wires to carry differential signal (Attachment 1):
- I assume this is our preferred way to connect.
- We can also treat all single-ended inputs as differential and do signal-condition-agnostic wiring.
- Attachment 3 shows the results for different resistor values when a 2 Hz, 0.5 V amplitude signal from the SR785, converted to a differential signal using D1900068, was measured by the Acromag.
- The connection to RTN is symmetrical for both inputs.
Using three wires to carry differential signal (Attachment 2):
- This is the method recommended by the document, which asks us to carry the GND from the signal source and connect it to RTN.
- If we use this, we'll have to be very cautious about which GNDs get shorted together through the Acromag RTN terminals.
- This would probably create a lot of opportunities for ground loops to form.
Using an Acromag card without making any connection to RTN is basically not allowed as per this document. |
Attachment 1: GeneralLabWiringDiffWith2Wires.pdf
|
|
Attachment 2: GeneralLabWiringDiffWith3Wires.pdf
|
|
Attachment 3: DiffReadResistorbtwnINandRTN.pdf
|
|
15791 | Tue Feb 2 23:29:35 2021 | Koji | Update | CDS | CDS crash and CDS/IFO recovery
I worked around the racks and the feedthru flanges this afternoon and evening. This inevitably crashed the c1lsc real-time processes.
Rebooting c1lsc caused multiple crashes (as usual) and I had to hard reboot c1lsc/c1sus/c1ioo.
This made the "DC" indicator of the IOPs for these hosts **RED**.
This looked like the usual timing issue. It looked like "ntpdate" is not available in the new system. (When was it updated?)
The hardware clocks (RTC) of these hosts were set to PST, while the functional end host showed UTC. So I copied the UTC time from the end machine to the vertex machines.
For the time adjustment, the standard "date" command was used
> sudo date -s "2021-02-03 07:11:30"
This did the trick. Once the IOPs were restarted, the "DC" indicators returned to **Green**; restarting the other processes was straightforward, and now the CDS indicators are all green.
controls@c1iscex:~ 0$ timedatectl
Local time: Wed 2021-02-03 07:35:12 UTC
Universal time: Wed 2021-02-03 07:35:12 UTC
RTC time: Wed 2021-02-03 07:35:26
Time zone: Etc/UTC (UTC, +0000)
NTP enabled: yes
NTP synchronized: no
RTC in local TZ: no
DST active: n/a
NTP synchronization is not active. Is this OK?
With the recovered CDS, the IMC was immediately locked and the autolocker started to function after a few pokes (like manually running the "mcup" script). However, I didn't see any light on the AS/REFL cameras or on the test mass faces. I'm sure the IMC alignment is OK. This means the TTs are not well aligned.
So, I burtrestored c1assepics with the 12:19 snapshot. This immediately brought the spots back on REFL/AS.
Then the arms were aligned, locked, and ASSed. I tried to lock the FP arms: the transmissions were at the level of 0.1~0.3, so some manual alignment of ITMY and BS was necessary. Even after getting TRs of ~0.8, I still could not lock the arms. The signs of the servo gains were flipped to -0.143 for the X arm and -0.012 for the Y arm, and then the arms locked. ASS worked well and the ASS offsets were offloaded to the SUSs.
|
15792 | Wed Feb 3 15:24:52 2021 | gautam | Update | CDS | CDS crash and CDS/IFO recovery
Didn't get a chance to comment during the meeting - this was almost certainly a coincidence. I have never had to do this - I assert, based on the ~10 lab-wide reboots I have had to do in the last two years, that whether the timing errors persist on reboot or not is not deterministic. But this is beyond my level of CDS knowledge and so I'm happy for Rolf / Jamie to comment. I use the reboot script - if that doesn't work, I use it again until the systems come back without any errors.
Quote: |
This looked like the usual timing issue. It looked like "ntpdate" is not available in the new system. (When was it updated?)
The hardware clocks (RTC) of these hosts were set to PST, while the functional end host showed UTC. So I copied the UTC time from the end machine to the vertex machines.
For the time adjustment, the standard "date" command was used
> sudo date -s "2021-02-03 07:11:30"
This did the trick. Once the IOPs were restarted, the "DC" indicators returned to **Green**; restarting the other processes was straightforward, and now the CDS indicators are all green.
|
I don't think this is a problem; the NTP synchronization is handled by timesyncd now.
Quote: |
NTP synchronization is not active. Is this OK?
|
I defer restoring the LSC settings etc., since I don't expect any interferometer activity for a while. |
15794 | Wed Feb 3 18:53:31 2021 | Koji | Update | CDS | CDS crash and CDS/IFO recovery
Really!? I didn't reboot the machines between "sudo date" and "rtcds start c1x0*". I tried rtcds; if it didn't work, I used date, then tried rtcds again (repeat). The time was not synched at all - both the time zones and the time itself were off. There was a 1~3 sec offset besides the TZ problem.
|
15795 | Wed Feb 3 21:28:02 2021 | gautam | Update | CDS | CDS crash and CDS/IFO recovery
I am just reporting my experience - this may be a new failure mode, but I don't think so. In the new RTCDS, the NTP server for the FEs is the FB, to which they are synced by timesyncd. The FB machine itself has the ntpd service installed, and so is capable of syncing to an NTP server over the internet while also serving as an NTP server for the FEs. The timesyncd daemon may not have started correctly, or the NTP serving from FB got interrupted (for example), but that's all just speculation. |
15833 | Mon Feb 22 14:53:47 2021 | gautam | Update | CDS | dtt-ezca-tools installed on rossa
The default cds-crtools didn't come with some of the older ezcautils (like ezcaread, ezcawrite etc.). These are now packaged for Debian, so I installed them with sudo apt update && sudo apt install dtt-ezca-tools on rossa. Now we don't have to needlessly substitute the commands in our old shell scripts with the more modern z read, z write etc.
I am wondering if there is an implicit relative minus sign between the z servo and ezcaservo commands... |
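Example of the substitution being referred to (channel names are just placeholders):
ezcaread C1:LSC-TRY_OUT16        # old ezcautils style
z read C1:LSC-TRY_OUT16          # newer cdsutils style
ezcawrite C1:LSC-TRIG_THRESH 0.1
z write C1:LSC-TRIG_THRESH 0.1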
15842 | Wed Feb 24 22:13:47 2021 | Jon | Update | CDS | Planning document for front-end testing
I've started writing up a rough testing sequence for getting the three new front-ends operational (c1bhd, c1sus2, c1ioo). Since I anticipate this plan undergoing many updates, I've set it up as a Google doc which everyone can edit (log in with LIGO.ORG credentials).
Link to planning document
Please have a look and add any more tests, details, or concerns. I will continue adding to it as I read up on CDS documentation. |
15843 | Thu Feb 25 14:30:05 2021 | gautam | Update | CDS | new c1bhd setup - diskless boot
I've set up one 2U server unit received from KT in the control room area (the fans in it are pretty loud, but other than that, no major issues with it being tested here). The IPMI interface is enabled and the machine is also hooked up to the martian network for diskless boot (usual login creds). I also installed a Dolphin interface card and the One Stop Systems host-side card, and both show some signs of life (the OSSI card has the "PWR" LED on; the Dolphin card shows up in the list of PCIe devices, but has no LEDs on at the moment). However, I can't actually find the OSSI card in the list of PCI devices - maybe I'm not looking for the right device ID, or it needs the cable to the I/O chassis side to be connected before it is recognized. Anyways, let the testing begin.
The machine previously christened c1bhd has been turned off and completely disconnected from the martian network (though I didn't bother removing it from the rack for now).
BTW - I think most of our 19" racks are deformed from years of loading - I spent 5 mins trying to install the rails (at 1Y1 and 1X7) to mount the Supermicro on, and couldn't manage it. I could be missing some technique/trick, but I don't think so. |
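The check behind "shows up in the list of PCIe devices" above, for anyone repeating this (the Dolphin adapter's vendor string is a guess - it appears as "Stargen" in the lspci output posted later in 15872):
lspci | grep -i stargen    # Dolphin host adapter
sudo lspci -tv             # full tree, to look for the OSS host card and anything behind it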
15872 | Fri Mar 5 17:48:25 2021 | Jon | Update | CDS | Front-end testing
Today I moved the c1bhd machine from the control room to a new test area set up behind (west of) the 1X6 rack. The test stand is pictured in Attachment 1. I assembled one of the new IO chassis and connected it to the host.
I/O Chassis Assembly
- LIGO-style 24V feedthrough replaced with an ATX 650W switching power supply
- Timing slave installed
- Contec DO-1616L-PE card installed for timing control
- One 16-bit ADC and one 32-channel DO module were installed for testing
The chassis was then powered on and LED lights illuminated indicating that all the components have power. The assembled chassis is pictured in Attachment 2.
Chassis-Host Communications Testing
Following the procedure outlined in T1900700, the system failed the very first test of the communications link between chassis and host, which is to check that all PCIe cards installed in both the host and the expansion chassis are detected. The Dolphin host adapter card is detected:
07:06.0 PCI bridge: Stargen Inc. Device 0102 (rev 02) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0
Bus: primary=07, secondary=0e, subordinate=0e, sec-latency=0
I/O behind bridge: 00002000-00002fff
Prefetchable memory behind bridge: 00000000c0200000-00000000c03fffff
Capabilities: [40] Power Management version 2
Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [60] Express Downstream Port (Slot+), MSI 00
Capabilities: [80] Subsystem: Device 0000:0000
Kernel driver in use: pcieport
However the OSS PCIe adapter card linking the host to the IO chassis was not detected, nor were any of the cards in the expansion chassis. Gautam previously reported that the OSS card was not detected by the host (though it was not connected to the chassis then). Even now connected to the IO chassis, the card is still not detected. On the chassis-side OSS card, there is a red LED illuminated indicating "HOST CARD RESET" as pictured in Attachment 3. This may indicate a problem with the card on the host side. Still more debugging to be done. |
Attachment 1: image_67203585.JPG
|
|
Attachment 2: image_67216641.JPG
|
|
Attachment 3: image_17185537.JPG
|
|
15890 | Tue Mar 9 16:52:47 2021 | Jon | Update | CDS | Front-end testing
Today I continued with assembly and testing of the new front-ends. The main progress is that the IO chassis is now communicating with the host, resolving the previously reported issue.
Hardware Issues to be Resolved
Unfortunately, though, it turns out one of the two (host-side) One Stop Systems PCIe cards sent from Hanford is bad. After some investigation, I ultimately resolved the problem by swapping in the second card, with no other changes. I'll try to procure another from Keith Thorne, along with some spares.
Also, two of the three switching power supplies sent from Livingston (250W Channel Well PSG400P-89) appear to be incompatible with the Trenton BPX6806 PCIe backplanes in these chassis. The power supply cable has 20 conductors and the connector on the board has 24. The third supply, a 650W Antec EA-650, does have the correct cable and is currently powering one of the IO chassis. I'll confirm this situation with Keith and see whether they have any more Antecs. If not, I think these supplies can still be bought (not obsolete).
I've gone through all the hardware we've received and checked it against the procurement spreadsheet. There are still some missing items:
- 18-bit DACs (Qty 14; but 7 are spares)
- ADC adapter boards (Qty 5)
- DAC adapter boards (Qty 9)
- 32-channel DO modules (Qty 2/10 in hand)
Testing Progress
Once the PCIe communications link between host and IO chassis was working, I carried out the testing procedure outlined in T1900700. This performs a series of checks to confirm basic operation/compatibility of the hardware and PCIe drivers. All of the cards installed in both the host and the expansion chassis are detected and appear correctly configured, according to T1900700. In the below tree, there is one ADC, one 16-ch DIO, one 32-ch DO, and one DolphinDX card:
+-05.0-[05-20]----00.0-[06-20]--+-00.0-[07-08]----00.0-[08]----00.0 Contec Co., Ltd Device 86e2
| +-01.0-[09]--
| +-03.0-[0a]--
| +-08.0-[0b-15]----00.0-[0c-15]--+-02.0-[0d]--
| | +-03.0-[0e]--
| | +-04.0-[0f]--
| | +-06.0-[10-11]----00.0-[11]----04.0 PLX Technology, Inc. PCI9056 32-bit 66MHz PCI <-> IOBus Bridge
| | +-07.0-[12]--
| | +-08.0-[13]--
| | +-0a.0-[14]--
| | \-0b.0-[15]--
| \-09.0-[16-20]----00.0-[17-20]--+-02.0-[18]--
| +-03.0-[19]--
| +-04.0-[1a]--
| +-06.0-[1b]--
| +-07.0-[1c]--
| +-08.0-[1d]--
| +-0a.0-[1e-1f]----00.0-[1f]----00.0 Contec Co., Ltd Device 8632
| \-0b.0-[20]--
\-08.0-[21-2a]--+-00.0 Stargen Inc. Device 0101
\-00.1-[22-2a]--+-00.0-[23]--
+-01.0-[24]--
+-02.0-[25]--
+-03.0-[26]--
+-04.0-[27]--
+-05.0-[28]--
+-06.0-[29]--
\-07.0-[2a]--
Standalone Subnet
Before I start building/testing RTCDS models, I'd like to move the new front ends to an isolated subnet. This is guaranteed to prevent any contention with the current system, or inadvertent changes to it.
Today I set up another of the Supermicro servers sent by Livingston in the 1X6 test stand area. The intention is for this machine to run a cloned, bootable image of the current fb1 system, allowing it to function as a bootserver and DAQ server for the FEs on the subnet.
However, this hard disk containing the fb1 image appears to be corrupted and will not boot. It seems to have been sitting disconnected in a box since ~2018, which is not a stable way to store data long term. I wasn't immediately able to recover the disk using fsck. I could spend some more time trying, but it might be most time-effective to just make a new clone of the fb1 system as it is now. |
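If we do end up re-cloning fb1, a minimal offline clone sketch (device names are placeholders - identify them with lsblk first, and double-check before running anything destructive like this):
lsblk
sudo dd if=/dev/sdX of=/dev/sdY bs=64M status=progress conv=fsync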
Attachment 1: image_72192707.JPG
|
|
15904 | Thu Mar 11 14:27:56 2021 | gautam | Update | CDS | timesync issue?
I have recently been running into the 4 MB/s data rate limit on testpoints - basically, I can't run the DTT TF and spectrum measurements that I used to run while locking the interferometer, which I certainly could this time last year. AFAIK, the major modification made was the addition of 4 DQ channels for the in-air BHD experiment - assuming the data is transmitted as double-precision numbers, I estimate the additional load due to this change was ~500 KB/s. Probably there is some compression, so it is a bit more efficient (as this naive calc would suggest we can only record 32 channels, and I counted 41 full-rate channels in the model), but still, I can't think of anything else that has changed. Anyway, I removed the unused parts and recompiled/re-installed the models (c1lsc and c1omc). Holding off on a restart until I decide I have the energy to recover the CDS system from the inevitable crash. For documentation, I'm also attaching a screenshot of the schematic of the changes made.
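(For the record, the arithmetic behind those numbers, assuming the DQ channels are acquired at 16384 Hz as doubles: 4 ch x 16384 samples/s x 8 bytes ≈ 0.5 MB/s of extra load, and 4 MB/s / (16384 x 8 bytes/s per channel) ≈ 32 channels total.)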
Anyway, the main point of this elog is that at the compilation stage, I got a warning I've never seen before:
Building front-end Linux kernel module c1lsc...
make[1]: Warning: File 'GNUmakefile' has modification time 13 s in the future
make[1]: warning: Clock skew detected. Your build may be incomplete.
This prompted me to check the system time on c1lsc and FB - you can see there is a 1 minute offset (it is not a delay in me issuing the command to the two machines)! I am suspecting this NTP action is the reason. So maybe a model reboot is in order. Sigh |
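For completeness, the recompile/re-install sequence referred to above is the standard one (subcommand names can differ slightly between RCG versions - check rtcds --help on the machine used for building):
rtcds make c1lsc && rtcds install c1lsc
rtcds make c1omc && rtcds install c1omc
# restart (or run the full reboot script) at a convenient time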
Attachment 1: timesync.png
|
|
Attachment 2: c1lscMods.png
|
|
15905 | Thu Mar 11 18:46:06 2021 | gautam | Update | CDS | cds reboot
Since Koji was in the lab I decided to bite the bullet and do the reboot. I've modified the reboot script - now, it prompts the user to confirm that the times recognized by the FEs are the same (use the IOP model's status screen; the GPSTIME is updated live in the upper right hand corner). So you would do sudo date --set="Thu 11 Mar 2021 06:48:30 PM UTC" for example, and then restart the IOP model. Why is this necessary? Who knows. It seems to be a deterministic way of getting things back up and running for now, so we have to live with it. I will note that this was not a problem between 2017 and Oct 2020, during which time I ran the reboot script >10 times without needing to take this step. But things change (for an as of yet unknown reason) and we must adapt. Once the IOPs all report a green "DC" status light on the CDS overview screen, you can let the script take you the rest of the way again.
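A slightly less error-prone version of that manual step, pulling the time directly from the FB instead of typing it by hand (the hostname and the use of ssh here are assumptions; it is the same date --set trick as above):
sudo date -u --set="$(ssh fb1 "date -u '+%Y-%m-%d %H:%M:%S'")"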
The main point of this work was to relax the data rate on the c1lsc model, and this worked. It now registers ~3.2 MB/s, down from the ~3.8 MB/s earlier today. I can now measure 2 loop TFs simultaneously. This means that we should avoid adding any more DQ channels to the c1lsc model (without some adjustment/downsampling of others).
Quote: |
Holding off on a restart until I decide I have the energy to recover the CDS system from the inevitable crash.
|
|
Attachment 1: CDSoverview.png
|
|
15910 | Fri Mar 12 03:28:51 2021 | Koji | Update | CDS | cds reboot
I want to emphasize the followings:
- FB's RTC (real-time clock/motherboard clock) is synchronized to NTP time.
- The other RT machines are not synchronized to NTP time.
- My speculation is that the RT machine timestamp is produced from the local RT machine time because there is no IRIG-B signal provided. -> If the RTC is more than 1 sec off from the FB time, the timestamp inconsistency occurs. Then it induces "DC" error.
- For now, we have to use "date" command to match the local RTC time to the FB's time.
- So: If NTP synchronization is properly implemented for the RT machines, we will be free from this silly manual time adjustment.
|
15911 | Fri Mar 12 11:02:38 2021 | gautam | Update | CDS | cds reboot
I looked into this a bit this morning. I forgot exactly what time we restarted the machines, but looking at the timesyncd logs, it appears that the NTP synchronization is in fact working (log below is for c1sus, similar on other FEs):
-- Logs begin at Fri 2021-03-12 02:01:34 UTC, end at Fri 2021-03-12 19:01:55 UTC. --
Mar 12 02:01:36 c1sus systemd[1]: Starting Network Time Synchronization...
Mar 12 02:01:37 c1sus systemd-timesyncd[243]: Using NTP server 192.168.113.201:123 (ntpserver).
Mar 12 02:01:37 c1sus systemd[1]: Started Network Time Synchronization.
Mar 12 02:02:09 c1sus systemd-timesyncd[243]: Using NTP server 192.168.113.201:123 (ntpserver).
So, the service is doing what it is supposed to (using the FB, 192.168.113.201, as the ntpserver). You can see that the timesync was done a couple of seconds after the machine booted up (validated against "uptime"). Then, the service is periodically correcting drifts. idk what it means that the time wasn't in sync when we check the time using timedatectl or similar. Anyway, like I said, I have successfully rebooted all the FEs without having to do this weird time adjustment >10 times.
I guess what I am saying is, I don't know what action is necessary for "implementing NTP synchronization properly", since the diagnostic logfile seems to indicate that it is doing what it is supposed to.
More worryingly, the time has already drifted in < 24 hours.
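For reference, the checks quoted above can be reproduced on any FE with standard systemd tools (nothing 40m-specific):
timedatectl                                     # summary, including the NTP sync flags
journalctl -u systemd-timesyncd --since today   # the log excerpt above comes from this service's journal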
Quote: |
I want to emphasize the followings:
- FB's RTC (real-time clock/motherboard clock) is synchronized to NTP time.
- The other RT machines are not synchronized to NTP time.
- My speculation is that the RT machine timestamp is produced from the local RT machine time because there is no IRIG-B signal provided. -> If the RTC is more than 1 sec off from the FB time, the timestamp inconsistency occurs. Then it induces "DC" error.
- For now, we have to use "date" command to match the local RTC time to the FB's time.
- So: If NTP synchronization is properly implemented for the RT machines, we will be free from this silly manual time adjustment.
|
|
Attachment 1: timeDrift.png
|
|
15924 | Tue Mar 16 16:27:22 2021 | Jon | Update | CDS | Front-end testing
Some progress today towards setting up an isolated subnet for testing the new front-ends. I was able to recover the fb1 backup disk using the Rescatux disk-rescue utility and successfully booted an fb1 clone on the subnet. This machine will function as the boot server and DAQ server for the front-ends under test. (None of these machines are connected to the Martian network or, currently, even the outside Internet.)
Despite the success with the framebuilder, front-ends cannot yet be booted locally because we are still missing the DHCP and FTP servers required for network booting. On the Martian net, these processes are running not on fb1 but on chiara. And to be able to compile and run models later in the testing, we will need the contents of the /opt/rtcds directory also hosted on chiara.
For these reasons, I think it will be easiest to create another clone of chiara to run on the subnet. There is a backup disk of chiara and I attempted to boot it on one of the LLO front-ends, but without success. The repair tool I used to recover the fb1 disk does not find a problem with the chiara disk. However, the chiara disk is an external USB drive, so I suspect there could be a compatibility problem with these old (~2010) machines. Some of them don't even recognize USB keyboards pre-boot-up. I may try booting the USB drive from a newer computer.
Edit: I removed one of the new, unused Supermicros from the 1Y2 rack and set it up in the test stand. This newer machine is able to boot the chiara USB disk without issue. Next time I'll continue with the networking setup. |
15925 | Tue Mar 16 19:04:20 2021 | gautam | Update | CDS | Front-end testing
Now that I think about it, I may only have backed up the root file system of chiara, and not /home/cds/ (symlinked to /opt/ over NFS). I think we never revived the rsync backup to LDAS after the FB fiasco of 2017, else that'd have been the most convenient way to get files. So you may have to resort to some other technique (e.g. configure the second network interface of the chiara clone to be on the martian network and copy over files to the local disk, and then disconnect the chiara clone from the martian network if we really want to keep this test stand completely isolated from the existing CDS network) - the /home/cds/ directory is rather large IIRC, but with 2TB on the FB clone, you may be able to get everything needed to get the rtcds system working. It may then be necessary to hook up a separate disk to write frames to if you want to test that part of the system out.
Good to hear the backup disk was able to boot though!
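A hedged sketch of that copy-over (hostnames, interface setup and paths are assumptions; run from the chiara clone once its second NIC is up on the martian network, and note /home/cds and /opt/rtcds may be the same tree via symlinks, so copy whichever is the real path):
rsync -aHx --info=progress2 chiara:/opt/rtcds/ /opt/rtcds/
rsync -aHx --info=progress2 chiara:/home/cds/  /home/cds/   # large - check free space on the clone first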
Quote: |
And to be able to compile and run models later in the testing, we will need the contents of the /opt/rtcds directory also hosted on chiara.
For these reasons, I think it will be easiest to create another clone of chiara to run on the subnet. There is a backup disk of chiara and I attempted to boot it on one of the LLO front-ends, but without success.
|
|