ID | Date | Author | Type | Category | Subject |
13297 | Tue Sep 5 23:02:37 2017 |
gautam | Update | CDS | slow machine bootfest | MC autolocker was not working - PCdrive was railed at its upper rail for ~2 hours judging by the wall StripTool trace. I tried restarting the init processes on megatron, but that didn't fix the problem. The reason seems to have been related to c1iool0 failing - after keying the crate, autolocker came back fine and MC caught lock almost immediately.
Additionally, c1susaux, c1auxex, c1auxey and c1iscaux are also down. I'm not planning on using the IFO tonight, so I am not going to reboot these now.
|
13312 | Fri Sep 15 15:54:28 2017 |
gautam | Update | CDS | FB wiper script | A wiper script is not yet set up for our new frame builder. The disk usage is ~80% now, so I think we should start running a wiper script that manages overall disk usage by deleting old frame files.
From what I could find on the elog, this was previously done by running a cron job on FB. There is a Perl script, /opt/rtcds/caltech/c1/target/fb/wiper.pl, which, from what I could understand, runs a bunch of du commands on different directories to determine whether any files need to be deleted.
I copied this script over to /opt/rtcds/caltech/c1/target/daqd/wiper.pl. This is the directory in which all the new FB stuff resides. Conveniently, the script has a "dry-run" option, which I tried running on FB1. However, I get the following error message:
Fri Sep 15 15:44:45 PDT 2017
Dry run, will not remove any files!!!
You need to rerun this with --delete argument to really delete frame files
Directory disk usage:
/frames/trend/minute_rawk
Combined 0k or 0m or 0Gb
Illegal division by zero at ./wiper.pl line 98.
So it would seem that for some reason, the du commands aren't working. From what I could tell, there aren't any directory paths specific to the old FB machine that need to be changed. I believe the script was working prior to the FB disk crash - unfortunately it doesn't look like this script was under version control, but I don't think any changes have been made to it.
Before I go down a Perl rabbit hole, has anyone seen such an error or is aware of some reason why this might not work on the new FB? Am I even using the correct scripts? |
13317 | Mon Sep 18 17:17:49 2017 |
gautam | Update | CDS | FB wiper script | After trying to debug this issue using the Perl debugger, I concluded that the problem is in the part of the code that splits the output of the "du" command into directory and disk usage. For whatever reason, this isn't working. The version of Perl running on the new FB1 machine is 5.20.2, whereas I suspect the version running on the old FB machine was 5.14.2 (which is the version on all the Ubuntu 12 workstations and megatron). Unclear whether downgrading the Perl version is the right way to go.
The FB1 disk is now getting close to full; usage is up to 85% today.
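For reference, the operation the script has to do at this point - run du on each frame directory and split the output into a size and a directory - can be illustrated at the shell (a bash/awk sketch of the idea, not the actual Perl code):
# illustration only: run du on the frame directories and split each output line
# into a size field and a directory field (which is what the Perl script does internally)
for d in /frames/trend/minute_raw /frames/trend/minute /frames/trend/second /frames/full; do
    du -sk "$d"
done | awk '{size[$2]=$1; total+=$1} END {for (d in size) print d, size[d] "k"; print "Combined " total "k"}'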
Quote: |
Before I go down a Perl rabbit hole, has anyone seen such an error or is aware of some reason why this might not work on the new FB? Am I even using the correct scripts?
|
|
13318 | Mon Sep 18 17:30:54 2017 |
Chris | Update | CDS | FB wiper script | Attached is the version of the wiper script we use on the CryoLab cymac. It works with perl v5.20.2. Is this different from what you have? |
Attachment 1: wiper.pl
|
#!/usr/bin/perl
use File::Basename;
print "\n" . `date` . "\n";
# Dry run, do not delete anything
$dry_run = 1;
if ($ARGV[0] eq "--delete") { $dry_run = 0; }
print "Dry run, will not remove any files!!!\n" if $dry_run;
... 184 more lines ...
|
13319 | Mon Sep 18 17:51:26 2017 |
gautam | Update | CDS | FB wiper script | It is a little different - specifically, the way it splits the output of the "du" command into disk usage and directory differs (see Attachment #1). Apart from this, some of the parameters (e.g. what percentage to keep free) are different.
I changed the percentages to match what we had here, and edited a couple of other lines to print out the files that will be deleted. The dry run seemed to work okay; it produced the output below. Not sure why "df -h" reports a different use percentage, though...
Since the script seems to be working now, I am going to set it up on FB1's crontab. Thanks Chris!
controls@fb1:/opt/rtcds/caltech/c1/target/daqd 0$ ./wiper.pl
Mon Sep 18 17:47:06 PDT 2017
Dry run, will not remove any files!!!
You need to rerun this with --delete argument to really delete frame files
Directory disk usage:
/frames/trend/minute_raw 47126124k
/frames/trend/minute 22900668k
/frames/trend/second 760359168k
/frames/full 19337278516k
Combined 20167664476k or 19694984m or 19233Gb
/frames size 25097525144k at 80.36%
/frames is below keep value of 85.00%
Will not delete any files
df reported usage 80.36%
controls@fb1:/opt/rtcds/caltech/c1/target/daqd 0$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda4 2.0T 1.7T 152G 92% /
udev 10M 0 10M 0% /dev
tmpfs 13G 177M 13G 2% /run
tmpfs 32G 0 32G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 32G 0 32G 0% /sys/fs/cgroup
/dev/sda2 19G 3.7G 14G 21% /var
/dev/sda1 461M 65M 373M 15% /boot
/dev/sdb1 24T 19T 3.5T 85% /frames
192.168.113.104:/home/cds/rtcds 2.0T 1.6T 291G 85% /opt/rtcds
192.168.113.104:/home/cds/rtapps 2.0T 1.6T 291G 85% /opt/rtapps
tmpfs 6.3G 0 6.3G 0% /run/user/1001
Quote: |
Attached is the version of the wiper script we use on the CryoLab cymac. It works with perl v5.20.2. Is this different from what you have?
|
|
Attachment 1: perlDiff.png
|
|
13320 | Mon Sep 18 18:40:34 2017 |
gautam | Update | CDS | FB wiper script | I did a further check on the wiper script by changing "percent_keep" from 85.0 to 75.0 and running the script in "dry_run" mode again. The script then printed to the console the names of all the files it would delete in order to free up the required amount of space (but didn't actually delete any files, as it was a dry run). The output seemed sensible.
To set up the cron job, I did the following on FB1:
- crontab -e opened up the crontab
- Copied over a script called "wiper.cron" from /opt/rtcds/caltech/c1/target/fb to /opt/rtcds/caltech/c1/target/daqd. This essentially contains instructions to run the wiper script with the --delete flag and write the console output to a log file (a sketch of its contents, based on this description, is included at the end of this entry).
- Added the following line: 33 3 * * * /opt/rtcds/caltech/c1/target/daqd/wiper.cron. So the cron job should be executed at 3:33 AM every day.
- The cron daemon seems to be running - sudo systemctl status cron.service yields the following output:
controls@fb1:~ 0$ sudo systemctl status cron.service
● cron.service - Regular background program processing daemon
Loaded: loaded (/lib/systemd/system/cron.service; enabled)
Active: active (running) since Mon 2017-09-18 18:16:58 PDT; 27min ago
Docs: man:cron(8)
Main PID: 30183 (cron)
CGroup: /system.slice/cron.service
└─30183 /usr/sbin/cron -f
Sep 18 18:16:58 fb1 cron[30183]: (CRON) INFO (Skipping @reboot jobs -- not system startup)
Sep 18 18:17:01 fb1 CRON[30205]: pam_unix(cron:session): session opened for user root by (uid=0)
Sep 18 18:17:01 fb1 CRON[30206]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Sep 18 18:17:01 fb1 CRON[30205]: pam_unix(cron:session): session closed for user root
Sep 18 18:25:01 fb1 CRON[30820]: pam_unix(cron:session): session opened for user root by (uid=0)
Sep 18 18:25:01 fb1 CRON[30821]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Sep 18 18:25:01 fb1 CRON[30820]: pam_unix(cron:session): session closed for user root
Sep 18 18:35:01 fb1 CRON[31515]: pam_unix(cron:session): session opened for user root by (uid=0)
Sep 18 18:35:01 fb1 CRON[31516]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Sep 18 18:35:01 fb1 CRON[31515]: pam_unix(cron:session): session closed for user root
- crontab -l on FB1 now shows the following:
controls@fb1:~ 0$ crontab -l
# Edit this file to introduce tasks to be run by cron.
#
# Each task to run has to be defined through a single line
# indicating with different fields when the task will be run
# and what command to run for the task
#
# To define the time you can provide concrete values for
# minute (m), hour (h), day of month (dom), month (mon),
# and day of week (dow) or use '*' in these fields (for 'any').#
# Notice that tasks will be started based on the cron's system
# daemon's notion of time and timezones.
#
# Output of the crontab jobs (including errors) is sent through
# email to the user the crontab file belongs to (unless redirected).
#
# For example, you can run a backup of all your user accounts
# at 5 a.m every week with:
# 0 5 * * 1 tar -zcf /var/backups/home.tgz /home/
#
# For more information see the manual pages of crontab(5) and cron(8)
#
# m h dom mon dow command
33 3 * * * /opt/rtcds/caltech/c1/target/daqd/wiper.cron
Let's see if this works.
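For reference, here is a sketch of what wiper.cron does, based on the description above (the log-file path is an assumption, not the actual contents):
#!/bin/bash
# run the wiper script with --delete and append the console output to a log file
# (sketch only; the actual wiper.cron may differ in detail)
cd /opt/rtcds/caltech/c1/target/daqd
./wiper.pl --delete >> /opt/rtcds/caltech/c1/target/daqd/wiper.log 2>&1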
Quote: |
Since the script seems to be working now, I am going to set it up on FB1's crontab. Thanks Chris!.
|
|
13331 | Tue Sep 26 13:40:45 2017 |
gautam | Update | CDS | NDS2 server restarted on megatron | Gabriele reported problems with the nds2 server again, so I restarted it again.
Update: had to do it again at 17:30 today - unclear why nds2 is so flaky. The log files don't suggest anything obvious to me...
Quote: |
I was unable to download data using nds2. Gabriele had reported similar problems a week ago but I hadn't followed up on this.
I repeated steps 5-7 from elog 13161, and now it seems that I can get data from the nds2 servers again. Unclear why the nds2 server had to be restarted. I wonder if this is somehow related to the mysterious acromag EPICS server tmux session dropout.
|
|
13332 | Tue Sep 26 15:55:20 2017 |
gautam | Update | CDS | 40m files backup situation | Backups of the root filesystems of chiara and nodus are underway right now. I am backing them up to the 1 TB LaCie external hard drives we recently acquired.
I first initialized the drives by hooking them up to my computer and running the setup.app file. After this, plugging the drive into the respective machine and running lsblk, I was able to see the mount point of the external drive. To actually initialize the backup, I ran the following command from a tmux session called ddBackupLaCie:
sudo dd if=/dev/sda of=/dev/sdb bs=64K conv=noerror,sync
Here, /dev/sda is the disk with the root filesystem, and /dev/sdb is the external hard drive. The installed version of dd is 8.13; a progress flag is available from version 8.21 onwards, but I didn't want to go through the exercise of upgrading coreutils on multiple machines, so we just have to wait till the backup finishes.
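(If a progress estimate is needed in the meantime, GNU dd prints its I/O statistics when sent SIGUSR1, so something like the following from another terminal should work without upgrading coreutils:)
# ask the running dd to print the number of bytes copied so far (GNU dd handles SIGUSR1 by printing statistics)
sudo pkill -USR1 -x dd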
We also wanted to do a backup of the root of FB1 - but I'm not sure if dd will work with the external hard drive, because I think it requires the backup disk size (for us, 1TB) to be >= the origin disk size (which on FB1, according to df -h, is 2TB). I'm unsure why the root filesystem of FB is so big; I'm checking with Jamie what we expect it to be. Anyway, we have also acquired 2TB HGST SATA drives, which I will use if the LaCie disks aren't an option.
|
13339 | Thu Sep 28 10:33:46 2017 |
gautam | Update | CDS | 40m files backup situation | After consulting with Jamie, we reached the conclusion that the reason why the root of FB1 is so huge is the way the RAID for /frames is set up. Based on my googling, I couldn't find a way to exclude the nfs stuff while doing a backup using dd, which isn't all that surprising because dd is supposed to make an exact replica of the disk being cloned, including any empty space. So we don't have that flexibility with dd. The advantage of using dd is that if it works, we have a plug-and-play clone of the boot disk and root filesystem which we can use in the event of a hard-disk failure.
- One option would be to stop all the daqd processes, unmount /frames, and then do a dd backup of the true boot disk and root filesystem.
- Another option would be to use rsync to do the backup - this way we can selectively copy the files we want and ignore the nfs stuff (see the sketch after this list). I suspect this is what we will have to do for the second layer of backup we have planned, which will be run as a daily cron job. But I don't think this approach will give us a plug-and-play replacement disk in the event of a disk failure.
- Third option is to use one of the 2TB HGST drives, and just do a dd backup - some of this will be /frames, but that's okay I guess.
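A hedged sketch of what such an rsync-based backup could look like (the destination path and exclude list are illustrative assumptions, not a tested command):
# illustrative rsync backup of the root filesystem, skipping /frames and other runtime mounts;
# the destination mount point is a placeholder
sudo rsync -aAXH --delete \
    --exclude=/frames \
    --exclude=/proc --exclude=/sys --exclude=/dev --exclude=/run --exclude=/tmp \
    / /mnt/rootfs_backup/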
I am trying option 3 now. dd does, however, require that the destination drive size be >= the source drive size - I'm not sure if this is true for the HGST drives. lsblk suggests that the drive size is 1.8TB, while the boot disk, /dev/sda, is 2TB. Let's see if it works.
The backup of chiara is done. I checked that I could mount the external drive at /mnt and access the files. We should still check that we can boot from the LaCie backup disk; we need another computer for that.
The nodus backup is still not complete according to the console - there is no progress indicator, so we just have to wait, I guess.
Quote: |
Backups of the root filesystems of chiara and nodus are underway right now. I am backing them up to the 1 TB LaCie external hard drives we recently acquired.
We also wanted to do a backup of the root of FB1 - but I'm not sure if dd will work with the external hard drive, because I think it requires the backup disk size (for us, 1TB) to be >= origin disk size (which on FB1, according to df -h, is 2TB). Unsure why the root filesystem of FB is so big, I'm checking with Jamie what we expect it to be. Anyways we have also acquired 2TB HGST SATA drives, which I will use if the LaCie disks aren't an option.
|
|
13340 | Thu Sep 28 11:13:32 2017 |
jamie | Update | CDS | 40m files backup situation |
Quote: |
After consulting with Jamie, we reached the conclusion that the reason why the root of FB1 is so huge is because of the way the RAID for /frames is setup. Based on my googling, I couldn't find a way to exclude the nfs stuff while doing a backup using dd, which isn't all that surprising because dd is supposed to make an exact replica of the disk being cloned, including any empty space. So we don't have that flexibility with dd. The advantage of using dd is that if it works, we have a plug-and-play clone of the boot disk and root filesystem which we can use in the event of a hard-disk failure.
- One option would be to stop all the daqd processes, unmount /frames, and then do a dd backup of the true boot disk and root filesystem.
- Another option would be to use rsync to do the backup - this way we can selectively copy the files we want and ignore the nfs stuff. I suspect this is what we will have to do for the second layer of backup we have planned, which will be run as a daily cron job. But I don't think this approach will give us a plug-and-play replacement disk in the event of a disk failure.
- Third option is to use one of the 2TB HGST drives, and just do a dd backup - some of this will be /frames, but that's okay I guess.
|
This is not quite right. First of all, /frames is not NFS. It's a mount of a local filesystem that happens to be on a RAID. Second, the frames RAID is mounted at /frames. If you do a dd of the underlying block device (in this case /dev/sda*), you're not going to copy anything that's mounted on top of it.
What I was saying about /frames is that I believe there is data in the underlying directory /frames that the frames RAID is mounted on top of. In order not to get that in the copy of /dev/sda4, you would need to unmount the frames RAID from /frames and delete everything from the /frames directory. This would not harm the frames RAID at all.
But it doesn't really matter because the backup disk has space to cover the whole thing so just don't worry about it. Just dd /dev/sda to the backup disk and you'll just be copying the root filesystem, which is what we want. |
13341 | Thu Sep 28 23:32:38 2017 |
gautam | HowTo | CDS | pyawg | I've modified the __init__.py file located at /ligo/apps/linux-x86_64/cdsutils-480/lib/python2.7/site-packages/cdsutils/__init__.py so that you can now simply import pyawg from cdsutils. On the control room workstations, iPython is set up such that cdsutils is automatically imported as "cds". Now this import also includes the pyawg stuff. So to use some pyawg function, you would just do (for example):
exc=cds.awg.ArbitraryLoop(excChan,excit,rate=fs)
One could also explicitly do the import if cdsutils isn't automatically imported:
from cdsutils import awg
pyawg- away!
Linking this useful instructional elog from Chris here: https://nodus.ligo.caltech.edu:8081/Cryo_Lab/1748 |
13342 | Thu Sep 28 23:47:38 2017 |
gautam | Update | CDS | 40m files backup situation | The nodus backup too is now complete - however, I am unable to mount the backup disk anywhere. I tried on a couple of different machines (optimus, chiara and pianosa), but I always get the same error:
mount: unknown filesystem type 'LVM2_member'
The disk itself is being recognized, and I can see the partitions when I run lsblk, but I can't get the disk to actually mount.
Doing a web search, I came across a few blog posts suggesting that the problem can be resolved using the vgchange utility - but I am not sure what exactly this does, so I am holding off on trying it.
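For reference, the usual sequence with vgchange is roughly the following (a sketch; the volume group and logical volume names are whatever the scan reports, not known here):
# scan for LVM volume groups on the attached clone, activate them, then mount a logical volume;
# <vg> and <lv> are placeholders for whatever vgs/lvs report
sudo vgscan
sudo vgchange -ay
sudo lvs
sudo mount /dev/<vg>/<lv> /mnt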
To clarify, I performed the cloning by running
sudo dd if=/dev/sda of=/dev/sdb bs=64K conv=noerror,sync
in a tmux session on nodus (as I did for chiara and FB1; the latter backup is still running). |
13344 | Fri Sep 29 09:43:52 2017 |
jamie | HowTo | CDS | pyawg |
Quote: |
I've modified the __init.py__ file located at /ligo/apps/linux-x86_64/cdsutils-480/lib/python2.7/site-packages/cdsutils/__init__.py so that you can now simply import pyawg from cdsutils . On the control room workstations, iPython is set up such that cdsutils is automatically imported as "cds ". Now this import also includes the pyawg stuff. So to use some pyawg function, you would just do (for example):
exc=cds.awg.ArbitraryLoop(excChan,excit,rate=fs)
One could also explicitly do the import if cdsutils isn't automatically imported:
from cdsutils import awg
pyawg- away!
Linking this useful instructional elog from Chris here: https://nodus.ligo.caltech.edu:8081/Cryo_Lab/1748
|
Why aren't you able to just import 'awg' directly? You shouldn't have to import it through cdsutils. Something must be funny with the config. |
13345 | Fri Sep 29 11:07:16 2017 |
gautam | Update | CDS | 40m files backup situation | The FB1 dd backup process seems to have finished too - but I got the following message:
dd: error writing ‘/dev/sdc’: No space left on device
30523666+0 records in
30523665+0 records out
2000398934016 bytes (2.0 TB) copied, 50865.1 s, 39.3 MB/s
Running lsblk shows the following:
controls@fb1:~ 32$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb 8:16 0 23.5T 0 disk
└─sdb1 8:17 0 23.5T 0 part /frames
sda 8:0 0 2T 0 disk
├─sda1 8:1 0 476M 0 part /boot
├─sda2 8:2 0 18.6G 0 part /var
├─sda3 8:3 0 8.4G 0 part [SWAP]
└─sda4 8:4 0 2T 0 part /
sdc 8:32 0 1.8T 0 disk
├─sdc1 8:33 0 476M 0 part
├─sdc2 8:34 0 18.6G 0 part
├─sdc3 8:35 0 8.4G 0 part
└─sdc4 8:36 0 1.8T 0 part
While I am able to mount /dev/sdc1, I can't mount /dev/sdc4, for which I get the error message
controls@fb1:~ 0$ sudo mount /dev/sdc4 /mnt/HGSTbackup/
mount: wrong fs type, bad option, bad superblock on /dev/sdc4,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so.
Looking at dmesg, it looks like this error is related to the fact that we are trying to clone a 2TB disk onto a 1.8TB disk - it complains about block size exceeding device size.
So we either have to get a larger disk (4TB?) to do the dd backup, or do the backup some other way (e.g. unmount the /frames RAID, delete everything in /frames, and then do dd, as Jamie suggested). If I understand correctly, unmounting the /frames RAID will require that we stop all the daqd processes for the duration of the dd backup.
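(A quick way to check the size constraint before starting another multi-hour dd run is to compare the exact device sizes in bytes:)
# exact sizes in bytes; the destination must report a number >= the source
sudo blockdev --getsize64 /dev/sda   # source (FB1 boot disk)
sudo blockdev --getsize64 /dev/sdc   # destination (HGST drive)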
Quote: |
sudo dd if=/dev/sda of=/dev/sdb bs=64K conv=noerror,sync
in a tmux session on nodus (as I did for chiara and FB1 , latter backup is still running).
|
Edit: unmounting /frames won't help, since dd makes a bit-for-bit copy of the drive being cloned. So we need a drive whose size is >= that of the drive we are trying to clone. On FB1, this is /dev/sda, which has a size of 2TB. The HGST drive we got has an advertised size of 2TB, but it looks like only 1.8TB is actually available (presumably the usual decimal-vs-binary units issue: an "advertised" 2TB drive is about 1.8TiB, which is what lsblk reports, while /dev/sda is a full ~2TiB). So I think we need to order a 4TB drive. |
13349 | Mon Oct 2 18:08:10 2017 |
gautam | Update | CDS | c1ioo DC errors | I was trying to set up a DAC channel to interface with the AOM driver on the PSL table.
- It would have been most convenient to use channels from c1ioo given proximity to the PSL table.
- Looking at the 1X2 rack, it looked like there were indeed some spare DAC channels available.
- So I thought I'd run a test by adding some TPs to the c1als model (because it seems to have the most head room in terms of CPU time used).
- I added the DAC_0 block from CDS_PARTS library to c1als model (after confirming that the same part existed in the IOP model, c1x03).
- Model recompiled fine (I ran rtcds make c1als and rtcds install c1als on c1ioo).
- However, I got a bunch of errors when I tried to restart the model with rtcds restart c1als. The model itself never came up.
- Looking at dmesg, I saw stuff like
[4072817.132040] c1als: Failed to allocate DAC channel.
[4072817.132040] c1als: DAC local 0 global 16 channel 4 is already allocated.
[4072817.132040] c1als: Failed to allocate DAC channel.
[4072817.132040] c1als: DAC local 0 global 16 channel 5 is already allocated.
[4072817.132040] c1als: Failed to allocate DAC channel.
[4072817.132040] c1als: DAC local 0 global 16 channel 6 is already allocated.
[4072817.132040] c1als: Failed to allocate DAC channel.
[4072817.132040] c1als: DAC local 0 global 16 channel 7 is already allocated.
[4073325.317369] c1als: Setting stop_working_threads to 1
- Looking more closely at the log messages, it seemed like rtcds could not find any DAC cards on c1ioo.
- I went back to 1X2 and looked inside the expansion chassis. I could only find two ADC cards and 1 BIO card installed. The SCSI cable labelled ("DAC 0") running from the rear of the expansion chassis to the 1U SCSI->40pin IDE breakout chassis wasn't actually connected to anything inside the expansion chassis.
- I then undid my changes (i.e. deleted all parts I added in the simulink diagram), and recompiled c1als.
- This time the model came back up but I saw a "0x2000" error in the GDS overview MEDM screen.
- Since there are no DACs installed in the c1ioo expansion chassis, I thought perhaps the problem had to do with the fact that there was a "DAC_0" block in the c1x03 simulink diagram - so I deleted this block, recompiled c1x03, and for good measure, restarted all (three) models on c1ioo.
- Now, however, I get the same 0x2000 error on both the c1x03 and c1als GDS overview MEDM screens (see Attachment #1).
- An elog search revealed that perhaps this error is related to DAQ channels being specified without recording rates (e.g. 16384, 2048 etc). There were a few DAQ channels inside c1als which didn't have recording rates specified, so I added the rates, and restarted the models, but the errors persist.
- According to the RCG runtime diagnostics document, T1100625 (which admittedly is for RCG v 2.7 while we are running v3.4), this error has to do with a mismatch between the DAQ config files read by the RTS and the DAQD system, but I'm not sure how to debug this further.
- I also suspect there is something wrong with the mx processes:
controls@c1ioo:~ 130$ sudo systemctl status mx
● open-mx.service - LSB: starts Open-MX driver
Loaded: loaded (/etc/init.d/open-mx)
Active: failed (Result: exit-code) since Tue 2017-10-03 00:27:32 UTC; 34min ago
Process: 29572 ExecStop=/etc/init.d/open-mx stop (code=exited, status=1/FAILURE)
Process: 32507 ExecStart=/etc/init.d/open-mx start (code=exited, status=1/FAILURE)
Oct 03 00:27:32 c1ioo systemd[1]: Starting LSB: starts Open-MX driver...
Oct 03 00:27:32 c1ioo open-mx[32507]: Loading Open-MX driver (with ifnames=eth1 )
Oct 03 00:27:32 c1ioo open-mx[32507]: insmod: ERROR: could not insert module /opt/3.2.88-csp/open-mx-1.5.4/modules/3.2.88-csp/open-mx.ko: File exists
Oct 03 00:27:32 c1ioo systemd[1]: open-mx.service: control process exited, code=exited status=1
Oct 03 00:27:32 c1ioo systemd[1]: Failed to start LSB: starts Open-MX driver.
Oct 03 00:27:32 c1ioo systemd[1]: Unit open-mx.service entered failed state.
- Not sure if this is related to the DC error though.
|
Attachment 1: c1ioo_CDS_errors.png
|
|
13350 | Mon Oct 2 18:50:55 2017 |
jamie | Update | CDS | c1ioo DC errors |
Quote: |
- This time the model came back up but I saw a "0x2000" error in the GDS overview MEDM screen.
- Since there are no DACs installed in the c1ioo expansion chassis, I thought perhaps the problem had to do with the fact that there was a "DAC_0" block in the c1x03 simulink diagram - so I deleted this block, recompiled c1x03, and for good measure, restarted all (three) models on c1ioo.
- Now, however, I get the same 0x2000 error on both the c1x03 and c1als GDS overview MEDM screens (see Attachment #1).
|
From page 21 of T1100625, DAQ status "0x2000" means that the channel list is out of sync between the front end and the daqd. This usually happens when you add channels to the model and don't restart the daqd processes, which sounds like it might be applicable here.
It looks like open-mx is loaded fine (via "rtcds lsmod"), even though the systemd unit is complaining. I think this is because the open-mx service is old style and is not intended for module loading/unloading with the new style systemd stuff. |
13351 | Mon Oct 2 19:03:49 2017 |
gautam | Update | CDS | [Solved] c1ioo DC errors | This did the trick - I simply ran
sudo systemctl restart daqd_*
on FB1, and now all the CDS overview lights are green again.
I thought I had done this already, but I realize that I was supposed to restart the daqd processes on FB1 (which is where they are running) and not on c1ioo.
Thanks Jamie for the speedy resolution!
Quote: |
Quote: |
- This time the model came back up but I saw a "0x2000" error in the GDS overview MEDM screen.
- Since there are no DACs installed in the c1ioo expansion chassis, I thought perhaps the problem had to do with the fact that there was a "DAC_0" block in the c1x03 simulink diagram - so I deleted this block, recompiled c1x03, and for good measure, restarted all (three) models on c1ioo.
- Now, however, I get the same 0x2000 error on both the c1x03 and c1als GDS overview MEDM screens (see Attachment #1).
|
From page 21 of T1100625, DAQ status "0x2000" means that the channel list is out of sync between the front end and the daqd. This usually happens when you add channels to the model and don't restart the daqd processes, which sounds like it might be applicable here.
It looks like open-mx is loaded fine (via "rtcds lsmod"), even though the systemd unit is complaining. I think this is because the open-mx service is old style and is not intended for module loading/unloading with the new style systemd stuff.
|
|
Attachment 1: CDSoverview.png
|
|
13355 | Tue Oct 3 19:39:10 2017 |
gautam | Update | CDS | slow machine bootfest | Eurocrate key-turning reboots for c1susaux, c1auxex, c1auxey, c1iscaux, c1iscaux2 and c1aux. Usual precautions were taken for ITMX. Did a burtrestore for c1iscaux and c1iscaux2 in order to restore the LSC PD whitening gains.
Un-related to this work: input pointing into PMC was tweaked as the PMC_REFL spot was pretty bright. |
13356 | Wed Oct 4 17:18:15 2017 |
gautam | Update | CDS | Free DAC channels in c1lsc | There are at least 5 free DAC channels (4 if you discount the one channel from these that I am hijacking) available in the 1Y2 electronics rack.
Jamie's nice wiring diagram shows the topology - the actual DAC card sits in 1Y3 inside the c1lsc expansion chassis (while the c1lsc frontend itself is in 1X4). The output of the DAC goes via SCSI to an interface box (D080303) and then to some dewhitening/AI boards (D000316). There are a total of 16 DAC channels available, out of which 8 are used for the TTs, 2 are used for the DAFI model, and one is labelled "From c1ioo 1X2" (I don't know what this one is for). So I'm going to use some of these channels for measuring the coupling of oscillator noise and intensity noise to MICH in the DRMI lock.
The de-whitening/AI board seems to be old - it has 2x 800Hz Butterworth LPFs and no notch for the clock frequency, but maybe this doesn't matter for the tests I have in mind. The AI board available on 1X2 is more modern but routing the DAC channels from 1Y2 to it is going to be some work.
I'm going to add my testpoint to c1daf given that it seems to be the least critical model on c1lsc.
EDIT: testpoints added to c1daf don't show up in the list of available channels - there was some issue with this model while we were getting the new RTCDS going. So I'm moving my temporary testpoint to c1cal instead. |
13360 | Thu Oct 5 11:46:15 2017 |
gautam | Update | CDS | slow machine bootfest | MC Autolocker was unhappy because c1iool0 was unresponsive and hence it couldn't write to the "C1:IOO-MC_AUTOLOCK_BEAT" channel. I keyed the crate and the IMC locked almost immediately. I'm moving this channel into the RTCDS model as we did for the IFO_STATE EPICS channel, so that the autolocker isn't dependent on c1iool0 (which was the whole point of migrating the IFO-STATE variable anyway). I also commented out all of these channels in /cvs/cds/caltech/target/c1iool0/autolocker.db so that there aren't duplicate channels.
Quote: |
Eurocrate key turning reboots for c1susaux, c1auxex,c1auxey, c1iscaux, c1iscaux2 and c1aux. Usual precautions were taken for ITMX. Did burtrestore for c1iscaux andc1iscaux2 in order to restore the LSC PD whitening gains.
Un-related to this work: input pointing into PMC was tweaked as the PMC_REFL spot was pretty bright.
|
|
13361 | Thu Oct 5 13:58:26 2017 |
gautam | Update | CDS | 40m files backup situation | The 4TB HGST drives have arrived. I've started the FB1 dd backup process. Should take a day or so.
Quote: |
Edit: unmounting /frames won't help, since dd makes a bit for bit copy of the drive being cloned. So we need a drive with size that is >= that of the drive we are trying to clone. On FB1, this is /dev/sda, which has a size of 2TB. The HGST drive we got has an advertised size of 2TB, but looks like actually only 1.8TB is available. So I think we need to order a 4TB drive. |
|
13364 | Fri Oct 6 12:46:17 2017 |
gautam | Update | CDS | 40m files backup situation | Looks to have worked this time around.
controls@fb1:~ 0$ sudo dd if=/dev/sda of=/dev/sdc bs=64K conv=noerror,sync
33554416+0 records in
33554416+0 records out
2199022206976 bytes (2.2 TB) copied, 55910.3 s, 39.3 MB/s
You have new mail in /var/mail/controls
I was able to mount all the partitions on the cloned disk. I will now try booting from this disk on the spare machine I am testing in the office area. That'd be a "real" test of whether this backup is useful in the event of a disk failure.
Quote: |
The 4TB HGST drives have arrived. I've started the FB1 dd backup process. Should take a day or so.
|
|
13379 | Thu Oct 12 14:42:45 2017 |
gautam | Update | CDS | slow machine bootfest | Steve reported problems getting the X arm locked. Alignment sliders were inaccessible. Eurocrate key-turning reboots for c1susaux, c1auxex, c1auxey, c1iscaux and c1aux. Usual precautions were taken for ITMX.
This is becoming a once-a-week thing. |
13381 | Mon Oct 16 12:13:38 2017 |
gautam | Update | CDS | Megatron maintenance | Wall StripTool traces showed that the IMC had not been locked for at least 8 hours when I came in this morning. Going to the IMC autolocker log, it looks like the last timestamp was at ~6pm yesterday. Megatron was responding to ping, but I couldn't ssh into it. So I went over to the machine and did a hard reboot via the front panel power switch. The computer took ~10 mins to come back online and respond to ping. Once it did, I was able to ssh into it. However, trying the usual commands to restart the IMC autolocker and FSS Slow loops didn't work. Specifically, monitoring the logfile with tail -f Autolocker.log, I would see that the autolocker seemed to get stuck after starting the "blinky" script. Trying to restart the process using sudo initctl restart MCautolocker, init would print to the shell that the restart had worked, and report the PID, but the logfile wouldn't update "live" as it should when tail is used with the -f option. All very strange.
Anyway, as a last resort, I kill -9'ed the PID for the init instance, and init automatically restarted the Autolocker - this did the trick; the IMC is locked now and the logfile seems to be getting updated normally.
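In command form, the sequence described above was roughly the following (Upstart-era commands on megatron; the PID is a placeholder):
# find the PID of the Upstart-managed autolocker job, kill it, and let init respawn it
sudo initctl status MCautolocker   # reports something like "MCautolocker start/running, process <PID>"
sudo kill -9 <PID>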
I also cleared a bunch of matlab crash dump files in the home directory. |
13385 | Tue Oct 17 23:07:52 2017 |
gautam | Update | CDS | FEs unresponsive | While working on the IFO tonight, I noticed that the blinky status lights on c1iscex and c1iscey were frozen (but those on the other 3 FEs seemed fine). But all other lights on the CDS overview screen were green. I couldn't access testpoints from these machines, and the EPICS readbacks for models on these FEs (e.g. Oplev servo inputs, outputs, etc.) were frozen at some fixed value. This lasted for a good 5 minutes at least. But the blinky lights started blinking again without me doing anything. Not sure what to make of this. I am also not sure how to diagnose this problem, as trending the slow EPICS records of the CPU execution cycle time (for example) doesn't show any irregularity. |
13386 | Wed Oct 18 01:41:32 2017 |
jamie | Update | CDS | FEs unresponsive |
Quote: |
While working on the IFO tonight, I noticed that the blinky status lights on c1iscex and c1iscey were frozen (but those on the other 3 FEs seemed fine). But all other lights on the CDS overview screen were green I couldn't access testpoints from these machines, and the EPICS readbacks for models on these FEs (e.g. Oplev servo inputs outputs etc) were frozen at some fixed value. This lasted for a good 5 minutes at least. But the blinky lights started blinking again without me doing anything. Not sure what to make of this. I am also not sure how to diagnose this problem, as trending the slow EPICS records of the CPU execution cycle time (for example) doesn't show any irregularity.
|
So this wasn't just an EPICS freeze? I don't see how this had anything to do with any of the work I did earlier today. I didn't modify any of the running front ends, didn't touch either of the end station machines or the DAQ, and didn't modify the network in any way. I didn't leave anything running.
If you couldn't access test points, then it sounds like it was more than just EPICS. It sounds like maybe the end machines somehow fell off the network momentarily. Was there anything else going on at the time? |
13387 | Wed Oct 18 02:09:32 2017 |
gautam | Update | CDS | FEs unresponsive | I was looking at the ASDC channel on dataviewer, and toggling various settings like whitening gain. At some point, the signal just froze. So I quit dataviewer and tried restarting it, at which point it complained about not being able to connect to FB. This is when I brought up the CDS_OVERVIEW medm screen, and noticed the frozen 1pps indicator lights. There was certainly something going on with the end FEs, because I was able to ping the machine, but not ssh into it. Once the 1pps lights came back, I was able to ssh into c1iscex and c1iscey, no problems.
Could it be that some of the mx processes stalled, but the systemctl routine automatically restarted them after some time?
Quote: |
So this wasn't just an EPICS freeze? I don't see how this had anything to do with any of the work I did earlier today. I didn't modify any of the running front ends, didn't touch either of the end station machines or the DAQ, and didn't modify the network in any way. I didn't leave anything running.
If you couldn't access test points then it sounds like it was more than just EPICS. It sounds like maybe the end machines somehow fell of the network momentarily. Was there anything else going on at the time?
|
|
13388 | Wed Oct 18 09:21:22 2017 |
jamie | Update | CDS | FEs unresponsive |
Quote: |
I was looking at the ASDC channel on dataviewer, and toggling various settings like whitening gain. At some point, the signal just froze. So I quit dataviewer and tried restarting it, at which point it complained about not being able to connect to FB. This is when I brought up the CDS_OVERVIEW medm screen, and noticed the frozen 1pps indicator lights. There was certainly something going on with the end FEs, because I was able to ping the machine, but not ssh into it. Once the 1pps lights came back, I was able to ssh into c1iscex and c1iscey, no problems.
Could it be that some of the mx processes stalled, but the systemctl routine automatically restarted them after some time?
|
An mx_stream glitch would have interrupted data flowing from the front end to the DAQ, but it wouldn't have affected the heartbeat. The heartbeat stop could mean either that the front end process froze, or the EPICS communication stopped. The fact that everything came back fine after a couple of minutes indicates to me that the front end processes all kept running fine. If they hadn't I'm sure the machines would have locked up. The fact that you couldn't connect to the FE machine is also suspicious.
My best guess is that there was a network glitch on the martian network. I don't know how to account for the fact that pings still worked, though. |
13394 | Wed Oct 18 23:11:53 2017 |
gautam | Update | CDS | FEs unresponsive | This happened again just now - it was roughly this time when this happened last night as well.
There was certainly an EPICS freeze of the kind we were used to seeing prior to replacing the martian wireless router sometime in late 2015 (or early 2016?). I was trying to run the dither alignment servos on the Y-arm at this time, and all the StripTool traces flatlined.
I took the opportunity to try accessing testpoints from the iscey ADCs - specifically C1:SUS-TRY_OUT, and it seemed to work just fine. However, I couldn't ssh into c1iscey.
Looking at dmesg once I was eventually able to ssh in (~2 minutes of dead time tonight; I feel like it was longer yesterday, but I can't quantify that), I see the following. I'm not sure if there are any clues in here, or whether this is even the correct log to check, but there are many instances of the NFS-server-related message in the log. Note that the system timestamp corresponds to when this freeze happened.
[5461308.784018] nfs: server 192.168.113.201 not responding, still trying
[5461412.936284] nfs: server 192.168.113.201 OK
[5461412.937130] systemd[1]: Starting Journal Service...
[5461412.947947] systemd-journald[20281]: Received SIGTERM from PID 1 (systemd).
[5461412.996063] systemd[1]: Unit systemd-journald.service entered failed state.
[5461413.002627] systemd[1]: systemd-journald.service has no holdoff time, scheduling restart.
[5461413.008983] systemd[1]: Stopping Journal Service...
[5461413.014664] systemd[1]: Starting Journal Service...
[5461413.044262] systemd[1]: Started Journal Service.
[5461413.694838] systemd-journald[400]: Received request to flush runtime journal from PID 1
|
13396 | Fri Oct 20 16:30:17 2017 |
gautam | Update | CDS | FB1 installed on shelves | [steve, jamie, gautam]
The machine that now serves as our frame builder, FB1, was sitting on top of megatron. I decided that this wasn't ideal, and asked Steve to get some alternative mounting solution. Today, he procured some shelves to put FB1 on. Jamie suggested looking for the slider rail that came with the machine and using that instead, as it would allow us to slide FB1 out of the rack as we do megatron and the old FB. But as luck would have it, the distance between the rack's vertical posts is 26 inches, while the rail is 27 inches. So we had to accept the less ideal solution of putting FB1 on two shelves, with no sliding option. Photo to be uploaded shortly.
For this work, I had to shutdown FB1 for about 1 hour between 3pm and 4pm. It seems to have come back up fine now. |
13398 | Tue Oct 24 16:22:53 2017 |
gautam | Update | CDS | Toy DARM model setup in c1tst | [alex, gautam]
Alex is going to have an undergrad work on a calibration optimization project on the 40m RTCDS system. For this purpose, we wanted to set up a "Simulated DARM loop". Today, Alex and I set this up; I figured we could use the c1tst model for this purpose. We basically copied the topology from Figure 2 of the h(t) paper. Attached are screenshots of the MEDM screens of the system we set up, and the Simulink block diagram - the main screen can be accessed from the "SIM PLANT" tab in the sitemap.
It remains to setup the appropriate filters in the filter banks, and an EPICS channel monitor for monitoring the single excitation testpoint in the model. We also did not set up any DQ channels for the time being, as it is not even clear to me what channels need to be DQ-ed. |
Attachment 1: TOY_DARM.png
|
|
Attachment 2: TOY_DARM_SIMULINK.png
|
|
13399 | Tue Oct 24 16:43:11 2017 |
Steve | Update | CDS | slow machine bootfest | [Gautam, Steve]
c1susaux & c1iscaux were rebooted manually.
Quote: |
Had to reboot c1psl, c1susaux, c1auxex, c1auxey and c1iscaux today. PMC has been relocked. ITMX didn't get stuck. According to this thread, there have been two instances in the last 10 days in which c1psl and c1susaux have failed. Since we seem to be doing this often lately, I've made a little script that uses the netcat utility to check which slow machines respond to telnet, it is located at /opt/rtcds/caltech/c1/scripts/cds/testSlowMachines.bash.
The script can be executed by ./testSlowMachines.bash .
|
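For reference, the kind of check the quoted testSlowMachines.bash script performs can be sketched as follows (the host list and timeout are illustrative; the actual script may differ):
#!/bin/bash
# probe the telnet port (23) on each slow machine with a short timeout;
# the host list here is illustrative, not necessarily what the real script uses
for host in c1psl c1iool0 c1susaux c1auxex c1auxey c1iscaux c1iscaux2 c1aux; do
    if nc -z -w 2 "$host" 23 >/dev/null 2>&1; then
        echo "$host: responding"
    else
        echo "$host: NOT responding"
    fi
done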
|
13404 | Sat Oct 28 00:36:26 2017 |
gautam | Update | CDS | 40m files backup situation - ddrescue | None of the 3 dd backups I made were bootable - at boot, selecting the drive put me into grub rescue mode, which seemed to suggest that the /boot partition did not exist on the backed up disk, despite the fact that I was able to mount this partition on a booted computer. Perhaps for the same reason, but maybe not.
After going through various StackOverflow posts / blogs / other googling, I decided to try cloning the drives using ddrescue instead of dd.
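The exact invocation isn't recorded here; a typical whole-disk clone with ddrescue looks something like this (the flags and map-file path are assumptions):
# clone /dev/sda onto /dev/sdb, keeping a map file so an interrupted run can be resumed;
# -f is needed to allow writing to a block device, -n skips the slow scraping phase
sudo ddrescue -f -n /dev/sda /dev/sdb /home/controls/ddrescue_sda.map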
This seems to have worked for nodus - I was able to boot to a console on the machine called rosalba, which was lying around under my desk. I deliberately did not have this machine connected to the martian network during the boot process, for fear of issues from having multiple "nodus"-es on the network, so it complained a bit about starting the elog and other network-related issues, but it seems like we have a plug-and-play version of the nodus root filesystem now.
chiara and fb1 rootfs backups (made using ddrescue) are still not bootable - I'm working on it.
Nov 6 2017: I am now able to boot the chiara backup as well - although mysteriously, I cannot boot it from the machine called rosalba, but can boot it from ottavia. Anyways, seems like we have usable backups of the rootfs of nodus and chiara now. FB1 is still a no-go, working on it.
Quote: |
Looks to have worked this time around.
controls@fb1:~ 0$ sudo dd if=/dev/sda of=/dev/sdc bs=64K conv=noerror,sync
33554416+0 records in
33554416+0 records out
2199022206976 bytes (2.2 TB) copied, 55910.3 s, 39.3 MB/s
You have new mail in /var/mail/controls
I was able to mount all the partitions on the cloned disk. Will now try booting from this disk on the spare machine I am testing in the office area now. That'd be a "real" test of if this backup is useful in the event of a disk failure.
|
|
Attachment 1: 415E2F09-3962-432C-B901-DBCB5CE1F6B6.jpeg
|
|
Attachment 2: BFF8F8B5-1836-4188-BDF1-DDC0F5B45B41.jpeg
|
|
13408 | Mon Oct 30 11:15:02 2017 |
gautam | Update | CDS | slow machine bootfest + vacuum snafu | Eurocrate key-turning reboots this morning for c1psl and c1aux. c1auxex and c1auxey are also down, but I didn't bother keying them for now. The PSL FSS slow loop is now active again (its inactivity was what prompted me to check the status of the slow machines).
Note that the EPICS channels for the PSL shutter are hosted on c1aux. But it looks like the slow machine became unresponsive at some point during the weekend, so plotting the trend data for the PSL shutter channel would have you believe that the PSL shutter was open all the time. But the MC_REFL DC channel tells a different story - it suggests that the PSL shutter was closed at ~4AM on Sunday, presumably by the vacuum interlock system. I wonder:
- How does the vacuum interlock close the PSL shutter? Is there a non-EPICS channel path? Because if the slow machine happens to be unresponsive when the interlock wants to close the PSL shutter via EPICS commands, it will be unable to. The fact that the PSL shutter did close suggests that there is indeed another path.
- We should add some feature to the vacuum interlock (if it doesn't already exist) such that the PSL shutter isn't accidentally re-opened until any vacuum related issues are resolved. Steve was immediately able to identify that the problem was vacuum related, but I think I would have just re-opened the PSL shutter thinking that the issue was slow computer related.
|
13410 | Mon Nov 6 11:15:43 2017 |
gautam | Update | CDS | slow machine bootfest + IFO re-alignment | Eurocrate key-turning reboots this morning for c1susaux, c1auxex and c1auxey. Usual precautions were taken to minimize the risk of ITMX getting stuck.
The IFO hasn't been aligned in ~1week, so I recovered arm and PRM alignment by locking individual arms and also PRMI on carrier. I will try recovering DRMI locking in the evening.
As far as MC1 glitching is concerned, there hasn't been any major one (i.e. one in which MC1 is kicked by such a large amount that the autolocker can't lock the IMC) for the past 2 months - but the MC WFS offsets are an indication of when smaller glitches have taken place, and there were large DC offsets on the MC WFS servo outputs, which I offloaded to the DC MC suspension sliders using the MC WFS relief script.
I'd like the save-restore routine that runs when the slow machines reboot to set the watchdog state default to OFF (currently, after a key-turning reboot, the watchdogs are enabled by default), but I'm not really sure how this whole system works. The relevant files seem to be in the directory /cvs/cds/caltech/target/c1susaux. There is a script in there called startup.cmd, which seems to be the initialization script that runs when the slow machine is rebooted. But looking at this file, it is not clear to me where the default values are loaded from. There are a few "saverestore" files in this directory as well:
saverestore.sav
saverestore.savB
saverestore.sav.bu
saverestore.req
Are the "default" channel values loaded from one of these? |
13420 | Wed Nov 8 17:04:21 2017 |
gautam | Update | CDS | gds-2.17.15 [not] installed | I wanted to use the foton.py utility for my NB tool, and I remember Chris telling me that it was shipping as standard with the newer versions of gds. It wasn't available in the versions of gds available on our workstations - the default version is 2.15.1. So I downloaded gds-2.17.15 from http://software.ligo.org/lscsoft/source/, and installed it to /ligo/apps/linux-x86_64/gds-2.17.15/gds-2.17.15. In it, there is a file at GUI/foton/foton.py.in - this is the one I needed.
Turns out this was more complicated than I expected. Building the newer version of gds throws up a bunch of compilation errors. Chris had pointed me to some pre-built binaries for ubuntu12 on the llo cds wiki, but those versions of gds do not have foton.py. I am dropping this for now. |
13422 | Thu Nov 9 15:33:08 2017 |
johannes | Update | CDS | revisiting Acromag |
Quote: |
We probably want to get a dedicated machine that will handle the EPICS channel serving for the Acromag system
|
http://www.supermicro.com/products/system/1U/5015/SYS-5015A-H.cfm?typ=H
This is the machine that Larry suggested when I asked him for his opinion on a low-workload rack-mount unit. It only has an Atom processor, but I don't think it needs anything particularly powerful under the hood. He said that he will likely be able to let us borrow one of his for a couple of days to see if it's up to the task. The dual Ethernet is a nice touch; maybe we can keep the communication between the server and the DAQ units on their own separate local network. |
13436 | Tue Nov 21 11:21:26 2017 |
gautam | Update | CDS | RFM network down | I noticed yesterday evening that I wasn't able to engage the single-arm locking servos - it turned out that they weren't getting triggered, which in turn pointed me to the fact that the arm transmission channels seemed dead. Poking around a little, I found that there was a red light on the CDS overview screen for c1rfm.
- The error seems to be in the receiving model only, i.e. c1rfm, all the sending models (e.g. c1scx) don't report any errors, at least on the CDS overview screen.
- Judging by dataviewer trending of the c1rfm status word, seems like this happened on Sunday morning, around 11am.
- I tried restarting both sender and receiver models, but error persists.
- I got no useful information from the dmesg logs of either c1sus (which runs c1rfm), or c1iscex (which runs c1scx).
- There are no physical red lights in the expansion chassis that I could see - in the past, when we have had some timing errors, this would be a signature.
Not sure how to debug further...
* Fix seems to be to restart the sender RFM models (c1scx, c1scy, c1asx, c1asy). |
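In command form, that fix amounts to something like the following, run on the respective end machines (a sketch; the host-to-model mapping is assumed from the model names above):
# on c1iscex:
rtcds restart c1scx
rtcds restart c1asx
# on c1iscey:
rtcds restart c1scy
rtcds restart c1asy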
Attachment 1: RFMerrors.png
|
|
13477 | Thu Dec 14 19:41:00 2017 |
gautam | Update | CDS | CDS recovery, NFS woes | [Koji, Jamie(remote), gautam]
Summary: The CDS system seems to be back up and functioning. But there seems to be some pending problems with the NFS that should be looked into.
We locked the Y-arm and hand-aligned transmission to 1. There are some pending problems with the ASS model (possibly symptomatic of something more general). We didn't touch the Xarm because we don't know exactly what the status of ETMX is.
Problems raised in elogs in the thread of 13474 and also 13436 seem to be solved.
I would make a detailed post on how the problems were fixed, but unfortunately, most of what we did was not scientific/systematic/repeatable. Instead, I note here some general points (Jamie/Koji can add to / correct me):
- There is a "known" problem with unloading models on c1lsc. Sometimes, running rtcds stop <model> will kill the c1lsc frontend.
- Sometimes, when one machine on the dolphin network goes down, all 3 go down.
- The new FB/RCG means that some of the old commands no longer work. Specifically, the old procedure of telnet fb 8087 followed by shutdown (to fix DC errors) no longer works. Instead, ssh into fb1 and run sudo systemctl restart daqd_* (these commands are collected in the summary just after this list).
- Timing error on c1sus machine was linked to the mx_stream processes somehow not being automatically started. The "!mxstream restart" button on the CDS overview MEDM screen should run the necessary commands to restart it. However, today, I had to manually run sudo systemctl start mx_stream on c1sus to fix this error. It is a mystery why the automatic startup of this process was disabled in the first place. Jamie has now rectified this problem, so keep an eye out.
- c1oaf persistently reported DC errors (0x2bad) that couldn't be addressed by running mxstream restart or restarting the daqd processes on FB1. Restarting the model itself (i.e. rtcds restart c1oaf) fixed this issue (though of course I took the risk of having to go into the lab and hard-reboot 3 machines).
- At some point, we thought we had all the CDS lights green - but at that point, the END FEs crashed, necessitating Koji->EX and Gautam->EY hard reboots. This is a new phenomenon. Note that the vertex machines were unaffected.
- At some point, all the DC lights on the CDS overview screen went white - at the same time, we couldn't ssh into FB1, although it was responding to ping. After ~2mins, the green lights came back and we were able to connect to FB1. Not sure what to make of this.
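For convenience, here are the restart commands referred to in the list above, collected in one place:
# on fb1 (where the daqd processes run) - this is what clears the DC errors:
sudo systemctl restart daqd_*
# on the front end reporting the timing/mx_stream error (c1sus in this instance):
sudo systemctl start mx_stream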
While trying to run the dither alignment scripts for the Y-arm, we noticed some strange behaviour:
- Even when there was no signal (looking at EPICS channels) at the input of the ASS servos, the output was fluctuating wildly by ~20cts-pp.
- This is not simply an EPICS artefact, as we could see corresponding motion of the suspension on the CCD.
- A possible clue is that when I run the "Start Dither" script from the MEDM screen, I get a bunch of error messages (see Attachment #2).
- Similar error messages show up when running the LSC offset script for example. Seems like there are multiple ports open somehow on the same machine?
- There are no indicator lights on the CDS overview screen suggesting where the problem lies.
Will continue investigating tomorrow.
Some other general remarks:
- ETMX watchdog remains shutdown.
- ITMY and BS oplevs have been hijacked for HeNe RIN / Oplev sensing noise measurement, and so are not enabled.
- Y arm trans QPD (Thorlabs) has large 60Hz harmonics. These can be mitigated by turning on a 60Hz comb filter, but we should check if this is some kind of ground loop. The feature is much less evident when looking at the TRANS signal on the QPD.
UPDATE 8:20pm:
Koji suggested trying to simply restart the ASS model to see if that fixes the weird errors shown in Attachment #2. This did the trick. But we are now faced with more confusion - during the restart process, the various indicators on the CDS overview MEDM screen froze up, which is usually symptomatic of the machines being unresponsive and requiring a hard reboot. But we waited for a few minutes, and everything mysteriously came back. Over repeated observations and looking at the dmesg of the frontend, the problem seems to be connected with an unresponsive NFS connection. Jamie had noted some time ago that the NFS seems unusually slow. How can we fix this problem? Is it feasible to have a dedicated machine that is not FB1 do the NFS serving for the FEs? |
Attachment 1: CDS_14Dec2017.png
|
|
Attachment 2: CDS_errors.png
|
|
13479 | Fri Dec 15 00:26:40 2017 |
johannes | Update | CDS | Re: CDS recovery, NFS woes |
Quote: |
Didn't touch Xarm because we don't know what exactly the status of ETMX is.
|
The Xarm is currently in its original state: all cables are connected and c1auxex is hosting the slow channels. |
13480 | Fri Dec 15 01:53:37 2017 |
jamie | Update | CDS | CDS recovery, NFS woes |
Quote: |
I would make a detailed post with how the problems were fixed, but unfortunately, most of what we did was not scientific/systematic/repeatable. Instead, I note here some general points (Jamie/Koji can addto /correct me):
- There is a "known" problem with unloading models on c1lsc. Sometimes, running rtcds stop <model> will kill the c1lsc frontend.
- Sometimes, when one machine on the dolphin network goes down, all 3 go down.
- The new FB/RCG means that some of the old commands now no longer work. Specifically, instead of telnet fb 8087 followed by shutdown (to fix DC errors) no longer works. Instead, ssh into fb1, and run sudo systemctl restart daqd_*.
|
This should still work, but the address has changed. The daqd was split up into three separate binaries to get around the issue with the monolithic build that we could never figure out. The address of the data concentrator (DC) (which is the thing that needs to be restarted) is now 8083.
Quote: |
UPDATE 8:20pm:
Koji suggested trying to simply retsart the ASS model to see if that fixes the weird errors shown in Attachment #2. This did the trick. But we are now faced with more confusion - during the restart process, the various indicators on the CDS overview MEDM screen froze up, which is usually symptomatic of the machines being unresponsive and requiring a hard reboot. But we waited for a few minutes, and everything mysteriously came back. Over repeated observations and looking at the dmesg of the frontend, the problem seems to be connected with an unresponsive NFS connection. Jamie had noted sometime ago that the NFS seems unusually slow. How can we fix this problem? Is it feasible to have a dedicated machine that is not FB1 do the NFS serving for the FEs?
|
I don't think the problem is fb1. The fb1 NFS is mostly only used during front end boot. It's the rtcds mount that's the one that sees all the action, which is being served from chiara. |
13481 | Fri Dec 15 11:19:11 2017 |
gautam | Update | CDS | CDS recovery, NFS woes | Looking at the dmesg on c1iscex for example, at least part of the problem seems to be associated with FB1 (192.168.113.201, see Attachment #1). The "server" can be unresponsive for O(100) seconds, which is consistent with the duration for which we see the MEDM status lights go blank, and the EPICS records get frozen. Note that the error timestamped ~4000 was from last night, which means there have been at least 2 more instances of this kind of freeze-up overnight.
I don't know if this is symptomatic of some more widespread problem with the 40m networking infrastructure. In any case, all the CDS overview screen lights were green today morning, and MC autolocker seems to have worked fine overnight.
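A minimal sketch of this kind of check, assuming standard Linux tools on the frontend (the hostname and IP are the ones mentioned above):

# on an affected frontend, e.g. c1iscex
dmesg | grep -i nfs        # look for "server 192.168.113.201 not responding" messages and their timestamps
mount | grep nfs           # confirm which mounts are served by fb1 and which by chiara (e.g. /opt/rtcds)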
I have also updated the wiki page with the updated daqd restart commands.
Unrelated to this work - Koji fixed up the MC overview screen such that the MC autolocker button is now visible again. The problem seems to have been caused by my migrating some of the c1ioo EPICS channels from the slow machine to the fast system, as a result of which the EPICS variable type changed from "ENUM" to something that was not "ENUM". In any case, the button exists now, and the MC autolocker blinky light is responsive to its state.
Quote: |
I don't think the problem is fb1. The fb1 NFS is mostly only used during front end boot. It's the rtcds mount that's the one that sees all the action, which is being served from chiara.
|
|
Attachment 1: NFS.png
|
|
Attachment 2: MCautolocker.png
|
|
13518
|
Tue Jan 9 11:52:29 2018 |
gautam | Update | CDS | slow machine bootfest | Eurocrate key-turning reboots this morning for c1susaux, c1auxey and c1iscaux. These were responding to ping but not telnet-able. The usual precautions were taken to minimize the risk of ITMX getting stuck.
|
13522
|
Wed Jan 10 12:24:52 2018 |
gautam | Update | CDS | slow machine bootfest | MC autolocker got stuck (judging by wall StripTool traces, it had been this way for ~7 hours) because c1psl was unresponsive, so I power cycled it. Now MC is locked. |
13536
|
Thu Jan 11 21:09:33 2018 |
gautam | Update | CDS | revisiting Acromag | We'd like to set up the recording of the PSL diagnostic connector Acromag channels in a more robust way - the objective is to assess the long-term performance of the Acromag DAQ system, glitch rates etc. At the Wednesday meeting, Rana suggested using c1ioo to run the IOC processes - the advantage being that c1ioo has the systemd utility, which seems to be pretty reliable in starting up various processes in the event of the computer being rebooted for whatever reason. Jamie pointed out that this may not be the best approach, however - because all the FEs get the list of services to run from their common shared drive mount point, it may be that in the event of a power failure, for example, all of them try to start the IOC processes, which is presumably undesirable. Furthermore, Johannes reported that the procServ utility is needed to run the modbusIOC process in the background - this utility is not currently available on any of the FEs, and I didn't want to futz around with trying to install it.
One alternative is to connect the PSL Acromag also to the Supermicro computer Johannes has set up at the Xend - it currently has systemd set up to run the modbusIOC, so it has all the necessary utilities. Alternatively, we could use optimus, which has systemd and all the required EPICS dependencies. I feel less wary of trying to install procServ on optimus too. Thoughts?
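For concreteness, a hedged sketch of what a systemd unit wrapping the IOC in procServ might look like - the unit name, console port, and paths below are placeholders, not the actual setup, and the procServ foreground option is assumed to be available in the installed version:

# write a unit file (all names/paths below are hypothetical)
sudo tee /etc/systemd/system/modbusIOC_psl.service > /dev/null <<'EOF'
[Unit]
Description=modbus IOC for the PSL Acromag channels (sketch)
After=network.target

[Service]
# procServ stays in the foreground so systemd can supervise it, and exposes the IOC console on a telnet port (20000 is arbitrary)
ExecStart=/usr/bin/procServ --foreground 20000 /path/to/modbusIOC /path/to/psl_acromag.cmd
Restart=always

[Install]
WantedBy=multi-user.target
EOF

# then register and start it
sudo systemctl daemon-reload
sudo systemctl enable modbusIOC_psl.service
sudo systemctl start modbusIOC_psl.service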
|
13558
|
Fri Jan 19 11:13:21 2018 |
gautam | Update | CDS | slow machine bootfest | c1psl, c1susaux, and c1auxey today
Quote: |
MC autolocker got stuck (judging by wall StripTool traces, it has been this way for ~7 hours) because c1psl was unresponsive so I power cycled it. Now MC is locked.
|
|
13620
|
Thu Feb 8 00:01:08 2018 |
gautam | Update | CDS | Vertex FEs all crashed | I was poking around at the LSC rack to try and set up a temporary arrangement whereby I take the signals from the DAC differentially and route them to the D990694 differentially. The situation is complicated by the fact that, afaik, we don't have any breakout boards for the DIN96 connectors on the back of all our Eurocrate cards (or indeed for many of the other funky connectors we have, like IDE/IDC 10, 50 etc.). I've asked Steve to look into ordering a few of these. So I tried to put together a hacky solution with an expansion card and an IDC64 connector. I must have accidentally shorted a pair of DAC pins or something, because all models on the c1lsc FE crashed. On attempting to restart them (c1lsc was still ssh-able), the usual issue of all vertex FEs crashing happened. It required several iterations of me walking into the lab to hard-reboot FEs, but everything is back green now, and I see the AS beam on the camera, so the input pointing of the TTs is roughly back where it was. Y arm TEM00 flashes are also seen. I'm not going to re-align the IFO tonight. Maybe I'll stick to using a function generator for the THD tests; routing non-AI-ed signals directly is probably as bad as any timing asynchronicity between the funcGen and the DAQ system... |
Attachment 1: CDSrecovery_20180207.png
|
|
13643
|
Tue Feb 20 21:14:59 2018 |
gautam | Update | CDS | RFM network errors | I wanted to lock the single arm POX/POY config to do some tests on the BeatMouth. But I was unable to.
- I tracked the problem down to the fact that the TRX and TRY triggers weren't getting piped correctly to the LSC model
- In fact, all RFM channels from the end machines were showing error rates of 16384/sec (i.e. every sample).
- After watchdogging ETMX, I tried restarting just the c1scx model - this promptly took down the whole c1iscex machine.
- Then I tried the same with c1iscey - this time the models restarted successfully without the c1iscey machine crashing, but the RFM errors persisted for the c1scy channels.
- I walked down to EX and hard rebooted c1iscex.
- c1iscex came back online, and I ssh-ed in and did rtcds start --all.
- This brought all the models back online, and the RFM errors on both c1iscex and c1iscey channels vanished.
Not sure what to make of all this, but I can lock the arms now. |
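A minimal sketch of the recovery sequence described above, using the rtcds commands quoted elsewhere in this thread (the single-model restart is the step that has been known to take down the whole frontend):

# after hard-rebooting the end station machine
ssh c1iscex
rtcds start --all        # bring all models on that FE back up
# restarting a single model, e.g. c1scx, would be:
rtcds stop c1scx
rtcds start c1scx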
13646
|
Wed Feb 21 12:17:04 2018 |
gautam | Update | CDS | LO Power mon channels added to c1lsc | To make this setup more permanent, I modified the c1lsc model to pipe the LO power monitor signals from the Demod chassis to the unused channels ADC_0_25 (X channel LO) and ADC_0_26 (Y channel LO). I also added a couple of CDS filter blocks inside the "ALS" namespace block in c1lsc so as to allow for calibration from counts to dBm. I didn't add any DQ channels for now, as I think the slow EPICS records will be sufficient for diagnostics. It is kind of overkill to use the fast channels for DC voltage monitoring, but until we have Acromag channels readily accessible at 1Y2, this will do.
The modified model compiled and installed successfully, though I have yet to restart it, given that doing so will likely require a major reboot of all vertex FEs. |
13727
|
Wed Apr 4 16:23:39 2018 |
gautam | Update | CDS | slow machine bootfest | [johannes, gautam]
It's been a while - but today, all slow machines (with the exception of c1auxex) were un-telnetable. c1psl, c1iool0, c1susaux, c1iscaux1, c1iscaux2, c1aux and c1auxey were rebooted. Usual satellite box unplugging was done to avoid ITMX getting stuck. |
|