Added Matlab to the Docker machine. This should help immensely with workflow as well as with keeping installed libraries consistent. The next step is outlining the project so coding is easier.
Command to launch is: $ matlab &
From Jon just for bookkeeping:
Then in the Matlab command window, open the CDS parts library via:
Then open an RTCDS model (for example, here the LSC plant) via:
The x1SUSsim model on the docker was made in a more recent version of Simulink so I updated Matlab (see this)
I updated Matlab to 2021a so now the docker has 2020b and 2021a installed. This should also install Simulink 10.3 for the sus model to open. I used my account to activate it but I can change it over if I get a dedicated license for this. I am not sure what Jon did for the 2020b that he installed.
It is giving me "License error: -9,57", so I guess it didn't work... I will try to just make the model in 2020b so I have something.
I was able to fix the problem with the activation (using this).
I can now open the x1sussim.mdl file. It is trying to access a BSFM_MASTER library that it doesn't seem to have. The other files don't seem to have any warnings, though.
the simple suspension model can be found in home/controls/docker-cymac/userapps/c1simpsus.mdl on the docker system. This is where I will put my model. (right now it is just a copied file)
Also using Simulink on the docker is very slow. I think this is either a limit of the x2goclient software or the hardware that the docker is running on.
So I am stuck on how to add the control block to my model. I am trying to make it as simple as possible with just a simple transfer function for a damped harmonic oscillator and then the control block (see overview.png).
The transfer function I am using is:
For future generations: To measure the transfer function (to verify that it is doing what it says it is) I am using the discrete transfer function estimator block. To get a quick transfer function estimator Simulink program run ex_discrete_transfer_function_estimator in the Matlab command line. This works well for filters but it was hit or miss using the discrete transfer function.
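As a cross-check on whatever the estimator block reports, it can help to have the analytic response of the plant to compare against. The sketch below evaluates a generic damped harmonic oscillator response; the normalized form, the 1 Hz resonance and Q = 5 are illustrative assumptions and not the transfer function used in the model above.

```python
import cmath
import math

def oscillator_response(freq_hz, f0_hz=1.0, q_factor=5.0):
    """Complex response H(i*w) = w0^2 / (w0^2 - w^2 + i*w*w0/Q).

    f0_hz and q_factor are placeholder values, not the ones from the model.
    """
    w = 2 * math.pi * freq_hz
    w0 = 2 * math.pi * f0_hz
    return w0**2 / (w0**2 - w**2 + 1j * w * w0 / q_factor)

# Print magnitude and phase below, at, and above the resonance.
for f in (0.1, 1.0, 10.0):
    h = oscillator_response(f)
    phase_deg = math.degrees(cmath.phase(h))
    print(f"{f:5.1f} Hz  |H| = {abs(h):.4f}  phase = {phase_deg:7.2f} deg")
```

With this normalization the magnitude is 1 at DC and equals Q on resonance, which gives two easy points to compare against a measured curve.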
The roadblock I am running into right now is that I can't figure out how to add the controller to the model. Not on an interpretation level but in a very straightforward "can I drag it into my model and will it just work" kind of way.
I am also a little confused as to exactly which block would do the controlling. Because I want to just use the x of the pendulum (its position), I think I want to use the suspension controls, which are connected to the suspension plant model. But where exactly are they, and how can I get the model? I can't seem to find it.
The controller would be in the c1sus model, and connects to the c1sup plant model. So the controller doesn't go in the plant model.
Both the controller and the plant can be modeled using a single filter module in each separate model as you've drawn, but they go in separate models.
I have attached the framework that I am using for the full system. Plantframework.pdf has the important aspects that I will be changing. Right now I am trying to keep it mostly as is, but I have disconnected the Optic Force Noise and hope to disconnect the Suspension Position Noise. The Optic Force Noise block is additive to the signal, so eliminating it from the system should make it less realistic but simpler. It can be added back easily by reconnecting it.
The next step is adding my plant response, which is simply the transfer function and measurement from the last post. These should be inserted in the place of the red TM_RESP in the model.
The TM_RESP block takes in a vector of dimension 12 and returns a vector of dimension 6. The confusing part is that the block does not seem to do anything. It simply passes the vector through with no changes. I'm not sure why this is the case, and I am looking for documentation to explain it but haven't found anything. I am also lost as to how a 12-vector turns into a 6-vector. I will probably just disconnect everything but the x position.
I tried to just throw in my model (see Simple_Plant.pdf) and see what happened, but the model would not let me add built-in blocks. This is weird because all the blocks that I am adding are part of the basic library. My guess is that this model will only accept blocks from the CDS library. I will either need to rebuild my blocks from blocks in the CDS library, or maybe I can pass the signal out of the plant framework model, into my model, and then back to the plant framework model. I think this is just a Matlab thing that I don't know about yet. (Jon probably knows)
I have also attached an image of the controls model for reference. It looks like a mess but I'm sure there is a method. I won't get lost in going through it just assume it works... for now.
The next question I have been asking is how to show that the system works. When Anchal and I made a python simulation of the system, we tested it by watching the evolution of the degrees of freedom over time given some initial conditions. We could see the pendulum propagating and some of the coupling between the DOFs. This is a fast and dirty way to check if everything is working and should be easy to add. I simply record the POS signal and graph it over time. Once we get to a state-space model we can test it by taking the transfer function, but since our plant is literally already just a transfer function there isn't much of a point yet.
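The time-evolution check described above can be sketched in a few lines. This is not the python simulation mentioned in the text; it is a minimal single-degree-of-freedom stand-in, with the resonance, Q, time step and initial displacement all chosen as placeholder values.

```python
import math

def simulate_pendulum(x0=1e-3, v0=0.0, f0_hz=1.0, q_factor=5.0,
                      dt=1e-3, duration=5.0):
    """Integrate x'' + (w0/Q) x' + w0^2 x = 0 with semi-implicit Euler steps.

    All default parameters are illustrative assumptions.
    """
    w0 = 2 * math.pi * f0_hz
    x, v = x0, v0
    times, positions = [0.0], [x0]
    for step in range(1, int(round(duration / dt)) + 1):
        a = -(w0 / q_factor) * v - w0**2 * x  # damping plus restoring term
        v += a * dt                           # update velocity first
        x += v * dt                           # then position with new velocity
        times.append(step * dt)
        positions.append(x)
    return times, positions

times, positions = simulate_pendulum()
first = max(abs(p) for p in positions[:1000])
last = max(abs(p) for p in positions[-1000:])
print(f"peak |x| in first second: {first:.2e} m")
print(f"peak |x| in last second:  {last:.2e} m")
```

Plotting positions against times should show a decaying oscillation; a ring-down that grows instead usually points to a sign error or too large a time step.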
Also, I need to add color to my Simple_Plant.pdf model because it looks really boring :(
Action Items from Last Week:
Action Items this Week and LEAD PERSON:
Assemble and ship 4 TTs from LHO - SURESH
Prepare electronics for TTs (coil drivers) - JAMIE
In-air TT testing to confirm we can control / move TTs before we vent (starting in ~2 weeks) - SURESH
Connect TTs to digital system and controls, lay cables if needed - JAMIE with SURESH
OAF comparison plot, both online and offline, comparing static, adaptive and static+adaptive - DEN
Static-only OAF noise budget (Adaptive noise budget as next step) - DEN
Black glass: big baffle pieces to clean&bake, get small pieces from Bob, put into baskets, make new basket for 1" pieces, get to clean&bake - KOJI
IPPOS beam measurement - SURESH with JENNE
AS beam measurement (if beam is bright enough) - SURESH and JENNE
Mode matching calculations, sensitivity to MC waist measurement errors, PRM position - JENNE
Summary of IFO questions, measurements to take, and game plan - JENNE
Think up diagnostic measurement to determine mode matching to PRC while chambers are open, while we tweak MMT - JAMIE, JENNE, KOJI, SURESH
Arm cavity sweeps, mode scan - JENNE
Align AS OSA (others?) - JENNE
Investigate PRMI glitches, instability (take PRM oplev spectra locked, unlocked, to see if PRM is really moving) - KOJI, JENNE, DEN
Connect up beatbox for Xarm use - KOJI with JENNE
Build amplifiers for new small microphones - DEN
Black glass: to clean&bake - KOJI
Scattered light measurement at the end stations: design / confirmation of the mechanical parts/optics/cameras - JAN
Upgrade Rossa, Allegra to Ubuntu, make sure Dataviewer and DTT work - JAMIE
I'm combining the IFO check-up list (elog 6595) and last week's action items list (elog 6597). I thought about making it a wiki page, but this way everyone has to at least scroll past the list ~1/week.
Feel free to cross things out as you complete them, but don't delete them. Also, if there's a WHO?? and you feel inspired, just do it!
Dither-align arm to get IR on actuation nodes, align green beam - JENNE
Arm cavity sweeps, mode scan - JENNE
ASS doesn't run on Ubuntu! or CentOS Fix it! - JENNE, JAMIE's help
Input matrices, output filters to tune SUS. check after upgrade. - JENNE
POX11 whitening is not toggling the analog whitening??? - JAMIE, JENNE, KOJI
THE FULL LIST:
cd /opt/rtcds/caltech/c1/burt/autoburt/today/ -
Align Ygreen beam - JENNE, YUTA
Arm cavity sweeps, mode scan - JENNE, YUTA
ASS doesn't run on Ubuntu! or CentOS Fix it! - YUTA, JENNE, JAMIE's help
Decide on plots for 40m Summary page - DEN, STEVE, JENNE, KOJI, JAMIE, YUTA, SURESH, RANA, DUNCAN from Cardiff/AEI
Look into PMC PZT drift - PZT failing? Real MC length change? - JENNE, KOJI, YUTA
We tried to locate the sixteen analog output channels we need to control the four tip-tilts (four coils on each). We have only 8 available channels on the C1SUS machine.
So we will have to plug-in a new DAC output card on one of the machines and it would be logical to do that on the C1IOO machine as the active tip-tilts are conceptually part of the IOO sub-system. We have to procure this card if we do not already have it. We have to make an interface between this card output and a front panel on the 1X2 rack. We may have to move some of the sub-racks on the 1X2 rack to accommodate this front panel.
We checked out the availability of cards (De-whitening, Anti-imaging, SOS coil drivers) yesterday. In summary: we have all the cards we need (and some spares too). As the De-whitening and Anti-imaging cards each have 8 channels, we need only two of each to address the sixteen channels. And we need four of the SOS coil drivers, one for each tip-tilt. There are 9 slots available on the C1IOO satellite expansion chassis (1X1 rack), where these eight cards could be accommodated.
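The card count above is simple bookkeeping, written out here so it can be redone if the number of tip-tilts or the channel count per card changes. All numbers are the ones quoted in the text; the variable names are just illustrative.

```python
import math

tip_tilts = 4
coils_per_tip_tilt = 4
channels_per_card = 8   # de-whitening and anti-imaging cards
free_slots = 9          # C1IOO satellite expansion chassis (1X1 rack)

channels = tip_tilts * coils_per_tip_tilt
dewhitening_cards = math.ceil(channels / channels_per_card)
anti_imaging_cards = math.ceil(channels / channels_per_card)
coil_driver_cards = tip_tilts  # one SOS coil driver per tip-tilt
total_cards = dewhitening_cards + anti_imaging_cards + coil_driver_cards

print(f"channels needed: {channels}")
print(f"cards needed:    {total_cards} (slots free: {free_slots})")
print(f"fits in chassis: {total_cards <= free_slots}")
```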
There are two 25-pin feed-throughs, where the PZT drive signals currently enter the BS chamber. We will have to route the SOS coil driver outputs to these two feed-throughs.
Inside the BS chamber, there are cables which carry the PZT signals from the chamber wall to the table top, where they are anchored to a post (L-bracket). We need a 25-pin-to-25-pin cable (~2m length) to go from the post to the tip-tilt (one for each tip-tilt). And then, of course, we need quadrapus cables (requested from Rich) which fit inside each tip-tilt to go to the BOSEMs.
I am summarising it all here to give an overview of the work involved.
I'm working on developing a full noise budget for the 40m. To that end, I'll use measurements from the GUR1 seismometer to characterize seismic noise. Without any unit calibration, I found the following spectrum,
To extract useful information from this data, I first used the calibration from "/users/Templates/Seismic-Spectra_121213.xml" to obtain the spectrum in [m / s / sqrt(Hz)].
calibrated_data = raw_data * 3.8e-09
I then divided each point in the power spectrum by the frequency of said point to obtain [m / sqrt(Hz)]. I don't think we can simply divide the whole spectrum by 40 meters to obtain [RIN / sqrt(Hz)], although that was my immediate intuition. Having power spectra of all the major noise contributions in units of [RIN / sqrt(Hz)] would make designing an appropriate filtering servo fairly straightforward.
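A minimal sketch of the unit conversion, assuming the spectrum is an amplitude spectral density. For an amplitude spectrum the step from velocity to displacement is a division by the angular frequency 2*pi*f, so dividing by f alone leaves a factor of 2*pi; for a power spectrum the divisor would be (2*pi*f)^2. The function name and the sample inputs are hypothetical; only the 3.8e-09 factor comes from the text.

```python
import math

COUNTS_TO_VELOCITY = 3.8e-9  # calibration quoted above, (m/s) per raw unit

def displacement_asd(freqs_hz, raw_asd):
    """Convert a raw amplitude spectrum to displacement [m/sqrt(Hz)].

    freqs_hz must be strictly positive (skip the DC bin before calling).
    """
    out = []
    for f, raw in zip(freqs_hz, raw_asd):
        velocity = raw * COUNTS_TO_VELOCITY       # [m/s/sqrt(Hz)]
        out.append(velocity / (2 * math.pi * f))  # [m/sqrt(Hz)]
    return out

# Hypothetical raw values, only to show the call.
freqs = [0.1, 1.0, 10.0]
raw = [500.0, 200.0, 20.0]
for f, d in zip(freqs, displacement_asd(freqs, raw)):
    print(f"{f:5.1f} Hz  {d:.3e} m/sqrt(Hz)")
```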
I made a node to collect drawings/schematics for the 40m OMC, added the length drive for now. We should collect other stuff (TT drivers, AA/AI, mechanical drawings etc) there as well for easy reference.
Some numbers FTR:
P-pol = purple
S-pol = red
The .graffle file for this is in the 40m SVN's omnigraffle dir/
Koji pointed out during the group meeting that I should compensate for local tilt when I move the beam around the mirror for calculating the loss map.
So I did.
Also, I made a mistake earlier by calculating the loss map for a much bigger (X7) area than what I thought.
Both these mistakes made it seem like the loss is very inhomogeneous across the mirror.
Attachment 1 and 2 show the corrected loss maps for ITMX and ETMX respectively.
The loss now seems much more reasonable and homogeneous and the average total arm loss sums up to ~ 22ppm which is consistent with the after-cleaning arm loss measurements.
I finished calculating the X Arm loss using first-order perturbation theory. I will post the details of the calculation later.
I calculated loss maps of ITM and ETM (attachments 1,2 respectively). It's a little different than previous calculation because now both mirrors are considered and total cavity loss is calculated. The map is calculated by fixing one mirror and shifting the other one around.
The total loss is pretty much the same as calculated before using a different method. At the center of the mirror, the loss is 21.8ppm, which is very close to the value that was calculated.
Next thing is to try SIS.
The cavity modes , where q is the complex beam parameter and m,n is the mode index, are the eigenmodes of the cavity propagator. That is:
where is the mirror reflection matrix. At the 40m, ITM is flat, so . ETM is curved, so , where R is the ETM's radius of curvature.
is the Gouy phase.
is the free-space field propagator. When acting on a state it propagates the field a distance L.
The phase maps perturb the reflection matrices slightly so:
where h_1 and h_2 are the height profiles of the ITM and ETM, respectively. The new propagator is
, where is the unperturbed propagator and
To find the perturbed ground state mode we use first-order perturbation theory. The new ground state is then
Where N is the normalization factor. The (0,1) and (1,0) modes are omitted because they can be zeroed by tilting the mirrors. Gouy phase of TEM00 mode is taken to be 0.
Some simplification can be made here:
The last step is possible since the beam parameter q matches the cavity.
The loss of the TEM00 mode is then:
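For reference, the generic first-order result that this kind of argument relies on can be written as follows. This is a textbook sketch and not the expressions from the original post: the symbols $U_0$, $\delta U$, $\lambda_{mn}$ and $c_{mn}$ are introduced here only for illustration, and the sketch assumes the unperturbed modes $|m,n\rangle$ are orthonormal and not degenerate with the ground mode.

```latex
U = U_0 + \delta U, \qquad U_0\,|m,n\rangle = \lambda_{mn}\,|m,n\rangle

|\psi\rangle \approx N\Big(|0,0\rangle + \sum_{m+n\ge 2} c_{mn}\,|m,n\rangle\Big),
\qquad
c_{mn} = \frac{\langle m,n|\,\delta U\,|0,0\rangle}{\lambda_{00}-\lambda_{mn}}

N = \Big(1 + \sum_{m+n\ge 2} |c_{mn}|^2\Big)^{-1/2},
\qquad
1 - |\langle 0,0|\psi\rangle|^2 = 1 - N^2
```

The sum starts at $m+n\ge 2$ because the (0,1) and (1,0) terms are dropped as described above, and $1 - N^2$ is the fraction of power that ends up outside the TEM00 mode under these assumptions.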
I have a serious concern about this low angle scattering analysis:
Phase maps perturb the spatial mode of the steady-state of the cavity, but how is this different than mode-mismatch? The loss that I calculated is an overall loss, not roundtrip loss.
The only way I can think this can become serious loss is when the HOMs themselves have very high roundtrip loss. Attached is the modal power fraction that I calculated.
These are all priority action items that need to be done before I come back (in mid-September).
BE PREPARED FOR THE FULL LOCK!
- Prepare and install tip-tilts -JAMIE
- Adjust IP-ANG -JAMIE, JENNE, KOJI
- Make sure there's no clipping. Start from MC centering -JAMIE, JENNE, KOJI
- Make ASS and A2L work -JENNE, JAMIE
- Better MC spot position measurement script (see the last sentence in elog #6892) -JENNE
- Daily beam spot measurements for IFO, just like MC -JENNE
- ASS for green using PZT steering mirrors on end table -JENNE
- Modeling of phase tracking ALS -JAMIE
- PZT mounts for PSL and ALS beams -JENNE, KOJI
- Add temperature sensors for end lasers to CDS slow channels -JENNE
- Put green trans camera, GTRY PD, and GTRX PD on PSL table -JENNE
- Better beat box; include comparators, frequency dividers, and whitening filters -JAMIE, KOJI
- Adjust servo gain/filters of end green PDH lock (reduce frequency noise) -JENNE
- Add on/off switch, gain adjuster, etc to CDS for end green PDH lock -JENNE, JAMIE
- Find why and reduce 3 Hz motion -JENNE
- Simulation of PRMI with clipping -YUTA
- Alignment tolerance of PRMI -YUTA
Bob, Callum and Daphen noted that our keeping a JDSU HeNe (max power <4mW) is against somebody's SOP. So I cleared everything that relates to 40m SOS suspending to the bottom shelf of the 2nd cabinet in the cleanroom (the back set of cabinets nearest the flow benches). The door has a nifty label. Things that are in there include:
QPD and micrometer mount
microscope and micrometer mount
Al beam block
Magnet gluing fixture
dumbbell gluing fixture
The electronics that we use (HeNe's power supply, 'scope, QPD readout) are still on the roll-y thing under the flow bench.
Set up gwsumm on optimus and generated summary pages from both L1 and C1 data. Still a few manual steps need to be taken during generation, not fully automated due to some network/username issues. nds2 now working from optimus after restarting nds2 server.
Here's the gist of the requirements on the 5x frequency multiplier for the upgrade (see attachment - despite the preview down here in the elog, there are 3 pages).
An extended version is available on the wiki.
A more complete document is in preparation and will soon be available on the wiki as well.
I wrote down the settings according to which I tuned the optickle model of the 40m Upgrade.
Basically I set it so that:
In this way when the carrier becomes resonant in the arms we have:
The DARM offset for DC readout is optional, and doesn't change those conditions.
I also plotted the carrier and the sideband's circulating power for both recycling cavities.
I'm attaching a file containing more detailed explanations of what I said above. It also contains the plots of field powers, and transfer functions from DARM to the dark port. I think they don't look quite right. There seems to be something wrong.
Valera thought of fixing the problem, removing the 180 degree offset on the SRM, which is what makes the sideband rather than the carrier resonant in SRC. In his model the carrier becomes resonant and the sideband anti-resonant. I don't think that is correct.
The resonant-carrier case is also included in the attachment (the plots with SRMoff=0 deg). In the plots the DARM offset is always zero.
I'm not sure why the settings are not producing the expected transfer functions.
In my calculation of the digital filters of the optical transfer functions the carrier light is resonant in coupled cavities and the sidebands are resonant in recycling cavities (provided that macroscopic lengths are chosen correctly which I assumed).
Carrier and SB (f2) shouldn't be resonant at the same time in the SRC-arms coupled cavity. No additional filtering of the GW signal is wanted.
The SRC macroscopic length is chosen to be = c / f2 - rather than = [ (n+1/2) c / (2*f2) ] - for that purpose.
I calculated the frequency of the double cavity pole for the 40m SRC-arm coupled cavity.
w_cc = (1 + r_srm)/(1- r_srm) * w_c
where w_c is the arm cavity pole angular frequency [w_c = w_fsr * (1-r_itm * r_etm)/sqrt(r_itm*r_etm) ]
I found the pole at about 160 kHz. This number coincides with what I got earlier with my optickle model configured and tuned as I said in my previous entry. See attachments for plots of transfer functions with 0 and 10pm DARM offsets, respectively.
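The two expressions above are easy to evaluate numerically. The sketch below implements them exactly as written; the reflectivities are placeholders and NOT the 40m design values, and w_fsr is set to 1 so the result is in units of w_fsr. Plugging in the real mirror reflectivities and free spectral range is left to the reader.

```python
import math

def arm_pole(w_fsr, r_itm, r_etm):
    """w_c = w_fsr * (1 - r_itm*r_etm) / sqrt(r_itm*r_etm), as written above."""
    return w_fsr * (1 - r_itm * r_etm) / math.sqrt(r_itm * r_etm)

def coupled_pole(w_c, r_srm):
    """w_cc = (1 + r_srm) / (1 - r_srm) * w_c, as written above."""
    return (1 + r_srm) / (1 - r_srm) * w_c

# Placeholder amplitude reflectivities, chosen only for illustration.
w_c = arm_pole(w_fsr=1.0, r_itm=0.99, r_etm=0.99)
w_cc = coupled_pole(w_c, r_srm=0.9)
print(f"w_c  = {w_c:.6f} (units of w_fsr)")
print(f"w_cc = {w_cc:.6f} (units of w_fsr)")
```

The ratio w_cc / w_c depends only on r_srm, so a higher SRM reflectivity pushes the coupled-cavity pole further above the arm pole.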
I think the resonance at about 20 Hz that you can see in the case with non-zero DARM offset is due to radiation pressure. Koji suggested that I could check the hypothesis by changing either the mirrors' masses or the input power to the interferometer. When I did, the frequency and quality factor of the resonance changed, as you would expect for a radiation pressure effect.
This gave me more confidence about my optickle model of the 40m. This is quite comforting since I used that model other times in the past to calculate several things (i.e. effects of higher unwanted harmonics from the oscillator, or, recently, the power at the ports due to the SB resonating in the arms).
Mike Pedraza came by today to install a new wireless network router configured for the 40m lab network. It has a 'secret' SSID i.e. not meant for public use outside the lab. You can look up the password and network name on the rack. Pictures below show the location of the labels.
Mike P swapped in a new network router Linksys E1000
There was a campus power outage this morning (Dec 30, 2022).
- ELOG is still up
- 6 CDS machines are up
- Vacuum system and pressure all look good
I believe this power outage happened on the other side of the campus and did not affect the 40m.
The following is not 100% accurate, but represents my understanding of the events currently. I'm trying to get a full description from Christian and will hopefully be able to update this information later today.
Last night around 7:30 pm, Caltech detected evidence of a computer virus located behind a linksys router with a mac address matching our NAT router, and at the IP 18.104.22.168. We did not initially recognize the mac address as the router's because the labeled mac address was off by a digit, so we were looking for another old router for a while. In addition, pings to 22.214.171.124 were not working from inside or outside of the martian network, but the router was clearly working.
However, about 5 minutes after Christian and Mike left, I found I could ping the address. When I placed the address into a web browser, the address brought us to the control interface for our NAT router (but only from the martian side, from the outside world it wasn't possible to reach it).
They turned on logging on the router (which had been off by default) and started monitoring the traffic for a short time. Some unusual IP addresses showed up, and Mike mentioned that an IP-spoofing warning had come up. A file sharing port showing up was briefly mentioned as well.
The outside IP address was changed to 126.96.36.199, and dhcp, which apparently was on, was turned off. The password was changed and is in the usual place we keep router passwords.
Update: Christian said Mike has written up a security report and that he'll talk to him tomorrow and forward the relevant information to me. He notes there is possibly an infected laptop/workstation still at large. This could also be a personal laptop that was accidentally connected to the martian network. Since the router was found to be set to dhcp, it's possible a laptop was connected to the wrong side and the user might not have realized this.
The 40m computers were responding sluggishly yesterday, to the point of being unusable.
The mx_stream code running on c1iscex (the X end suspension control computer) went crazy for some reason. It was constantly writing to a log file in /cvs/cds/rtcds/caltech/c1/target/fb/192.168.113.80.log. In the past 24 hours this file had grown to approximately 1 TB in size. The computer had been turned back on yesterday after having reconnected its IO chassis, which had been moved around last week for testing purposes - specifically, plugging the c1ioo IO chassis into it to confirm it had timing problems.
The mx_stream code was killed on c1iscex and the 1 TB file removed.
Computers are now more usable.
We still need to investigate exactly what caused the code to start writing to the log file non-stop.
Alex believes this was due to a missing entry in the /diskless/root/etc/hosts file on the fb machine. It didn't list the IP and hostname for the c1iscex machine. I have now added it. c1iscex had been added to the /etc/dhcp/dhcpd.conf file on fb, which is why it was able to boot at all in the first place. With the addition of the automatic start up of mx_streams in the past week by Alex, the code started, but without the correct ip address in the hosts file, it was getting confused about where it was running and constantly writing errors.
When adding a new FE machine, add its IP address and its hostname to the /diskless/root/etc/hosts file on the fb machine.
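A small check along these lines could catch a missing entry before a front end is booted. This is a sketch, not an existing 40m script: the helper names are hypothetical, the sample text stands in for /diskless/root/etc/hosts, and the address given for c1iscex is assumed from the log file name above.

```python
def hosts_entries(text):
    """Map each hostname or alias in hosts-format text to its IP address."""
    entries = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            entries[name] = ip
    return entries

def missing_hosts(text, required):
    """Return the required hostnames that have no entry in the hosts text."""
    known = hosts_entries(text)
    return [name for name in required if name not in known]

# Stand-in for the real file contents; the c1iscex address is an assumption.
sample = """
127.0.0.1      localhost
192.168.113.80 c1iscex   # assumed from the log file name
"""
print(missing_hosts(sample, ["c1iscex", "c1ioo"]))
```

In practice the text would be read from the hosts file on fb and the required list taken from the hosts named in dhcpd.conf.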
The moral of the story is, PUT THINGS IN THE ELOG. This wild process is one of those things where people say 'this won't affect anything', but in fact it wastes several hours of time.
As part of the effort to debug what was happening with the slow computers, I disabled the auto MEDM snapshots process that Yoichi/Kakeru set up a long time ago:
We have to re-activate it now that the MEDM screen locations have been changed. To do this, we have to modify the crontab on nodus and also the scripts that the cron is calling. I would prefer to run this cron on some linux machine since nodus starts to crawl whenever we run ImageMagick stuff.
Also, we should remember to start moving the old target/ directories into the new area. All of the slow VME controls are still not in opt/rtcds/.
What are the critical filesystems? I've also indicated the size of these disks and the volume currently used, and the current backup situation.
Not backed up
LDAS pulls files from nodus daily via rsync, so there's no cron job for us to manage. We just allow incoming rsync.
Local backup on /media/40mBackup on chiara via daily cronjob
Remote backup to ldas-cit.ligo.caltech.edu::40m/cvs via daily cronjob on nodus
Currently mounted on Megatron, not backed up.
Then there is Optimus, but I don't think there is anything critical on it.
So, based on my understanding, we need to back up a whole bunch of stuff, particularly the boot disks and root filesystems for Chiara, Megatron and Nodus. We should also test that the backups we make are useful (i.e. we can recover current operating state in the event of a disk failure).
Please edit this elog if I have made a mistake. I also don't have any idea about whether there is any sort of backup for the slow computing system code.
In addition to bootable full disk backups, it would be wise to make sure the important service configuration files from each machine are version controlled in the 40m SVN. Things like apache files on nodus, martian hosts and DHCP files on chiara, nds2 configuration and init scripts on megatron, etc. This can make future OS/hardware upgrades easier too.
I first initialized the drives by hooking them up to my computer and running the setup.app file. After this, plugging the drive into the respective machine and running lsblk, I was able to see the mount point of the external drive. To actually initialize the backup, I ran the following command from a tmux session called ddBackupLaCie:
sudo dd if=/dev/sda of=/dev/sdb bs=64K conv=noerror,sync
Here, /dev/sda is the disk with the root filesystem, and /dev/sdb is the external hard drive. The installed version of dd is 8.13. Newer coreutils releases (8.24 onwards) offer a status=progress option, but I didn't want to go through the exercise of upgrading coreutils on multiple machines, so we just have to wait until the backup finishes.
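Without a progress indicator, a rough time estimate is still possible if the throughput is known. GNU dd prints its running byte count when sent SIGUSR1, which gives both the bytes copied so far and the rate. The helper below is a hypothetical sketch that assumes the rate stays constant; the disk size and rate in the example are made-up values.

```python
def dd_eta_hours(total_bytes, bytes_done, rate_bytes_per_s):
    """Hours remaining for a copy, assuming a constant transfer rate."""
    if rate_bytes_per_s <= 0:
        raise ValueError("rate must be positive")
    remaining = max(total_bytes - bytes_done, 0)
    return remaining / rate_bytes_per_s / 3600.0

# Hypothetical example: a 1 TB disk copied at an assumed 40 MB/s.
print(f"{dd_eta_hours(1e12, 0, 40e6):.1f} h remaining")
```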
We also wanted to do a backup of the root of FB1 - but I'm not sure if dd will work with the external hard drive, because I think it requires the backup disk size (for us, 1TB) to be >= origin disk size (which on FB1, according to df -h, is 2TB). Unsure why the root filesystem of FB is so big, I'm checking with Jamie what we expect it to be. Anyways we have also acquired 2TB HGST SATA drives, which I will use if the LaCie disks aren't an option.
After consulting with Jamie, we reached the conclusion that the reason why the root of FB1 is so huge is because of the way the RAID for /frames is setup. Based on my googling, I couldn't find a way to exclude the nfs stuff while doing a backup using dd, which isn't all that surprising because dd is supposed to make an exact replica of the disk being cloned, including any empty space. So we don't have that flexibility with dd. The advantage of using dd is that if it works, we have a plug-and-play clone of the boot disk and root filesystem which we can use in the event of a hard-disk failure.
I am trying option 3 now. dd, however, does require that the destination drive size be >= source drive size - I'm not sure if this is true for the HGST drives. lsblk suggests that the drive size is 1.8TB, while the boot disk, /dev/sda, is 2TB. Let's see if it works.
Backup of chiara is done. I checked that I could mount the external drive at /mnt and access the files. We should still do a check of trying to boot from the LaCie backup disk, need another computer for that.
nodus backup is still not complete according to the console - there is no progress indicator so we just have to wait I guess.
This is not quite right. First of all, /frames is not NFS. It's a mount of a local filesystem that happens to be on a RAID. Second, the frames RAID is mounted at /frames. If you do a dd of the underlying block device (in this case /dev/sda*), you're not going to copy anything that's mounted on top of it.
What I was saying about /frames is that I believe there is data in the underlying directory /frames that the frames RAID is mounted on top of. In order to not get that in the copy of /dev/sda4 you would need to unmount the frames RAID from /frames, and delete everything from the /frames directory. This would not harm the frames RAID at all.
But it doesn't really matter because the backup disk has space to cover the whole thing so just don't worry about it. Just dd /dev/sda to the backup disk and you'll just be copying the root filesystem, which is what we want.
The nodus backup too is now complete - however, I am unable to mount the backup disk anywhere. I tried on a couple of different machines (optimus, chiara and pianosa), but always get the same error:
mount: unknown filesystem type 'LVM2_member'
The disk itself is being recognized, and I can see the partitions when I run lsblk, but I can't get the disk to actually mount.
Doing a web search, I came across a few blog posts that suggest the problem can be resolved using the vgchange utility - but I am not sure what exactly this does, so I am holding off on trying.
To clarify, I performed the cloning by running
sudo dd if=/dev/sda of=/dev/sdb bs=64K conv=noerror,sync
in a tmux session on nodus (as I did for chiara and FB1; the latter backup is still running).
The FB1 dd backup process seems to have finished too - but I got the following message:
dd: error writing ‘/dev/sdc’: No space left on device
30523666+0 records in
30523665+0 records out
2000398934016 bytes (2.0 TB) copied, 50865.1 s, 39.3 MB/s
Running lsblk shows the following:
controls@fb1:~ 32$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb 8:16 0 23.5T 0 disk
└─sdb1 8:17 0 23.5T 0 part /frames
sda 8:0 0 2T 0 disk
├─sda1 8:1 0 476M 0 part /boot
├─sda2 8:2 0 18.6G 0 part /var
├─sda3 8:3 0 8.4G 0 part [SWAP]
└─sda4 8:4 0 2T 0 part /
sdc 8:32 0 1.8T 0 disk
├─sdc1 8:33 0 476M 0 part
├─sdc2 8:34 0 18.6G 0 part
├─sdc3 8:35 0 8.4G 0 part
└─sdc4 8:36 0 1.8T 0 part
While I am able to mount /dev/sdc1, I can't mount /dev/sdc4, for which I get the error message
controls@fb1:~ 0$ sudo mount /dev/sdc4 /mnt/HGSTbackup/
mount: wrong fs type, bad option, bad superblock on /dev/sdc4,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so.
Looking at dmesg, it looks like this error is related to the fact that we are trying to clone a 2TB disk onto a 1.8TB disk - it complains about block size exceeding device size.
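The size mismatch can be checked from byte counts before starting a clone. The sketch below uses the byte counts from the dd outputs quoted in this thread, on the inference that the failed copy stopped exactly when the target was full and that the later successful copy read the whole source disk. The helper name is hypothetical.

```python
def fits_on_target(source_bytes, target_bytes):
    """Return (ok, shortfall_bytes) for a whole-disk clone."""
    shortfall = max(source_bytes - target_bytes, 0)
    return shortfall == 0, shortfall

# Byte counts as reported by dd in this thread (see the assumptions above).
target_bytes = 2000398934016  # the disk lsblk shows as 1.8T
source_bytes = 2199022206976  # the disk lsblk shows as 2T

ok, shortfall = fits_on_target(source_bytes, target_bytes)
print(f"clone fits: {ok}, short by {shortfall / 1e9:.1f} GB")
print(f"sizes in TiB: {source_bytes / 2**40:.2f} vs {target_bytes / 2**40:.2f}")
```

lsblk reports sizes in powers of 1024 while dd's summary uses powers of 1000, which is why the same target shows up as 1.8T in one place and 2.0 TB in the other.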
The 4TB HGST drives have arrived. I've started the FB1 dd backup process. Should take a day or so.
Looks to have worked this time around.
controls@fb1:~ 0$ sudo dd if=/dev/sda of=/dev/sdc bs=64K conv=noerror,sync
33554416+0 records in
33554416+0 records out
2199022206976 bytes (2.2 TB) copied, 55910.3 s, 39.3 MB/s
You have new mail in /var/mail/controls
I was able to mount all the partitions on the cloned disk. Will now try booting from this disk on the spare machine I am testing in the office area now. That'd be a "real" test of if this backup is useful in the event of a disk failure.
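One quick sanity check on a finished dd run is that the record count times the block size reproduces the reported byte total. The sketch below does this for the successful FB1 copy, using the numbers from the dd summary; the helper name is hypothetical.

```python
BLOCK_SIZE = 64 * 1024  # bs=64K as passed to dd

def bytes_from_records(full_records, block_size=BLOCK_SIZE):
    """Bytes transferred for a given number of full dd records."""
    return full_records * block_size

records_out = 33554416        # "33554416+0 records out"
reported_bytes = 2199022206976
elapsed_s = 55910.3

copied = bytes_from_records(records_out)
print(f"bytes from records: {copied}")
print(f"matches dd summary: {copied == reported_bytes}")
print(f"average rate: {copied / elapsed_s / 1e6:.1f} MB/s")
```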
None of the 3 dd backups I made were bootable - at boot, selecting the drive put me into grub rescue mode, which seemed to suggest that the /boot partition did not exist on the backed up disk, despite the fact that I was able to mount this partition on a booted computer. Perhaps for the same reason, but maybe not.
After going through various StackOverflow posts / blogs / other googling, I decided to try cloning the drives using ddrescue instead of dd.
This seems to have worked for nodus - I was able to boot to console on the machine called rosalba which was lying around under my desk. I deliberately did not have this machine connected to the martian network during the boot process for fear of some issues because of having multiple "nodus"-es on the network, so it complained a bit about starting the elog and other network related issues, but seems like we have a plug-and-play version of the nodus root filesystem now.
chiara and fb1 rootfs backups (made using ddrescue) are still not bootable - I'm working on it.
Nov 6 2017: I am now able to boot the chiara backup as well - although mysteriously, I cannot boot it from the machine called rosalba, but can boot it from ottavia. Anyways, seems like we have usable backups of the rootfs of nodus and chiara now. FB1 is still a no-go, working on it.
As part of the fb40m restart procedure (Sanjit and I were restarting it to add some new channels so they can be read by the OAF model), I checked up on how the backup has been going. Unfortunately the answer is: not well.
Alan imparted to me all the wisdom of frame builder backups on September 28th of this year. Except for the first 2 days of something having gone wrong (which was fixed at that time), the backup script hasn't thrown any errors, and thus hasn't sent any whiny emails to me. This is seen by opening up /caltech/scripts/backup/rsync.backup.cumlog , and noticing that after October 1, 2009, all of the 'errorcodes' have been zero, i.e. no error (as opposed to 'errorcode 2' when the backup fails).
However, when you ssh to the backup server to see what .gwf files exist, the last one is at gps time 941803200, which is Nov 9 2009, 11:59:45 UTC. So, I'm not sure why no errors have been thrown, but also no backups have happened. Looking at the rsync.backup.log file, it says 'Host Key Verification Failed'. This seems like something which isn't changing the errcode, but should be, so that it can send me an email when things aren't up to snuff. On Nov 10th (the first day the backup didn't do any backing-up), there was a lot of Megatron action, and some adding of StochMon channels. If the fb was restarted for either of these things, and the backup script wasn't started, then it should have had an error, and sent me an email. Since any time the frame builder's backup script hasn't been started properly it should send an email, I'm going to go ahead and blame whoever wrote the scripts, rather than the Joe/Pete/Alberto team.
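For reference, the GPS-to-UTC conversion quoted above can be reproduced as follows. This is a minimal sketch: the function name is hypothetical, and the leap-second offset is passed in by hand (GPS was 15 s ahead of UTC throughout 2009) rather than looked up from a table.

```python
from datetime import datetime, timedelta, timezone

GPS_EPOCH = datetime(1980, 1, 6, tzinfo=timezone.utc)


def gps_to_utc(gps_seconds: int, leap_seconds: int) -> datetime:
    """Convert GPS seconds to UTC, given the GPS-UTC offset in force at that time."""
    return GPS_EPOCH + timedelta(seconds=gps_seconds - leap_seconds)


# GPS time of the last frame file found on the backup server.
print(gps_to_utc(941803200, leap_seconds=15))  # 2009-11-09 11:59:45+00:00
```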
Since our new raid disk is ~28 days of local storage, we won't have lost anything on the backup server as long as the backup works tonight (or sometime in the next few days), because the backup is an rsync, so it copies anything which it hasn't already copied. Since the fb got restarted just now, hopefully whatever funny business (maybe with the .agent files???) will be gone, and the backup will work properly.
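The margin argument above can be made explicit: the backlog is the time since the last frame that reached the backup server, and data is only lost once that backlog exceeds the local retention. A small sketch, assuming the ~28 day retention quoted above; the function name and the "now" value in the example are hypothetical.

```python
SECONDS_PER_DAY = 86400


def days_of_margin(last_backed_up_gps: int, now_gps: int,
                   retention_days: float = 28.0) -> float:
    """Days left before the oldest un-backed-up frame ages off local storage.

    A negative result means frames have already been lost.
    """
    backlog_days = (now_gps - last_backed_up_gps) / SECONDS_PER_DAY
    return retention_days - backlog_days


# Hypothetical example: checking 10 days after the last successful backup.
last_good = 941803200
print(days_of_margin(last_good, last_good + 10 * SECONDS_PER_DAY))  # 18.0
```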
I'll check in with the frame builder again tomorrow, to make sure that it's all good.
All is well again in the world of backups. We are now up to date as of ~midnight last night.
Backup Fail. At least this time however, it threw the appropriate error code, and sent me an email saying that it was unhappy. Alan said he was going to check in with Stuart regarding the confusion with the ssh-agent. (The other day, when I did a ps -ef | grep agent, there were ~5 ssh-agents running, which could have been the cause of the backups failing without telling me that they had failed. The main symptom is that when I first restart all of the ssh-agent stuff, according to the directions in the Restart fb40m Procedures, I can do a test ssh over to ldas-cit, to see what frames are there. If I log out of the frame builder and log back in, then I can no longer ssh to ldas-cit without a password. This shouldn't happen: the ssh-agent is supposed to authenticate the connection so no passwords are necessary.)
I'm going to restart the backup script again, and we'll see how it goes over the long weekend.
Dan Kozak is rsync transferring /frames from NODUS over to the LDAS grid. He's doing this without a BW limit, but even so it's going to take a couple of weeks. If nodus seems pokey or the net connection to the outside world is too tight, then please let me and him know so that he can throttle the pipe a little.
The recently observed daqd flakiness looks related to this transfer. It appears to still be ongoing:
nodus:~>ps -ef | grep rsync
controls 29089 382 5 13:39:20 pts/1 13:55 rsync -a --inplace --delete --exclude lost+found --exclude .*.gwf /frames/trend
controls 29100 382 2 13:39:43 pts/1 9:15 rsync -a --delete --exclude lost+found --exclude .*.gwf /frames/full/10975 131.
controls 29109 382 3 13:39:43 pts/1 9:10 rsync -a --delete --exclude lost+found --exclude .*.gwf /frames/full/10978 131.
controls 29103 382 3 13:39:43 pts/1 9:14 rsync -a --delete --exclude lost+found --exclude .*.gwf /frames/full/10976 131.
controls 29112 382 3 13:39:43 pts/1 9:18 rsync -a --delete --exclude lost+found --exclude .*.gwf /frames/full/10979 131.
controls 29099 382 2 13:39:43 pts/1 9:14 rsync -a --delete --exclude lost+found --exclude .*.gwf /frames/full/10974 131.
controls 29106 382 3 13:39:43 pts/1 9:13 rsync -a --delete --exclude lost+found --exclude .*.gwf /frames/full/10977 131.
controls 29620 29603 0 20:40:48 pts/3 0:00 grep rsync
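To keep an eye on how many transfers are running at once, the ps output can be counted mechanically. A minimal sketch, assuming the eight-column `ps -ef` layout shown above (UID PID PPID C STIME TTY TIME CMD); the function name is hypothetical, and the sample lines are copied (truncated as they appear) from the listing.

```python
def count_processes(ps_output: str, program: str) -> int:
    """Count `ps -ef` lines whose command is `program`.

    Assumes STIME is a single token such as 13:39:20; processes started on an
    earlier day may show a two-token date and would need extra handling.
    """
    count = 0
    for line in ps_output.splitlines():
        fields = line.split(None, 7)  # the 8th field is the full command line
        if len(fields) < 8:
            continue
        argv0 = fields[7].split()[0]
        if argv0.rsplit("/", 1)[-1] == program:
            count += 1
    return count


sample = """\
controls 29089 382 5 13:39:20 pts/1 13:55 rsync -a --inplace --delete --exclude lost+found --exclude .*.gwf /frames/trend
controls 29100 382 2 13:39:43 pts/1 9:15 rsync -a --delete --exclude lost+found --exclude .*.gwf /frames/full/10975 131.
controls 29620 29603 0 20:40:48 pts/3 0:00 grep rsync
"""
print(count_processes(sample, "rsync"))  # 2 (the grep itself is not counted)
```

Applied to the full listing above, this gives seven concurrent rsync processes.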
Diagnosing the problem:
I logged into fb and ran "top". It said that fb was waiting for disk I/O ~60% of the time (according to the "%wa" number in the header). There were 8 nfsd (network file server) processes running with several of them listed in status "D" (waiting for disk). The daqd logs were ending with errors like the following suggesting that it couldn't keep up with the flow of data:
[Wed Oct 22 18:58:35 2014] main profiler warning: 1 empty blocks in the buffer
[Wed Oct 22 18:58:36 2014] main profiler warning: 0 empty blocks in the buffer
GPS time jumped from 1098064730 to 1098064731
This all pointed to the possibility that the file transfer load was too heavy.
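The "empty blocks" warnings can be pulled out of the daqd log to watch how close the buffer is to filling. This is a rough sketch that assumes the message format is exactly as in the excerpt above; the function name is hypothetical, and reading a count of zero as "buffer full" is an interpretation of the log rather than something confirmed from the daqd source.

```python
import re
from typing import List

EMPTY_BLOCKS = re.compile(r"main profiler warning: (\d+) empty blocks in the buffer")


def empty_block_counts(log_text: str) -> List[int]:
    """Return the empty-block count from each profiler warning, in log order."""
    return [int(m.group(1)) for m in EMPTY_BLOCKS.finditer(log_text)]


sample = """\
[Wed Oct 22 18:58:35 2014] main profiler warning: 1 empty blocks in the buffer
[Wed Oct 22 18:58:36 2014] main profiler warning: 0 empty blocks in the buffer
GPS time jumped from 1098064730 to 1098064731
"""
counts = empty_block_counts(sample)
print(counts)       # [1, 0]
print(min(counts))  # 0
```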
Reducing the load:
The following configuration changes were applied on fb.
Edited /etc/conf.d/nfs to reduce the number of nfsd processes from 8 to 1:
Ran "ionice" to raise the priority of the framebuilder process (daqd):
controls@fb /opt/rtcds/rtscore/trunk/src/daqd 0$ sudo ionice -c 1 -p 10964
And to reduce the priority of the nfsd process:
controls@fb /opt/rtcds/rtscore/trunk/src/daqd 0$ sudo ionice -c 2 -p 11198
I also tried punishing nfsd with an even lower priority ("-c 3"), but that was causing the workstations to lag noticeably.
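For bookkeeping, the class numbers passed to `ionice -c` are 1 (realtime), 2 (best-effort) and 3 (idle). The sketch below only builds the command lines used above so the intent is readable; the helper name is hypothetical, and the PIDs are the ones from this session and will differ after a restart.

```python
from typing import List

# Scheduling classes accepted by `ionice -c`.
IO_CLASSES = {"realtime": 1, "best-effort": 2, "idle": 3}


def ionice_command(pid: int, io_class: str) -> List[str]:
    """Build the argument list that sets the I/O scheduling class of `pid`."""
    return ["sudo", "ionice", "-c", str(IO_CLASSES[io_class]), "-p", str(pid)]


print(" ".join(ionice_command(10964, "realtime")))     # daqd
print(" ".join(ionice_command(11198, "best-effort")))  # nfsd
```

The resulting lists can be passed to subprocess.run if the change needs to be scripted.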
After these changes the %wa value went from ~60% to ~20%, and daqd seems to die less often, but some further throttling may still be in order.
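To track the %wa figure without watching top, the same quantity can be estimated from two samples of the aggregate "cpu" line in /proc/stat. A minimal sketch, assuming the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal); the function name is hypothetical and the sample numbers are made up to give a round result.

```python
def iowait_percent(before: str, after: str) -> float:
    """Percent of CPU time spent in iowait between two /proc/stat 'cpu' lines."""
    a = [int(x) for x in before.split()[1:9]]  # first 8 counters after the label
    b = [int(x) for x in after.split()[1:9]]
    delta = [y - x for x, y in zip(a, b)]
    total = sum(delta)
    return 100.0 * delta[4] / total if total else 0.0  # index 4 is iowait


# Hypothetical samples taken a few seconds apart.
t0 = "cpu  100 0 100 800 0 0 0 0"
t1 = "cpu  200 0 200 1000 600 0 0 0"
print(iowait_percent(t0, t1))  # 60.0
```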
Here's another useful link: