I compared the fb1 in main network with the cloned fb1 and I found a crucial difference. The main fb1 where cds is running fine as mx devices mounted in /dev/ like mx0, mx1 upto mx7, mxctlm mxctlp, mxp0, mxp1 upto mxp7. The cloned fb does not have any of these mx devices mounted. I think this is where the issue was coming in from.
However, lspci | grep 'Myri' shows following output on both computers:
controls@fb1:/dev 0$ lspci | grep 'Myri'
02:00.0 Ethernet controller: MYRICOM Inc. Myri-10G Dual-Protocol NIC (rev 01)
Which means that the computer detects the card on PCie slot.
I tried to add this to /etc/rc.local to run this script at every boot, but it did not work. So for now, I'll just manually do this step everytime. Once the devices are loaded, we get:
controls@fb1:/etc 0$ ls /dev/*mx*
/dev/mx0 /dev/mx4 /dev/mxctl /dev/mxp2 /dev/mxp6 /dev/ptmx
/dev/mx1 /dev/mx5 /dev/mxctlp /dev/mxp3 /dev/mxp7
/dev/mx2 /dev/mx6 /dev/mxp0 /dev/mxp4 /dev/open-mx
/dev/mx3 /dev/mx7 /dev/mxp1 /dev/mxp5 /dev/open-mx-raw
The, restarting all daqd_ processes, I found that daqd_dc was running succesfully now. Here is the status:
controls@fb1:/etc 0$ sudo systemctl status daqd_* -l
● daqd_dc.service - Advanced LIGO RTS daqd data concentrator
Loaded: loaded (/etc/systemd/system/daqd_dc.service; enabled)
Active: active (running) since Mon 2021-10-11 17:48:00 PDT; 23min ago
Main PID: 2308 (daqd_dc_mx)
CGroup: /daqd.slice/daqd_dc.service
├─2308 /usr/bin/daqd_dc_mx -c /opt/rtcds/caltech/c1/target/daqd/daqdrc.dc
└─2370 caRepeater
Oct 11 17:48:07 fb1 daqd_dc_mx[2308]: mx receiver 006 thread priority error Operation not permitted[Mon Oct 11 17:48:06 2021]
Oct 11 17:48:07 fb1 daqd_dc_mx[2308]: mx receiver 005 thread put on CPU 0
Oct 11 17:48:07 fb1 daqd_dc_mx[2308]: [Mon Oct 11 17:48:06 2021] [Mon Oct 11 17:48:06 2021] mx receiver 006 thread put on CPU 0
Oct 11 17:48:07 fb1 daqd_dc_mx[2308]: mx receiver 007 thread put on CPU 0
Oct 11 17:48:07 fb1 daqd_dc_mx[2308]: [Mon Oct 11 17:48:06 2021] mx receiver 003 thread - label dqmx003 pid=2362
Oct 11 17:48:07 fb1 daqd_dc_mx[2308]: [Mon Oct 11 17:48:06 2021] mx receiver 003 thread priority error Operation not permitted
Oct 11 17:48:07 fb1 daqd_dc_mx[2308]: [Mon Oct 11 17:48:06 2021] mx receiver 003 thread put on CPU 0
Oct 11 17:48:07 fb1 daqd_dc_mx[2308]: warning:regcache incompatible with malloc
Oct 11 17:48:07 fb1 daqd_dc_mx[2308]: [Mon Oct 11 17:48:06 2021] EDCU has 410 channels configured; first=0
Oct 11 17:49:06 fb1 daqd_dc_mx[2308]: [Mon Oct 11 17:49:06 2021] ->4: clear crc
● daqd_fw.service - Advanced LIGO RTS daqd frame writer
Loaded: loaded (/etc/systemd/system/daqd_fw.service; enabled)
Active: active (running) since Mon 2021-10-11 17:48:01 PDT; 23min ago
Main PID: 2318 (daqd_fw)
CGroup: /daqd.slice/daqd_fw.service
└─2318 /usr/bin/daqd_fw -c /opt/rtcds/caltech/c1/target/daqd/daqdrc.fw
Oct 11 17:48:09 fb1 daqd_fw[2318]: [Mon Oct 11 17:48:09 2021] [Mon Oct 11 17:48:09 2021] Producer thread - label dqproddbg pid=2440
Oct 11 17:48:09 fb1 daqd_fw[2318]: Producer crc thread priority error Operation not permitted
Oct 11 17:48:09 fb1 daqd_fw[2318]: [Mon Oct 11 17:48:09 2021] [Mon Oct 11 17:48:09 2021] Producer crc thread put on CPU 0
Oct 11 17:48:09 fb1 daqd_fw[2318]: Producer thread priority error Operation not permitted
Oct 11 17:48:09 fb1 daqd_fw[2318]: [Mon Oct 11 17:48:09 2021] Producer thread put on CPU 0
Oct 11 17:48:09 fb1 daqd_fw[2318]: [Mon Oct 11 17:48:09 2021] Producer thread - label dqprod pid=2434
Oct 11 17:48:09 fb1 daqd_fw[2318]: [Mon Oct 11 17:48:09 2021] Producer thread priority error Operation not permitted
Oct 11 17:48:09 fb1 daqd_fw[2318]: [Mon Oct 11 17:48:09 2021] Producer thread put on CPU 0
Oct 11 17:48:10 fb1 daqd_fw[2318]: [Mon Oct 11 17:48:10 2021] Minute trender made GPS time correction; gps=1318034906; gps%60=26
Oct 11 17:49:09 fb1 daqd_fw[2318]: [Mon Oct 11 17:49:09 2021] ->3: clear crc
● daqd_rcv.service - Advanced LIGO RTS daqd testpoint receiver
Loaded: loaded (/etc/systemd/system/daqd_rcv.service; enabled)
Active: active (running) since Mon 2021-10-11 17:48:00 PDT; 23min ago
Main PID: 2311 (daqd_rcv)
CGroup: /daqd.slice/daqd_rcv.service
└─2311 /usr/bin/daqd_rcv -c /opt/rtcds/caltech/c1/target/daqd/daqdrc.rcv
Oct 11 17:50:21 fb1 daqd_rcv[2311]: Creating C1:DAQ-NDS0_C1X07_CRC_SUM
Oct 11 17:50:21 fb1 daqd_rcv[2311]: Creating C1:DAQ-NDS0_C1BHD_STATUS
Oct 11 17:50:21 fb1 daqd_rcv[2311]: Creating C1:DAQ-NDS0_C1BHD_CRC_CPS
Oct 11 17:50:21 fb1 daqd_rcv[2311]: Creating C1:DAQ-NDS0_C1BHD_CRC_SUM
Oct 11 17:50:21 fb1 daqd_rcv[2311]: Creating C1:DAQ-NDS0_C1SU2_STATUS
Oct 11 17:50:21 fb1 daqd_rcv[2311]: Creating C1:DAQ-NDS0_C1SU2_CRC_CPS
Oct 11 17:50:21 fb1 daqd_rcv[2311]: Creating C1:DAQ-NDS0_C1SU2_CRC_SUM
Oct 11 17:50:21 fb1 daqd_rcv[2311]: Creating C1:DAQ-NDS0_C1OM[Mon Oct 11 17:50:21 2021] Epics server started
Oct 11 17:50:24 fb1 daqd_rcv[2311]: [Mon Oct 11 17:50:24 2021] Minute trender made GPS time correction; gps=1318035040; gps%120=40
Oct 11 17:51:21 fb1 daqd_rcv[2311]: [Mon Oct 11 17:51:21 2021] ->3: clear crc
Now, even before starting teh FE models, I see DC status as ox2bad in the CDS screens of the IOP and user models. The mx_stream service remains in a failed state at teh FE machines and remain the same even after restarting the service.
controls@c1sus2:~ 0$ sudo systemctl status mx_stream -l
● mx_stream.service - Advanced LIGO RTS front end mx stream
Loaded: loaded (/etc/systemd/system/mx_stream.service; enabled)
Active: failed (Result: exit-code) since Mon 2021-10-11 17:50:26 PDT; 15min ago
Process: 382 ExecStart=/etc/mx_stream_exec (code=exited, status=1/FAILURE)
Main PID: 382 (code=exited, status=1/FAILURE)
Oct 11 17:50:25 c1sus2 systemd[1]: Starting Advanced LIGO RTS front end mx stream...
Oct 11 17:50:25 c1sus2 systemd[1]: Started Advanced LIGO RTS front end mx stream.
Oct 11 17:50:25 c1sus2 mx_stream_exec[382]: Failed to open endpoint Not initialized
Oct 11 17:50:26 c1sus2 systemd[1]: mx_stream.service: main process exited, code=exited, status=1/FAILURE
Oct 11 17:50:26 c1sus2 systemd[1]: Unit mx_stream.service entered failed state.
But if I restart the mx_stream service before starting the rtcds models, the mx-stream service starts succesfully:
controls@c1sus2:~ 0$ sudo systemctl restart mx_stream
controls@c1sus2:~ 0$ sudo systemctl status mx_stream -l
● mx_stream.service - Advanced LIGO RTS front end mx stream
Loaded: loaded (/etc/systemd/system/mx_stream.service; enabled)
Active: active (running) since Mon 2021-10-11 18:14:13 PDT; 25s ago
Main PID: 1337 (mx_stream)
CGroup: /system.slice/mx_stream.service
└─1337 /usr/bin/mx_stream -e 0 -r 0 -w 0 -W 0 -s c1x07 c1su2 -d fb1:0
Oct 11 18:14:13 c1sus2 systemd[1]: Starting Advanced LIGO RTS front end mx stream...
Oct 11 18:14:13 c1sus2 systemd[1]: Started Advanced LIGO RTS front end mx stream.
Oct 11 18:14:13 c1sus2 mx_stream_exec[1337]: send len = 263596
Oct 11 18:14:13 c1sus2 mx_stream_exec[1337]: Connection Made
However, the DC status on CDS screens still show 0x2bad. As soon as I start the rtcds model c1x07 (the IOP model for c1sus2), the mx_stream service fails:
controls@c1sus2:~ 0$ sudo systemctl status mx_stream -l
● mx_stream.service - Advanced LIGO RTS front end mx stream
Loaded: loaded (/etc/systemd/system/mx_stream.service; enabled)
Active: failed (Result: exit-code) since Mon 2021-10-11 18:18:03 PDT; 27s ago
Process: 1337 ExecStart=/etc/mx_stream_exec (code=exited, status=1/FAILURE)
Main PID: 1337 (code=exited, status=1/FAILURE)
Oct 11 18:14:13 c1sus2 systemd[1]: Starting Advanced LIGO RTS front end mx stream...
Oct 11 18:14:13 c1sus2 systemd[1]: Started Advanced LIGO RTS front end mx stream.
Oct 11 18:14:13 c1sus2 mx_stream_exec[1337]: send len = 263596
Oct 11 18:14:13 c1sus2 mx_stream_exec[1337]: Connection Made
Oct 11 18:18:03 c1sus2 mx_stream_exec[1337]: isendxxx failed with status Remote Endpoint Unreachable
Oct 11 18:18:03 c1sus2 mx_stream_exec[1337]: disconnected from the sender
Oct 11 18:18:03 c1sus2 mx_stream_exec[1337]: c1x07_daq mmapped address is 0x7fe3620c3000
Oct 11 18:18:03 c1sus2 mx_stream_exec[1337]: c1su2_daq mmapped address is 0x7fe35e0c3000
Oct 11 18:18:03 c1sus2 systemd[1]: mx_stream.service: main process exited, code=exited, status=1/FAILURE
Oct 11 18:18:03 c1sus2 systemd[1]: Unit mx_stream.service entered failed state.
This shows that the start of rtcds model, causes the fail in mx_stream, possibly due to inability of finding the endpoint on fb1. I've again reached to the edge of my knowledge here. Maybe the fiber optic connection between fb and the network switch that connects to FE is bad, or the connection between switch and FEs is bad.
But we are just one step away from making this work.
|