Debugging the CalibrationDbRemote Devices

The so-called calibrationDbRemote devices are the access points to the calibration database, both for the online and the offline services. Constants are retrieved and injected through them using ZMQ requests. The devices themselves run in a Karabo environment.

For online calibration, see the procedures at General Online Calibration Troubleshooting for how to identify and debug problems with these devices.

The offline devices run under the xcal account on max-exfl016 in a Karabo environment installed in the /scratch/xcal/karabo folder. This installation is maintained by ITDM. Note that the folder is only accessible from max-exfl016; it is not shared across Maxwell hosts.

Inspecting Logs

To inspect logs, log in as xcal on max-exfl016 and tail the log file of the device server:

ssh xcal@max-exfl016
tail -n 500 -f /scratch/xcal/karabo/var/log/pythonserver_max_web_rcal_0/current

This should produce output similar to:

INFO  MAX_WEB_DATA/DM/CAL_REMOTE_5  : Waiting on data
INFO  MAX_WEB_DATA/DM/CAL_REMOTE_5  : Saved file cal.1567178413.0473201.h5 at xfel/cal/agipd-type/agipd_siv1_agipdv11_m400/
INFO  MAX_WEB_DATA/DM/CAL_REMOTE_5  : Start writing to Calibration Catalogue...
INFO  MAX_WEB_DATA/DM/CAL_REMOTE_5  : Finnish writing to Calibration Catalogue...
INFO  MAX_WEB_DATA/DM/CAL_REMOTE_5  : Registered cal.1567178413.0473201.h5 successfully

for when parameters are being written to the database, or:

INFO  MAX_WEB_DATA/DM/CAL_REMOTE_0  : Waiting on data
INFO  MAX_WEB_DATA/DM/CAL_REMOTE_0  : Start searching in Calibration Catalogue...
INFO  MAX_WEB_DATA/DM/CAL_REMOTE_0  : Got calibration: Offset (AGIPD1M1), meta_only: True
INFO  MAX_WEB_DATA/DM/CAL_REMOTE_0  : Waiting on data
INFO  MAX_WEB_DATA/DM/CAL_REMOTE_0  : Start searching in Calibration Catalogue...
INFO  MAX_WEB_DATA/DM/CAL_REMOTE_0  : Got calibration: BadPixelsDarkCCD (FastCCD1), meta_only: False

when retrieving parameters from the database.
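
To get a quick overview of which devices are currently reading from or writing to the catalogue, the log lines can be summarised per device. The following is a minimal sketch, not part of the installation: the function name summarise_activity is hypothetical, and it assumes the device name is the second whitespace-separated field, as in the sample output above. If the real log lines carry an additional prefix (for example a timestamp), the field index has to be adjusted.

```shell
# Summarise catalogue activity per device from log lines on stdin.
# Prints one line per device: <device> reads=<n> writes=<n>
# Assumption: the device name is the second field of each log line.
summarise_activity() {
    awk '
        /Got calibration:/           { reads[$2]++;  seen[$2] = 1 }
        /Registered .* successfully/ { writes[$2]++; seen[$2] = 1 }
        END {
            for (d in seen)
                printf "%s reads=%d writes=%d\n", d, reads[d], writes[d]
        }
    ' | sort
}

# Usage on the log host:
#   tail -n 500 /scratch/xcal/karabo/var/log/pythonserver_max_web_rcal_0/current | summarise_activity
```

Devices that only logged "Waiting on data" in the inspected window do not appear in the summary.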

There are a total of 30 devices acting as access points. Most calibration routines pick one of these at random and retry with another one if it is not available.
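
The pick-at-random-and-retry behaviour can be illustrated with a short shell sketch. This is not the code used by the calibration routines, which talk to the devices via ZMQ. The function query_device is a hypothetical stand-in for the real client call, and the numbering 0 to 29 of the device names is an assumption based on the names seen in the logs.

```shell
# Hypothetical stand-in for the real client call: replace with an actual
# request. Must return 0 on success and non-zero if the device is unavailable.
query_device() {
    echo "querying $1" >&2
    return 1
}

# Print the indices 0..n-1 in random order, one per line.
shuffled_indices() {
    awk -v n="$1" 'BEGIN {
        srand()
        for (i = 0; i < n; i++) idx[i] = i
        for (i = n - 1; i > 0; i--) {      # Fisher-Yates shuffle
            j = int(rand() * (i + 1))
            t = idx[i]; idx[i] = idx[j]; idx[j] = t
        }
        for (i = 0; i < n; i++) print idx[i]
    }'
}

# Try devices in random order until one answers or max_tries is reached.
# Prints the name of the device that answered and returns 0, else returns 1.
query_any_device() {
    n_devices=$1
    max_tries=$2
    tries=0
    for i in $(shuffled_indices "$n_devices"); do
        [ "$tries" -ge "$max_tries" ] && break
        tries=$((tries + 1))
        # Assumed naming scheme: CAL_REMOTE_0 .. CAL_REMOTE_<n-1>
        if query_device "MAX_WEB_DATA/DM/CAL_REMOTE_$i"; then
            echo "MAX_WEB_DATA/DM/CAL_REMOTE_$i"
            return 0
        fi
    done
    return 1
}
```

Shuffling the indices once, rather than drawing a new random number on every attempt, guarantees that a retry never contacts a device that has already failed.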

Error Scenarios

The following error scenarios have been known to occur:

  • Calibration database devices do not respond to queries. This can be caused by a temporary overload of the server, where the processes running the access points cannot allocate sufficient memory. It should no longer happen if the meta_only retrieval mode is used for larger parameters, such as those of AGIPD and LPD. The symptom is that the log output stops updating even though querying jobs have been launched.

    It can be solved by restarting the calibrationDbRemote server:

    source /scratch/xcal/karabo/activate
    karabo-kill -k pythonserver_max_web_rcal_0
    

    Note that this might lead to errors in calibration jobs that are in flight when the restart occurs. All devices will autostart after the server restart. It takes about 30 seconds until the system is running again.

    To diagnose whether the problems were caused by requesting large constants without using meta_only mode, check memory consumption via htop:

    htop

    and ensure that no swapping has occurred.

    Fig. 20 Example output of htop command. The swap bar should be near zero.

  • An unusually high number of requests for the same parameters can occur if ZMQ requests are made with improper timeout error handling. This shows up as recurring log entries requesting the same parameters. Current production code sets sufficiently long timeouts, but legacy code is known to run into this scenario. To fix it, first check whether you can find processes that might be responsible for the requests:

    squeue -u xcal
    

    and stop these, e.g. by using scancel. Then restart the calibration database devices as described above. Make sure to increase the timeouts in the calling code before initiating new requests.

  • Reports show a `Version already taken` error. This is not really a failure scenario, but rather expected behaviour: a parameter with exactly the same start date and operating conditions has previously been injected into the calibration database, and the database refuses to overwrite it. Check that you are characterizing the correct data with the correct conditions.
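
The symptoms described in the first two scenarios can also be checked from the command line. The helpers below are a sketch under stated assumptions, not tools shipped with the installation: the function names is_stale, swap_used_kb and top_requests are hypothetical, the swap check relies on the SwapTotal and SwapFree fields of the Linux /proc/meminfo file, and the request count relies on the "Got calibration: ..., meta_only: ..." log format shown earlier.

```shell
# Succeeds if the file has not been modified for more than the given
# number of minutes, i.e. the log has stopped updating.
is_stale() {
    [ -n "$(find "$1" -mmin +"$2")" ]
}

# Swap currently in use, in kB, from /proc/meminfo-style input on stdin.
swap_used_kb() {
    awk '/^SwapTotal:/ { total = $2 }
         /^SwapFree:/  { free = $2 }
         END { print total - free }'
}

# Most frequently retrieved constants in log lines on stdin.
# Prints "<count> <constant> (<detector>)"; the argument limits the
# number of lines shown (default 5).
top_requests() {
    sed -n 's/.*Got calibration: \(.*\), meta_only:.*/\1/p' |
        sort | uniq -c | sort -rn | head -n "${1:-5}" |
        awk '{ $1 = $1; print }'   # normalise the padding added by uniq -c
}

# Usage on the log host:
#   LOG=/scratch/xcal/karabo/var/log/pythonserver_max_web_rcal_0/current
#   is_stale "$LOG" 10 && echo "log has not been updated for 10 minutes"
#   swap_used_kb < /proc/meminfo
#   tail -n 500 "$LOG" | top_requests
```

A value near zero from swap_used_kb corresponds to the empty swap bar in htop. A single constant dominating the output of top_requests points to the repeated-request scenario.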