Development Workflow

The following walkthrough will guide you through a possible workflow when developing new notebooks for offline calibration.

Fresh Start

If you are starting a notebook from scratch, you should first consider a few points:

  • Will the notebook perform a headless task, or will it also serve as an important interface for evaluating the results in the form of a report?
  • Do you need to run concurrently? Is concurrency handled internally, e.g. by use of ipcluster, or also on a host level, using cluster computing via Slurm?

If you plan on using the notebook as a report tool, make sure to provide sufficient guidance and textual detail, e.g. using Markdown cells in the notebook. You should also structure it into appropriate subsections.

If you plan on running concurrently on the cluster, identify which variable should be mapped to concurrent runs. For the tool chain to fill it in automatically, this variable needs to be a list of integers.
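For example, such a variable in the first code cell could look like this (modules here is just an illustration; the full first-cell example further below uses the same name):

modules = [0] # modules to work on, required, range allowed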

Once you’ve clarified the above points, create a new notebook, either in an existing detector folder or, for a detector that is not yet integrated, in a new folder with the detector’s name. Give it a suffix _NBC to denote that it is enabled for the tool chain.

You should then start writing your code following the guidelines below.

From Existing Notebook

Copy your existing notebook into the appropriate detector directory, or create a new directory if the detector does not exist yet. Give the copy a suffix _NBC to denote that it is enabled for the tool chain.

You should then start restructuring your code following the guidelines below.

Title and Author Information

Especially for report generation, the notebook should have a proper title, author, and version. These should be given in a leading Markdown cell in the form:

# My Fancy Calculation #

Author: Jane Doe, Version 0.1

A description of the notebook.

Information in this format allows the author and version to be parsed automatically.

Exposing Parameters to the Command Line

The European XFEL Offline Calibration toolkit automatically deduces command line arguments for Jupyter notebooks. It does this with an extended version of nbparameterise, originally written by Thomas Kluyver.

Parameter deduction tries to parse all variables defined in the first code cell of a notebook. The following variable types are supported:

  • numbers: ints and floats
  • Booleans
  • strings
  • lists of any of the above

You should avoid having import statements in this cell. Line comments can be used to define the help text provided by the command line interface, and to signify whether lists can be constructed from ranges and whether parameters are required:

in_folder = '/gpfs/exfel/exp/SPB/201830/p900019/raw' # path to input data, required
modules = [0] # modules to work on, required, range allowed
out_folder = "/gpfs/exfel/exp/SPB/201830/p900019/proc/calibration0618/FF" # path to output to, required
runs = [820,] # runs to use, required, range allowed
sequences = [0,1,2,3,4] # sequences files to use, range allowed
cluster_profile = "noDB" # The ipcluster profile to use
local_output = False # output constants locally

Here, in_folder and out_folder are required string values. Values for required parameters have to be given when executing from the command line; any defaults given in the first code cell are ignored (they are only used to derive the type of the parameter). The modules parameter is a list, which from the command line could also be assigned using a range expression, e.g. 5-10,12,13,18-21, which would translate to 5,6,7,8,9,12,13,18,19,20. It is also a required parameter. The local_output parameter is a Boolean: passing the corresponding argument on the command line changes it from False to True, and there is no way to change it from True to False from the command line.
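As an illustration, a command-line invocation of such a notebook might look like the following (using the AGIPD flat-field notebook this excerpt is taken from; the run and module values are arbitrary examples):

% xfel-calibrate AGIPD FF --in-folder /gpfs/exfel/exp/SPB/201830/p900019/raw \
      --out-folder /gpfs/exfel/exp/SPB/201830/p900019/proc/calibration0618/FF \
      --runs 820 --modules 5-10,12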

The cluster_profile parameter is a bit special, in that the tool kit expects exactly this name to provide the profile name of the ipcluster being run. Hence, if you use ipcluster for parallelisation, define your profile name in this variable.
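Inside the notebook, the profile name can then be used to connect to the cluster. A minimal sketch using ipyparallel might look like this (the function and its body are illustrative placeholders, not part of the tool kit):

from ipyparallel import Client

# Connect to the running ipcluster using the profile exposed as a parameter
client = Client(profile=cluster_profile)
view = client[:]  # a direct view on all available engines

def process_module(module):
    # hypothetical per-module work, just for illustration
    return module

results = view.map_sync(process_module, modules)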

The first-cell excerpt above is from a flat field characterization notebook for AGIPD. That code would lead to the following parameters being exposed via the command line:

% xfel-calibrate AGIPD FF --help
usage: xfel-calibrate.py [-h] --in-folder str [--modules str [str ...]]
                        --out-folder str --runs str [str ...]
                        [--sequences str [str ...]] [--cluster-profile str]
                        [--local-output] [--db-output] [--bias-voltage int]
                        [--cal-db-interface str] [--mem-cells int]
                        [--interlaced] [--fit-hook] [--rawversion int]
                        [--instrument str] [--photon-energy float]
                        [--offset-store str] [--high-res-badpix-3d]
                        [--db-input] [--deviation-threshold float]
                        DETECTOR TYPE

Main entry point for offline calibration

positional arguments:
  DETECTOR              The detector to calibrate
  TYPE                  Type of calibration: LPD,AGIPD

optional arguments:
  -h, --help            show this help message and exit
  --no-cluster-job      Do not run as a cluster job
  --report-to str       Filename (and optionally path) for output report
  --modules str [str ...]
                        modules to work on, required, range allowed.
                        Default: [0]
  --sequences str [str ...]
                        sequences files to use, range allowed.
                        Default: [0, 1, 2, 3, 4]
  --cluster-profile str
                        The ipcluster profile to use. Default: noDB

  --local-output        output constants locally. Default: False

...

Note

nbparameterise can only parse the mentioned subset of variable types. An expression that evaluates to such a type will not be recognized: e.g. a = list(range(3)) will not work!
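That is, parameters must be given as literals:

a = list(range(3))  # not recognised: an expression, even though it evaluates to a list
a = [0, 1, 2]       # recognised: a literal list of integers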

The following table contains a list of suggested names for certain parameters, helping to keep naming consistent across all notebooks.

Parameter name      To be used for                                                      Special purpose
in_folder           the input path the data resides in, usually without a run number
out_folder          the path to write data out to, usually without a run number        reports can be placed here
run(s)              which XFEL DAQ runs to use, often ranges are allowed
modules             the modules of a segmented detector, ranges are often allowed
sequences           sequence files of the XFEL DAQ system, ranges are often allowed
cluster_profile     name of the cluster profile for ipcluster                          fixed name
local_input         read calibration constants from file, not the database
local_output        write calibration constants to file, not the database
db_input            read calibration constants from the database, not file
db_output           write calibration constants to the database, not file
cal_db_interface    the calibration database host, in the form “tcp://host:port”

Best Coding Practices

In principle there are no restrictions other than that parameters exposed to the command line need to be defined in the first code cell of the notebook.

However, a few guidelines should be observed to make notebooks useful for display as reports and for usage by others.

External Libraries

You may use a wide variety of libraries available in Python, but keep in mind that others wanting to run the tool will need to install these requirements as well. Thus,

  • Do not use a specialized tool if an accepted alternative exists. Plots, for example, should usually be created using matplotlib, and numerical processing should be done in numpy.
  • Keep runtime and library requirements in mind. A library doing its own parallelism either needs to be able to set this up programmatically, or do so automatically. If you need to start something from the command line first, things might be tricky, as you will likely need to run it via subprocess.Popen calls with appropriate environment variables.
  • Reading out RAW data should be done using extra_data. It helps in accessing the HDF5 data structures efficiently, reduces the complexity of accessing the RAW or CORRECTED datasets, and provides different methods to select and filter the trains, cells, or pixels of interest, as sketched below.
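A minimal sketch of reading data with extra_data, assuming the in_folder and runs parameters from above (the detector source name is a hypothetical example and depends on your instrument):

from extra_data import RunDirectory

# Open a run directory; XFEL run folders follow the rNNNN convention
run = RunDirectory(f"{in_folder}/r{runs[0]:04d}")

# Read one key of an illustrative detector source into an array
source = 'SPB_DET_AGIPD1M-1/DET/0CH0:xtdf'  # hypothetical source name
data = run.get_array(source, 'image.data')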

Writing out data

If your notebook produces output data, consider writing it out as early as possible, so that it becomes available quickly. Detailed plotting and inspection can be done later on in the notebook.

Also, use HDF5 via h5py as your output format. If you correct or calibrate input data which adheres to the XFEL naming convention, you should maintain the convention in your output data. You should not touch any data that you do not actively work on, and you should ensure that the INDEX and identifier entries are synchronized with respect to your output data. E.g. if you remove pulses from a train, the INDEX/…/count section should reflect this.
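A rough sketch of such output, assuming corrected and counts are numpy arrays computed earlier in the notebook (the file and dataset paths are simplified placeholders, not the full XFEL layout):

import h5py

with h5py.File(f"{out_folder}/corrected.h5", "w") as f:
    # the data you actively worked on
    f.create_dataset("INSTRUMENT/MY_DET/image/data", data=corrected)
    # keep INDEX synchronized: if pulses were removed from a train,
    # the per-train counts written here must reflect that
    f.create_dataset("INDEX/MY_DET/image/count", data=counts)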

Plotting

When creating plots, make sure that the plot is either self-explanatory or accompanied by Markdown cells with an adequate description. Do not add “free-floating” plots; always put them into a context. Make sure to label your axes.

Also make sure the plots are readable on an A4-sized PDF page; this is the format the notebook will be rendered to for report output. Specifically, this means that figure sizes should not exceed approximately 15x15 inches.

The report will contain 150 dpi PNG images of your plots. If you need higher-quality output of individual plots, you should save these separately yourself, e.g. via fig.savefig(…).
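A short sketch of a plot following these guidelines, using dummy data for illustration:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)             # dummy data
fig, ax = plt.subplots(figsize=(8, 5))  # comfortably below the ~15x15 inch limit
ax.plot(x, np.sin(x), label="signal")
ax.set_xlabel("Time (s)")               # always label your axes
ax.set_ylabel("Amplitude (a.u.)")
ax.legend()

# optionally keep a higher-quality copy; the report itself uses 150 dpi PNGs
fig.savefig("signal.pdf")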

Calibration Database Interaction

Tasks which require calibration constants, or produce such constants, should do so by interacting with the European XFEL calibration database.

In terms of development workflow it is usually easier to work with file-based I/O first and only switch over to the database after the algorithmic part of the notebook has matured. Reasons for this include:

  • to develop against the database, new constants will first have to be integrated into it
  • if the parameters a constant depends on change a lot during early development, these updates will always have to be propagated to the database manually
  • database access is limited to the XFEL networks, making offline development more difficult.

Once a stable point is reached, database access can be enabled according to the iCalibrationDB documentation.
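A rough sketch of this file-first pattern, assuming a constant const computed earlier in the notebook and the local_output/db_output parameters suggested in the table above (the file name is a placeholder):

import h5py

if local_output:
    # during development: write the constant to a local file
    with h5py.File(f"{out_folder}/my_constant.h5", "w") as f:
        f["data"] = const

if db_output:
    # once matured: send the constant to the calibration database
    # via iCalibrationDB (see its documentation)
    ...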

Testing

The most important test is that your notebook completes flawlessly outside any special tool chain feature. After all, the tool chain will only replace parameters, launch a concurrent job, and generate a report out of the notebook. If it fails to run in the normal Jupyter notebook environment, it will certainly fail in the tool chain environment.
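One way to check this headlessly is, for example, to execute the notebook with nbconvert:

% jupyter nbconvert --to notebook --execute my_notebook_NBC.ipynb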

Once you are satisfied with your current state of initial development, you can add it to the list of notebooks as mentioned in the Configuration section.

Any changes you now make in the notebook will be automatically propagated to the command line. Specifically, you should verify that all arguments are parsed correctly, e.g. by calling:

xfel-calibrate DETECTOR NOTEBOOK_TYPE --help

Further checks include whether parallel Slurm jobs are executed correctly and whether a report is generated at the end.

Finally, you should verify that the report contains the information you’d like to convey and is intelligible to people other than you.

Note

You can run the xfel-calibrate command without starting a Slurm cluster job, giving you direct access to console output, by adding the --no-cluster-job option.
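For example (any required parameters still have to be supplied):

% xfel-calibrate DETECTOR NOTEBOOK_TYPE --no-cluster-job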

Documenting

Most documentation should be done in the notebook itself. Any notebooks specified in the notebook.py file will automatically show up in the Available Notebooks section of this documentation.