Development Workflow

We welcome contributions to the pipeline if you have calibration notebooks or algorithms that you believe could be useful. To facilitate development, this section outlines the key points to consider when developing new features. It is designed to guide you through the development and review process and to ensure that your contributions are consistent with the pipeline's requirements. If you have any questions or concerns regarding the development process, please do not hesitate to reach out to us for assistance. We look forward to working with you to enhance the pipeline's capabilities.

Developing a notebook from scratch

Developing a notebook from scratch can be a challenging but rewarding process. Here are some key steps to consider:

  1. Define the purpose

    Start by identifying the problem you are trying to solve and the task you want to perform with your notebook.

    • Does the user need to execute the notebook interactively?
    • Should it run the same way as the production notebooks? It is recommended to execute the notebook in the same way as the production notebooks, through the xfel-calibrate CLI.
    xfel-calibrate CLI is essential

    If the xfel-calibrate CLI is essential, you need to follow the guidelines on where and how to write the variables in the first notebook cell, and on how to include the notebook as one of the CLI calibration options.

    • Does the notebook need to generate a report at the end to display its results or can it run without any user interaction?
    A report is needed

    If a report is needed, you should make sure to provide sufficient guidance and textual details using markdown cells and clear prints within the code. You should also structure the notebook cells into appropriate subsections.

  2. Plan your workflow

    Map out the steps your notebook will take, from data ingestion to analyzing results and visualization.

    • What are the required data sources that the notebook needs to access or utilize? For example, GPFS or calibration database.
    • Can the notebook's internal concurrency be optimized through the use of multiprocessing or is it necessary to employ host-level cluster computing with SLURM to achieve higher performance?
    SLURM concurrency is needed

    If SLURM concurrency is needed, you need to identify the variable on which the notebook will be replicated in order to split the processing.

    • What visualization tools or techniques are necessary to provide an overview of the processing results generated by the notebook? Can you give examples of charts, graphs, or other visual aids that would be useful for understanding the output?
  3. Write the code and include documentation

    Begin coding your notebook based on your workflow plan. Use comments to explain code blocks and decisions.

    • PEP 8 code styling is highly recommended. It leads to code that is easier to read, understand, and maintain. It is also a widely accepted standard in the Python community, and following it makes your code more accessible to other developers and improves collaboration.
    • Google style docstrings are our recommended way of documenting the code. By providing clear and concise descriptions of your functions and methods, including input and output parameters, potential exceptions, and other important details, you make it easier for other developers to understand the code, and for the mkdocs documentation to reference it. A sketch of such a docstring is shown below.
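
    As an illustration, a function documented with a Google style docstring could look like the following sketch; the function, its parameters, and the correction it applies are purely hypothetical:

        import numpy as np

        def offset_correct(image, offset, mask_threshold=1e4):
            """Apply a simple per-pixel offset correction to a detector image.

            Args:
                image (np.ndarray): Raw detector image to correct.
                offset (np.ndarray): Per-pixel offset constant, same shape as image.
                mask_threshold (float, optional): Corrected values above this
                    threshold are set to NaN. Defaults to 1e4.

            Returns:
                np.ndarray: The offset-corrected image as float32.

            Raises:
                ValueError: If image and offset have different shapes.
            """
            if image.shape != offset.shape:
                raise ValueError("image and offset must have the same shape")
            corrected = image.astype(np.float32) - offset
            corrected[corrected > mask_threshold] = np.nan
            return corrected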

  4. Document the notebook and split into sections.

    Enriching a notebook with documentation is an important step in creating a clear and easy-to-follow guide for others to use:

    • Use Markdown cells to create titles and section headings: by using Markdown cells, you can create clear and descriptive headings for each section of your notebook. This makes it easier to navigate and understand the content of the notebook and, more importantly, these headings are parsed by sphinx when creating the PDF report.
    • Add detailed explanations to each section.
    • Add comments to your code.
  5. Test and refine

    Test your notebook thoroughly to identify any issues. Refine your code and documentation as needed to ensure your notebook is accurate, efficient, and easy to use.

  6. Share and collaborate

    Share your notebook on GitLab to start seeking feedback and begin the reviewing process.

Write a notebook to execute using xfel-calibrate

To start developing a new notebook, you either create it in an existing detector directory or create a new directory for it with the new detector's name. Give it a suffix _NBC to denote that it is enabled for the tool chain.

You should then start writing your code following these guidelines:

  • The first markdown cell is reserved for the title, author, and notebook description. It is automatically parsed into the report (a sketch is shown after this list).
  • The first code cell must contain all parameters that will be exposed to the xfel-calibrate CLI.
  • The second code cell is for importing all needed libraries and methods.
  • The following code and markdown cells are for data ingestion, data processing, and data visualization. Markdown cells are very important, as they are parsed as the main source of report text and documentation after the calibration notebook is executed.
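
For illustration, the first markdown cell of a notebook could look roughly like the sketch below; the title, author, version, and description are placeholders:

    # My Detector Dark Characterization #

    Author: European XFEL Detector Group, Version: 1.0

    This notebook characterizes dark images for my detector and derives the
    offset and noise constants needed for correction.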

Exposing parameters to xfel-calibrate

The European XFEL Offline Calibration toolkit automatically deduces command line arguments from Jupyter notebooks. It does this with an extended version of nbparameterise, originally written by Thomas Kluyver.

Parameter deduction tries to parse all variables defined in the first code cell of a notebook. The following variable types are supported:

  • Numbers (INT or FLOAT)
  • Booleans
  • Strings
  • Lists of the above

You should avoid having import statements in this cell. Line comments can be used to define the help text provided by the command line interface, to signify if lists can be constructed from ranges, and to mark parameters as required:

in_folder = ""  # directory to read data from, required
out_folder = ""  # directory to output to, required
metadata_folder = ""  # directory containing calibration metadata file when run by xfel-calibrate
run = [820, ]  # runs to use, required, range allowed
sequences = [0, 1, 2, 3, 4]  # sequences files to use, range allowed
modules = [0]  # modules to work on, required, range allowed

karabo_id = "MID_DET_AGIPD1M-1"  # Detector karabo_id name
karabo_da = [""]  # a list of data aggregator names, default [-1] for selecting all data aggregators

skip_plots = False  # exit after writing corrected files and metadata

The above are some examples of parameters from the AGIPD correction notebook.

  • Here, in_folder and out_folder are set as required string values.

Values for required parameters have to be given when executing from the command line. This means that any defaults given in the first cell of the code are ignored (they are only used to derive the type of the parameter).

  • modules and sequences are lists of integers, which from the command line could also be assigned using a range expression, e.g. 5-10,12,13,18-21, which would translate to 5,6,7,8,9,12,13,18,19,20.

Warning

nbparameterise can only parse the mentioned subset of variable types. An expression that evaluates to such a type will not be recognized, e.g. a = list(range(3)) will not work!

  • karabo_id is a string value indicating the detector to be processed.
  • karabo_da is a list of strings indicating the detector's modules to be processed. As karabo_da and modules are two different variables pointing to the same physical parameter, both are synced in later notebook cells before usage (a sketch of one way to do this is shown after this list).

  • skip_plots is a boolean for skipping the notebook plots to save time and deliver the report as soon as the data are processed. To set skip_plots to False from the command line, --no-skip-plots is used.
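
A minimal sketch of how the two variables might be synced in a later cell; the "AGIPD{xx}" module naming scheme is an assumption used for illustration only:

    # Derive data aggregator names from module indices, or vice versa,
    # so that both variables describe the same set of modules.
    # The "AGIPD{xx}" naming scheme is a hypothetical example.
    if karabo_da == [""]:
        karabo_da = [f"AGIPD{i:02d}" for i in modules]
    else:
        modules = [int(da[-2:]) for da in karabo_da]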

The table below provides a set of recommended parameter names to ensure consistency across all notebooks.

Parameter name    | To be used for                                                         | Special purpose
----------------- | ---------------------------------------------------------------------- | ----------------
in_folder         | the input path the data resides in, usually without a run number      |
out_folder        | path to write data out to, usually without a run number               | reports can be placed here
metadata_folder   | directory path for the calibration metadata file with local constants |
run(s)            | which XFEL DAQ runs to use, often ranges are allowed                   |
karabo_id         | detector karabo name, used to access detector files and constants     |
karabo_da         | the detector's data aggregator names to process                        |
modules           | the detector's module indices to process, ranges often ok              |
sequences         | sequence files of the XFEL DAQ system, ranges are often ok             |
local_output      | write calibration constants to file, not to the database               |
db_output         | write calibration constants to the database, not to file               | protects the database from unintentional constant injections during development or testing

External Libraries

You may use a wide variety of libraries available in Python, but keep in mind that others wanting to run the tool will need to install these requirements as well. Therefore:

  • It is generally advisable to avoid using specialized tools or libraries unless there is a compelling reason to do so. Instead, it is often better to use well-established and widely accepted alternatives that are more likely to be familiar to other developers and easier to install and use. For example, when creating visualizations, it is recommended to use the popular and widely used matplotlib library for charts, graphs, and other visualisations. Similarly, numpy is widely used for numerical processing tasks.

  • When developing software, it is important to keep in mind the runtime and library requirements for your application. In particular, if you are using a library that performs its own parallelism, you will need to ensure that it can either set up this parallelism programmatically or do so automatically. If you need to start your application from the command line, there may be additional challenges to consider.

  • Reading EXFEL RAW data should preferably be done using extra_data. This tool is designed to facilitate efficient access to data structures stored in HDF5 format. By simplifying the process of accessing RAW or CORRECTED datasets, it allows users to quickly and easily select and filter the specific trains, cells, or pixels of interest. This greatly reduces the complexity and time required for data analysis, and enables researchers to explore and analyze large datasets more effectively (see the example after this list).
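
For example, opening a run and reading the image data of one detector source with extra_data might look like this; the run path and source name are placeholders:

    from extra_data import RunDirectory

    # Open a run directory on GPFS (placeholder proposal/run path)
    run = RunDirectory("/gpfs/exfel/exp/MID/202201/p002834/raw/r0820")

    # Restrict the selection to the sources and keys of interest
    sel = run.select("*/DET/*", "image.data")

    # Read one module's image data as a labelled xarray DataArray
    image_data = sel["MID_DET_AGIPD1M-1/DET/0CH0:xtdf", "image.data"].xarray()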

Writing out data

If your notebook produces output data, consider writing data out as early as possible, such that it is available as soon as possible. Detailed plotting and inspection can be done later on in the notebook.

Also use HDF5 via h5py as your output format. If you correct or calibrate input data which adheres to the XFEL naming convention, you should maintain that convention in your output data. You should not touch any data that you do not actively work on, and you should ensure that the INDEX and identifier entries are synchronized with respect to your output data. For example, if you remove pulses from a train, the INDEX/.../count section should reflect this. The cal_tools.files module helps you achieve this easily.
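
As a rough sketch (using plain h5py rather than cal_tools.files), writing corrected data while keeping the INDEX section consistent could look like this; all file names, dataset paths, and shapes are placeholders:

    import h5py
    import numpy as np

    corrected = np.zeros((16, 512, 128), dtype=np.float32)  # placeholder corrected images
    count = np.array([16], dtype=np.uint64)                 # pulses kept for the single train written

    # Placeholder output file name following the XFEL naming convention
    with h5py.File("CORR-R0820-AGIPD00-S00000.h5", "w") as f:
        # Corrected data, mirroring the layout of the input file
        f.create_dataset(
            "INSTRUMENT/MID_DET_AGIPD1M-1/DET/0CH0:xtdf/image/data",
            data=corrected)
        # Keep the INDEX entries in sync with what was actually written,
        # e.g. the per-train count must reflect any removed pulses
        f.create_dataset(
            "INDEX/MID_DET_AGIPD1M-1/DET/0CH0:xtdf/image/count",
            data=count)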

Plotting

When creating plots, make sure that the plot is either self-explanatory or accompanied by markdown comments with an adequate description. Do not add "free-floating" plots; always put them into a context. Make sure to label your axes.

Also make sure the plots are readable on an A4-sized PDF page; this is the format the notebook will be rendered to for report outputs. Specifically, this means that figure sizes should not exceed approx 15x15 inches.

The report will contain 150 dpi PNG images of your plots. If you need higher quality output of individual plot files you should save these separately, e.g. via fig.savefig(...) yourself.
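
A short sketch of a report-friendly plot, with a figure size well below the 15x15 inch limit, labelled axes, and an optional separately saved high-resolution copy; the plotted data are placeholders:

    import matplotlib.pyplot as plt
    import numpy as np

    signal = np.random.default_rng(42).normal(size=(128, 512))  # placeholder image

    fig, ax = plt.subplots(figsize=(8, 6))  # fits comfortably on an A4 report page
    im = ax.imshow(signal, cmap="viridis", aspect="auto")
    ax.set_xlabel("Pixel column")
    ax.set_ylabel("Pixel row")
    ax.set_title("Corrected signal (placeholder data)")
    fig.colorbar(im, ax=ax, label="Signal [ADU]")
    plt.show()

    # Optionally keep a higher-quality copy of this individual figure
    fig.savefig("corrected_signal.png", dpi=300)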

xfel-calibrate execution

The package utilizes tools such as nbconvert and nbparameterise to expose Jupyter notebooks to a command line interface. In the process reports are generated from these notebooks.

The general interface is:

% xfel-calibrate DETECTOR TYPE

where DETECTOR and TYPE specify the task to be performed.

Additionally, it leverages the DESY/XFEL Maxwell cluster to run these jobs in parallel via SLURM.

A list of available notebooks can be found in the Available Notebooks section of this documentation.
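
As a hypothetical example, a correction could be launched roughly as follows; the detector and type, folders, run number, and karabo_id are placeholders, and the flags simply mirror the notebook parameters described above:

    xfel-calibrate AGIPD CORRECT \
        --in-folder /gpfs/exfel/exp/MID/202201/p002834/raw \
        --out-folder /gpfs/exfel/exp/MID/202201/p002834/proc \
        --karabo-id MID_DET_AGIPD1M-1 \
        --run 820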

Interaction with the calibration database

During development, it is advised to work with local constant files first, before injecting any calibration constants into the production database. Once the notebook's algorithms and arguments have matured, one can switch over to the test database and then to the production database. The reason for this is to avoid injecting wrong constants that could affect production calibration, and to avoid unnecessary interventions to disable wrong or unused injected calibration constants.

Additionally, the calibration database is only accessible from within XFEL networks, so working independently of it improves the development workflow.

Testing

The most important test is that your notebook completes flawlessly outside any special tool chain feature. After all, the tool chain will only replace parameters, launch a concurrent job, and generate a report out of the notebook. If it fails to run in a normal Jupyter notebook environment, it will certainly fail in the tool chain environment.

Once you are satisfied with your current state of initial development, you can add it to the list of notebooks as mentioned in the configuration section.

Any changes you now make in the notebook will be automatically propagated to the command line. Specifically, you should verify that all arguments are parsed correctly, e.g. by calling:

  xfel-calibrate DETECTOR NOTEBOOK_TYPE --help

From then on, check whether parallel SLURM jobs are executed correctly and whether a report is generated at the end.

Finally, you should verify that the report contains the information you'd like to convey and is intelligible to people other than you.

Note

You can run the xfel-calibrate command without starting a SLURM cluster job, giving you direct access to console output, by adding the --no-cluster-job option.

Documenting

Most documentation should be done in the notebook itself. Any notebooks specified in the notebook.py file will automatically show up in the Available Notebooks section of this documentation.