3. Data files

3.1. Data policy

The data policy of European XFEL is available at https://www.xfel.eu/users/policies/index_eng.html

3.2. Data folders

On both the online cluster and the offline cluster, European XFEL data is stored in /gpfs/exfel/exp. Each proposal has its own directory, so your data will be available at a path such as:

/gpfs/exfel/exp/SPB/201830/p900022/

The raw data from each run goes in a subfolder such as raw/r0104. Once this has been migrated to the offline cluster, corrected detector data will be automatically produced in another subfolder such as proc/r0104.
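The folder layout described above can be written down as a small helper. This is a minimal sketch based only on the example paths in this section: the function name run_directory is hypothetical (it is not part of any European XFEL package), and the zero-padding widths (six digits for proposals, four for runs) are inferred from those examples.

```python
from pathlib import PurePosixPath

EXP_ROOT = PurePosixPath("/gpfs/exfel/exp")


def run_directory(instrument, cycle, proposal, run, data="raw"):
    """Build the folder for one run, e.g. .../p900022/raw/r0104.

    `data` is "raw" for raw data or "proc" for corrected data.
    """
    if data not in ("raw", "proc"):
        raise ValueError("data must be 'raw' or 'proc'")
    # Padding widths are an assumption taken from the example paths.
    return EXP_ROOT / instrument / str(cycle) / f"p{proposal:06d}" / data / f"r{run:04d}"


print(run_directory("SPB", 201830, 900022, 104))
print(run_directory("SPB", 201830, 900022, 104, data="proc"))
```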

3.3. Reading data in Python

We provide a Python package extra_data to read data from European XFEL.
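As a brief sketch of what this can look like in practice, the functions below open a run and summarise it. They assume that the extra_data package is installed and that you have access to the proposal (the defaults point at the example proposal 700000 described in section 3.7.1). Both summarise_run and open_example_run are hypothetical helper names, not part of the library.

```python
def summarise_run(run):
    """Return basic facts about an opened run.

    `run` is expected to behave like the object returned by
    extra_data.open_run() or extra_data.RunDirectory().
    """
    return {
        "n_trains": len(run.train_ids),
        "control_sources": sorted(run.control_sources),
        "instrument_sources": sorted(run.instrument_sources),
    }


def open_example_run(proposal=700000, run=1):
    # Imported here so summarise_run() can be used without extra_data.
    from extra_data import open_run
    return open_run(proposal=proposal, run=run)
```

On a machine with access to the data, summarise_run(open_example_run()) would describe the chosen run. To open a run folder by its path instead, extra_data also provides RunDirectory.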

3.4. Combining detector data from multiple modules

The pixel detectors (AGIPD and LPD) record data in separate files for each of their 16 modules.

The EXtra-data Python library can combine the detector modules into a single NumPy array; its documentation includes an example.

Alternatively, the modules can be combined in a single view as an HDF5 virtual dataset with the extra-data-make-virtual-cxi command, allowing the data to be processed by CrystFEL, for instance.

3.5. Geometry files

Geometry files specify the location of the detector modules in real space. Please contact your instrument scientist to obtain geometry files for the detector at each instrument.

EXtra-geom is a Python library used to describe the physical layout of multi-module detectors at European XFEL, and to assemble complete detector images.

One geometry file for the FXE LPD detector is available online.

One geometry file for the SPB/SFX AGIPD1M detector is available from https://cxidb.org/id-83.html (Experiment by Anton Barty).

EXtra-geom can read both file formats; see, for example, "Assembling detector data into images".

GeoAssembler can be used to create or adjust geometry files by visually moving detector quadrants around.

A more systematic provision of geometry files is in preparation.

3.6. Data format

Experimental data are taken in the context of the following categories:

  • instruments: each instrument has its own label. For each instrument, there are multiple cycles:

  • cycles: a cycle is a scheduling period in which multiple user experiments take place (of the order of months). Within each cycle, there are multiple proposals:

  • proposals: (a.k.a. beamtimes) each user experiment proposal gets a number. Within a proposal, users may carry out runs which can be logically associated with different experiments and samples:

  • runs: a new run starts when a user starts acquiring data, and lasts until that data acquisition is stopped. Within a run, there are trains, each labelled by a unique train ID:

  • train ID: there are 10 pulse trains per second. Each train can carry multiple pulses.

  • pulse ID: up to 2700 pulses per train, individually counted. The counter starts from zero for every train.
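The numbers in the list above give a quick upper bound on how many trains and pulses a run can contain. The following is a small worked example using only those two figures; the function name max_pulses and the 300-second run length are chosen purely for illustration.

```python
TRAINS_PER_SECOND = 10        # pulse trains per second
MAX_PULSES_PER_TRAIN = 2700   # upper limit; a real run may use far fewer


def max_pulses(run_seconds):
    """Return (number of trains, upper bound on pulses) for a run."""
    n_trains = TRAINS_PER_SECOND * run_seconds
    return n_trains, n_trains * MAX_PULSES_PER_TRAIN


trains, pulses = max_pulses(300)
print(trains, pulses)  # 3000 8100000
```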

We distinguish different types of data:

  • Control data has one entry for each train, even if the value changes less often than that.

  • Instrument data may have zero, one or multiple entries per train. Your main experimental results, e.g. from X-ray detectors, will usually be instrument data.

  • Run data is a superset of Control data, captured once per run.
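The difference between control data and instrument data can be illustrated with plain Python lists. The sketch below assumes that instrument data is located through a per-train index made of a first position and a count of entries, in line with the file format description in the EXtra-data docs. All names and values here are made up for illustration.

```python
train_ids = [1000, 1001, 1002, 1003]

# Control data: exactly one entry per train, so it lines up with train_ids.
motor_position = [5.0, 5.0, 5.0, 5.1]

# Instrument data: a flat list of entries plus an index per train.
frames = ["f0", "f1", "f2", "f3", "f4"]
first = [0, 2, 2, 3]   # position of the first entry belonging to each train
count = [2, 0, 1, 2]   # number of entries belonging to each train


def entries_for_train(i):
    """Return the instrument entries recorded for the i-th train."""
    return frames[first[i]:first[i] + count[i]]


for i, tid in enumerate(train_ids):
    print(tid, motor_position[i], entries_for_train(i))
```

Here the second train has no instrument entries at all, while the control value is still present for every train.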

Data is stored in HDF5 files; there may be tens to thousands of files in a single run. We aim to enable you to analyse data without needing to know the details of the file structure, e.g. by using EXtra-data in Python, or by generating a CXI file to represent a run. If you do need to read the EuXFEL HDF5 files yourself, however, the structure is described in the data files format page of the EXtra-data docs.

3.6.1. HDF5 chunking & compression

Both raw and corrected data may be stored using the HDF5 chunked layout. Some parts of the corrected data are compressed using the gzip compression filter in HDF5. In particular, detector gain stage and mask datasets compress well, saving a lot of disk space.

You can examine compression and chunk sizes using the HDFView GUI tool, our h5glance command line tool, or h5ls -v:

$ h5glance /gpfs/exfel/exp/XMPL/201750/p700000/raw/r0803/RAW-R0803-AGIPD00-S00000.h5 \
  INSTRUMENT/SPB_DET_AGIPD1M-1/DET/0CH0:xtdf/image/data
/gpfs/exfel/exp/XMPL/201750/p700000/raw/r0803/RAW-R0803-AGIPD00-S00000.h5/INSTRUMENT/SPB_DET_AGIPD1M-1/DET/0CH0:xtdf/image/data
      dtype: uint16
      shape: 16000 × 2 × 512 × 128
   maxshape: Unlimited × 2 × 512 × 128
     layout: Chunked
      chunk: 16 × 2 × 512 × 128
compression: None (options: None)
...

$ h5ls -v /gpfs/exfel/exp/XMPL/201750/p700000/raw/r0803/RAW-R0803-AGIPD00-S00000.h5/INSTRUMENT/SPB_DET_AGIPD1M-1/DET/0CH0:xtdf/image/data
Opened "/gpfs/exfel/exp/XMPL/201750/p700000/raw/r0803/RAW-R0803-AGIPD00-S00000.h5" with sec2 driver.
data                     Dataset {16000/Inf, 2/2, 512/512, 128/128}
    Location:  1:12333
    Links:     1
    Modified:  2017-11-20 04:57:44 CET
    Chunks:    {16, 2, 512, 128} 4194304 bytes
    Storage:   4194304000 logical bytes, 4194304000 allocated bytes, 100.00% utilization
    Type:      native unsigned short
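The byte counts reported by h5ls above follow directly from the dataset shape, the chunk shape and the 2-byte uint16 element size. The short calculation below reproduces them; it uses only the values from the listing.

```python
from math import prod

shape = (16000, 2, 512, 128)   # full dataset shape
chunk = (16, 2, 512, 128)      # chunk shape
itemsize = 2                   # bytes per uint16 value

chunk_bytes = prod(chunk) * itemsize
total_bytes = prod(shape) * itemsize
n_chunks = total_bytes // chunk_bytes

print(chunk_bytes)   # 4194304, matching "Chunks"
print(total_bytes)   # 4194304000, matching "Storage"
print(n_chunks)      # 1000
```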

The compressed datasets are stored with a single detector frame per chunk, to minimise the impact on analysis code reading the data.

If you observe pathologically slow reading, check whether you are accessing a compressed dataset with a chunk size larger than one frame. HDF5 decompresses an entire chunk at once, and it may be redoing this for each frame you read. You can avoid this by setting a cache size large enough to hold one complete chunk. The necessary C code looks something like this:

hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
// Set a 32 MB cache size (calculate at least the size of one chunk)
H5Pset_chunk_cache(dapl, H5D_CHUNK_CACHE_NSLOTS_DEFAULT, 32 * 1024 * 1024, 1);
hid_t h5_dataset_id = H5Dopen(h5_file_id, ".../image/gain", dapl);

To benefit from chunk caching, you need to reuse the opened dataset ID for successive reads, instead of opening and closing it to read each frame.
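In Python, a comparable setting is available through h5py, which accepts chunk cache parameters when a file is opened. The following is a minimal sketch, assuming h5py is installed. The helper names cache_bytes and open_with_chunk_cache are hypothetical, and the chunk shape in the last line is only an example.

```python
def cache_bytes(chunk_shape, itemsize, n_chunks=1):
    """Bytes needed to hold n_chunks uncompressed chunks."""
    size = itemsize * n_chunks
    for dim in chunk_shape:
        size *= dim
    return size


def open_with_chunk_cache(path, nbytes):
    # Imported lazily so cache_bytes() works without h5py installed.
    import h5py
    # rdcc_nbytes sets the chunk cache size used for datasets in this file;
    # rdcc_w0=1 mirrors the last argument of H5Pset_chunk_cache in the C code.
    return h5py.File(path, "r", rdcc_nbytes=nbytes, rdcc_w0=1)


# Example: a chunk of 16 frames of 512 x 128 two-byte values needs 2 MiB.
print(cache_bytes((16, 512, 128), 2))
```

As with the C code, keep the file and dataset objects open across successive reads so that the cache is actually reused.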

3.7. Example data

Some example datasets are available so you can try reading the files before your experiment. There may be differences, e.g. in naming, when you collect new data, so it’s a good idea to talk to the relevant instrument group and the data analysis group at European XFEL as well.

3.7.1. Example runs on Maxwell

We have prepared an environment that mimics the data layout of a real experiment at European XFEL. It is built around a fake instrument called XMPL, whose runs give an overview of the data to expect. This data is made available on Maxwell:

/gpfs/exfel/exp/XMPL/201750/p700000

It follows the same structure as every experiment (see Storage for more details), and will be used to share examples of the file formats generated at the facility, from all instruments and detectors. These datasets are also linked to the metadata catalogue (MDC), where information about the data (instrument, detector, sample, date, …) can be found. Each run dataset comprises raw data (in .../p700000/raw/run_id), calibrated data (in .../p700000/proc/run_id) and a set of sample scripts to read the data (in .../p700000/usr/run_id).

List of sample data sets:

| run id | description | date | comments |
|--------|-------------|------|----------|
| r0001 | instrument: SPB; detector: AGIPD; sample: Water | 2018-04-03 | commissioning |
| r0002 | instrument: SPB; detector: AGIPD; sample: Lysozyme (liquid) | 2018-04-03 | commissioning |
| r0003 | instrument: SPB; detector: AGIPD; sample: Lysozyme (liquid) | 2018-04-03 | commissioning |
| r0004 | instrument: SPB; detector: AGIPD; sample: Lysozyme (liquid) | 2018-04-03 | commissioning |
| r0005 | instrument: SPB; detector: AGIPD; sample: Lithium titanate | 2018-08-18 | AGIPD calibration |
| r0006 | instrument: SPB; detector: AGIPD; sample: Lithium titanate [1] | 2017-11-20 | commissioning |
| r0007 | instrument: FXE; detector: LPD; sample: aqueous solution of [Fe(bpy)3]2+ | 2017-09-18 | User Run |
| r0008 | tunnel: SA1_XTD2; device: XGM; sample: n/a | 2019-02-15 | commissioning (XPD) |
| r0009 | tunnel: SA3_XTD10; device: XGM; sample: n/a | 2019-02-15 | commissioning (XPD) |
| r0010 | instrument: SPB; detector: AGIPD; run type: calibration - dark high gain | 2019-08-10 | commissioning |
| r0011 | instrument: SPB; detector: AGIPD; run type: calibration - dark medium gain | 2019-08-10 | commissioning |
| r0012 | instrument: SPB; detector: AGIPD; run type: calibration - dark low gain | 2019-08-10 | commissioning |
| r0013 | instrument: SPB; detector: AGIPD; sample: Lysozyme | 2019-08-11 | commissioning |
| r0014 | instrument: SPB; detector: AGIPD; sample: Lysozyme | 2019-08-11 | commissioning |
| r0015 | instrument: SPB; detector: AGIPD; sample: Lysozyme | 2019-08-11 | commissioning |
| r0016 | instrument: SPB; detector: AGIPD; sample: Lysozyme | 2019-08-11 | commissioning |
| r0017 | instrument: SPB; detector: AGIPD; sample: Lysozyme | 2019-08-11 | commissioning |
| r0018 | instrument: SPB; detector: AGIPD; sample: Lysozyme | 2019-08-11 | commissioning |
| r0019 | instrument: SQS; device: digitizer; sample: Xenon | 2019-10-11 | commissioning |
| r0020 | instrument: SQS; device: digitizer; sample: Xenon | 2019-10-11 | commissioning |
| r0021 | instrument: SPB; detector: Jungfrau; sample: Lysozyme | 2019-05-05 | IRDa commissioning |
| r0022 | instrument: SPB; detector: Jungfrau; sample: Lysozyme | 2019-05-05 | IRDa commissioning |
| r0023 | instrument: SCS; detector: DSSC; sample: 2-Co8_pt14_8fold - 30nm Pt cap | 2019-05-05 | p002212 helicity switching |
| r0024 | instrument: SCS; detector: DSSC; sample: 1-Co10_Pt_6fold | 2019-05-05 | p002212 helicity switching |
| r0025 | instrument: SCS; detector: DSSC; sample: Ni-20 MLs - b | 2019-05-05 | p002212 helicity switching |
| r0026 | instrument: SCS; detector: DSSC; sample: Ni75-11 MLs-b | 2019-05-05 | p002212 helicity switching |
| r0027 | instrument: MID; detector: AGIPD; sample: Silica 50 nm | 2019-09-21 | commissioning |

Note

Mock data can be generated using the extra_data package, e.g.:

>>> from extra_data.tests.make_examples import make_agipd_example_file
>>> make_agipd_example_file('agipd_example.h5')

>>> from extra_data.tests.make_examples import write_file, Motor, ADC, XGM
>>> write_file('test_file.h5', [
...     XGM('SPB_XTD1_XGM/XGM/MAIN'),
...     Motor('SPB_DET_MOT/MOTOR/AGIPD_X'),
...     Motor('SPB_DET_MOT/MOTOR/AGIPD_Y'),
...     Motor('SPB_DET_MOT/MOTOR/AGIPD_Z'),
...     ADC('SA1_XTD2_MPC/ADC/1', nsample=0, channels=(
...         'channel_3.output/data',
...         'channel_4.output/data',
...         'channel_5.output/data'))
...     ], ntrains=500, chunksize=50)

This only creates the structure of the files; the data will all be zeros.

[1] Lithium titanate, spinel; nanopowder, <200 nm particle size (BET), >99%; CAS Number 12031-95-7; Empirical formula Li4Ti5O12; https://www.sigmaaldrich.com/catalog/product/aldrich/702277

3.7.2. Public data from EuXFEL in the CXIDB

The following entries at https://cxidb.org/ stem from user experiments done at our facility:

| CXIDB id | description | deposition date | publication DOI |
|----------|-------------|-----------------|-----------------|
| id80 | Authors: Wiedorn et al.; instrument: SPB/SFX; sample: Lysozyme; wavelength: 1.33 Å (9.30 keV) | 2018-08-13 | 10.1038/s41467-018-06156-7 |
| id83 | Authors: Wiedorn et al.; instrument: SPB/SFX; sample: β-lactamase; wavelength: 1.33 Å (9.30 keV) | 2018-08-13 | 10.1038/s41467-018-06156-7 |
| id87 | Authors: Grünbein et al.; instrument: SPB/SFX; sample: Urease, Concanavalin A/B; wavelength: 1.66 Å (7.47 keV) | 2018-09-12 | 10.1038/s41597-019-0010-0 |
| id98 | Authors: Yefanov et al.; instrument: SPB/SFX; sample: Lysozyme; wavelength: 1.33 Å (9.30 keV) | 2020-02-07 | 10.1063/1.5124387 |
| id100 | Authors: Pandey et al.; instrument: SPB/SFX; sample: Photoactive Yellow Protein; wavelength: 1.33 Å (9.30 keV) | 2019-08-12 | 10.11577/1577287 |
| id111 | Authors: Gisriel et al.; instrument: SPB/SFX; sample: Photosystem I; wavelength: 1.33 Å (9.30 keV) | 2020-11-21 | 10.1038/s41467-019-12955-3 |
| id152 | Authors: Echelmeier et al.; instrument: SPB/SFX; sample: KDO8PS; wavelength: 1.33 Å (9.30 keV) | 2021-07-22 | 10.1038/s41467-020-18156-7 |

3.8. Downloading experiment data

Experiments at European XFEL typically generate large amounts of data - from around 10 TB up to petabytes from one beamtime. Because of this, we recommend that you analyse data on the Maxwell cluster rather than downloading it.

If you do need to download experimental data, there are two options:

  • Using Globus (see Globus’ How To). The endpoint is euxfel#euxfel, and you should use your XFEL credentials to authenticate. The metadata catalogue has links to Globus for each proposal & run.

  • Using FTP from ftp.xfel.eu. You can use this with lftp (command line), or FileZilla (GUI). TLS encryption (‘explicit FTPS’) is required. The FTP server is not considered a critical service, so it may be unavailable at times.