Data Files¶
Data Policy¶
The full data policy of European XFEL is available at https://www.xfel.eu/users/policies/index_eng.html, with a short summary provided in the Overview Page.
Data Folders¶
On both the Online Cluster and the Offline Cluster, European XFEL data is stored in /gpfs/exfel/exp. Each instrument has a directory containing cycles, each cycle contains proposals, and each proposal contains the collected data as well as locations for users to store their code and data outputs:
/gpfs/exfel/exp/
├── {INSTRUMENT}
│ ├── {CYCLE}
│ │ ├── p{PROPOSAL}
│ │ └── ...
│ └── ...
└── ...
More Information
The contents of the proposal directory are explained in:
For example, if you have an experiment at SPB, cycle 201701, with proposal number 2012, then the proposal directory will be /gpfs/exfel/exp/SPB/201701/p002012.
The raw data from each run goes in a subfolder such as raw/r0104. Once this has been migrated to the offline cluster, corrected detector data will be automatically produced in another subfolder such as proc/r0104.
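The path layout described above can be sketched in a few lines of Python. This is only an illustration: the helper names `proposal_dir` and `run_dir` are hypothetical, and the zero-padding widths (six digits for proposals, four for runs) are assumptions inferred from the examples p002012 and r0104.

```python
from pathlib import PurePosixPath

EXP_ROOT = PurePosixPath("/gpfs/exfel/exp")


def proposal_dir(instrument: str, cycle: str, proposal: int) -> PurePosixPath:
    """Return the proposal directory, e.g. .../SPB/201701/p002012."""
    # Assumption: proposal numbers are zero-padded to six digits.
    return EXP_ROOT / instrument / cycle / f"p{proposal:06d}"


def run_dir(instrument: str, cycle: str, proposal: int, run: int,
            kind: str = "raw") -> PurePosixPath:
    """Return the raw or proc folder of one run, e.g. .../raw/r0104."""
    if kind not in ("raw", "proc"):
        raise ValueError(f"kind must be 'raw' or 'proc', got {kind!r}")
    # Assumption: run numbers are zero-padded to four digits.
    return proposal_dir(instrument, cycle, proposal) / kind / f"r{run:04d}"


print(proposal_dir("SPB", "201701", 2012))
print(run_dir("SPB", "201701", 2012, 104, kind="proc"))
```

`PurePosixPath` is used so that the sketch only builds path strings and never touches the file system; on the cluster itself, `pathlib.Path` would work the same way.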
Reading Data in Python¶
More Information
The volume of data and the number of instruments make even something as simple as viewing a single image from a large detector non-trivial. It involves opening multiple files, each with its own internal hierarchical data structure, finding and reading the correct slices of data out of them, and then bringing it all together.
To make data access as easy as possible we provide a Python package named EXtra-data to read data from European XFEL.
We also provide a tool to create a virtual CXI file, which can be used with any tools that take CXI-style data: HDF5 Virtualise.
Combining Detector Data from Multiple Modules¶
The pixel detectors (AGIPD and LPD) record data in separate files for each of their 16 modules.
The EXtra-data Python library can combine detector modules into a numpy array, as shown in this example.
Alternatively, the modules can be combined in a single view as an HDF5 virtual dataset with the extra-data-make-virtual-cxi command, allowing the data to be processed by external tools such as CrystFEL. Use of this tool is covered in How to Make Virtual CXI Data Files.
Geometry Files¶
Geometry files specify the location of the detector modules in real space. Please contact your instrument scientist regarding obtaining geometry files for the detector at each instrument.
EXtra-geom is a Python library used to describe the physical layout of multi-module detectors at European XFEL, and to assemble complete detector images.
One geometry file for the FXE LPD detector is available online.
One geometry file for the SPB/SFX AGIPD1M detector is available from https://cxidb.org/id-83.htm (Experiment by Anton Barty).
EXtra-geom can read both file formats; see for example Assembling Detector Data into Images.
GeoAssembler can be used to create or adjust geometry files by visually moving detector quadrants around.
A more systematic provision of geometry files is in preparation.
Data Format¶
Experimental data are taken in the context of the following categories:
Context | Description |
---|---|
Instrument | Each instrument has its own label. For each instrument, there are multiple cycles |
Cycle | A scheduling period in which multiple user experiments will take place (of the order of months) |
Proposal | Each user experiment (beamtime) gets a proposal number |
Run | A new run starts when a user starts acquiring data, and lasts until that data acquisition is stopped |
Train id | There are 10 pulse trains per second |
Pulse id | Up to 2700 pulses per train, individually counted. Counter starts from zero for every train |
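
As a worked example of the train and pulse numbers in the table above, the following sketch computes upper bounds on the number of trains and pulses in a run. The constant and function names are hypothetical, the 5-minute run duration is an arbitrary illustration, and the valid pulse id range of 0 to 2699 is an assumption derived from "up to 2700 pulses" with a counter starting at zero.

```python
TRAINS_PER_SECOND = 10        # from the table: 10 pulse trains per second
MAX_PULSES_PER_TRAIN = 2700   # from the table: up to 2700 pulses per train


def trains_in_run(duration_s: float) -> int:
    """Number of complete trains delivered during a run of the given length."""
    return int(duration_s * TRAINS_PER_SECOND)


def max_pulses_in_run(duration_s: float) -> int:
    """Upper bound on pulses in a run; real runs usually use fewer per train."""
    return trains_in_run(duration_s) * MAX_PULSES_PER_TRAIN


def is_valid_pulse_id(pulse_id: int) -> bool:
    """Pulse ids restart at zero for every train (assumed range: 0..2699)."""
    return 0 <= pulse_id < MAX_PULSES_PER_TRAIN


print(trains_in_run(300))       # trains in a 5-minute run
print(max_pulses_in_run(300))   # upper bound on pulses in that run
```

Because the pulse counter restarts for every train, a single pulse is only identified uniquely by the pair (train id, pulse id).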
We distinguish different types of data:
Type | Description |
---|---|
Control Data | one entry for each train, even if the value changes less often than that |
Instrument Data | may have zero, one or multiple entries per train. Your main experimental results, e.g. from X-ray detectors, will usually be 2d/1d detector data |
Run data | is a superset of Control data, captured once per run |
Data is stored in HDF5 files; there may be tens to thousands of files in a single run. We aim to enable you to analyse data without needing to know the details of the file structure, e.g. by using EXtra-data in Python, or by generating a CXI file to represent a run as previously mentioned.
If you do need to read the EuXFEL HDF5 files yourself, however, the structure is described in the data files format page of the EXtra-data docs.
HDF5 Chunking & Compression¶
Both raw and corrected data may be stored using the HDF5 chunked layout. Some parts of the corrected data are compressed using the gzip compression filter in HDF5. In particular, detector gain stage and mask datasets compress well, saving a lot of disk space.
You can examine compression and chunk sizes using the GUI HDF View tool, our h5glance command line tool, or h5ls -v:
$ h5glance /gpfs/exfel/exp/XMPL/201750/p700000/raw/r0803/RAW-R0803-AGIPD00-S00000.h5 \
INSTRUMENT/SPB_DET_AGIPD1M-1/DET/0CH0:xtdf/image/data
/gpfs/exfel/exp/XMPL/201750/p700000/raw/r0803/RAW-R0803-AGIPD00-S00000.h5/INSTRUMENT/SPB_DET_AGIPD1M-1/DET/0CH0:xtdf/image/data
dtype: uint16
shape: 16000 × 2 × 512 × 128
maxshape: Unlimited × 2 × 512 × 128
layout: Chunked
chunk: 16 × 2 × 512 × 128
compression: None (options: None)
...
$ h5ls -v /gpfs/exfel/exp/XMPL/201750/p700000/raw/r0803/RAW-R0803-AGIPD00-S00000.h5/INSTRUMENT/SPB_DET_AGIPD1M-1/DET/0CH0:xtdf/image/data
Opened "/gpfs/exfel/exp/XMPL/201750/p700000/raw/r0803/RAW-R0803-AGIPD00-S00000.h5" with sec2 driver.
data Dataset {16000/Inf, 2/2, 512/512, 128/128}
Location: 1:12333
Links: 1
Modified: 2017-11-20 04:57:44 CET
Chunks: {16, 2, 512, 128} 4194304 bytes
Storage: 4194304000 logical bytes, 4194304000 allocated bytes, 100.00% utilization
Type: native unsigned short
The compressed datasets are stored with a single detector frame per chunk, to minimise the impact on analysis code reading the data.
If you observe pathologically slow reading, check whether you are accessing a compressed dataset with a chunk size larger than one frame. HDF5 decompresses an entire chunk at once, and it may be redoing this for each frame you read. You can avoid this by setting a cache size large enough to hold one complete chunk. The necessary C code looks something like this:
hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
// Set a 32 MB cache size (calculate at least the size of one chunk)
H5Pset_chunk_cache(dapl, H5D_CHUNK_CACHE_NSLOTS_DEFAULT, 32 * 1024 * 1024, 1);
hid_t h5_dataset_id = H5Dopen(h5_file_id, ".../image/gain", dapl);
To benefit from chunk caching, you need to reuse the opened dataset ID for successive reads, instead of opening and closing it to read each frame.
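To decide how large the chunk cache needs to be, multiply the chunk dimensions by the size of one element. The sketch below does this in plain Python for the chunk shape and dtype shown in the listing above (16 × 2 × 512 × 128, uint16); the helper names `chunk_nbytes` and `cache_holds_chunk` are hypothetical and the calculation is only an illustration, not part of any European XFEL tool.

```python
from math import prod


def chunk_nbytes(chunk_shape: tuple, itemsize: int) -> int:
    """Uncompressed size of one HDF5 chunk in bytes."""
    return prod(chunk_shape) * itemsize


def cache_holds_chunk(cache_nbytes: int, chunk_shape: tuple, itemsize: int) -> bool:
    """True if a chunk cache of cache_nbytes can hold one complete chunk."""
    return cache_nbytes >= chunk_nbytes(chunk_shape, itemsize)


chunk_shape = (16, 2, 512, 128)   # chunk shape from the h5glance output
itemsize = 2                      # uint16 is 2 bytes per element

print(chunk_nbytes(chunk_shape, itemsize))                          # bytes per chunk
print(cache_holds_chunk(32 * 1024 * 1024, chunk_shape, itemsize))   # 32 MB cache
```

The result for this shape agrees with the 4194304 bytes per chunk reported by h5ls -v, so the 32 MB cache set in the C example would hold several such chunks.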
Example Data¶
Some example datasets are available so you can try reading the files before your experiment. There may be differences, e.g. in naming, when you collect new data, so it's a good idea to talk to the relevant instrument group and the data analysis group at European XFEL as well.
Example Runs on Maxwell¶
We have prepared an environment that mimics a real experiment data cycle at the European XFEL. For this, we have a fake instrument called XMPL, which contains runs giving an overview of the data to expect. This data is made available on Maxwell:
/gpfs/exfel/exp/XMPL/201750/p700000
It follows the same structure as every experiment (see Offline Analysis - Offline Storage for more details), and will be used to share examples of the file formats generated at the facility, from all instruments and detectors.
These datasets are also linked to the Metadata catalog (MDC), where information about the data (instrument, detector, sample, date, ...) can be found. Each run dataset comprises raw data (in .../p700000/raw/run_id), calibrated data (in .../p700000/proc/run_id) and a set of sample scripts to read the data (in .../p700000/usr/run_id).
List of sample data sets:
Run ID | Instrument | Detector/Device | Sample | Run Type | Date | Comments |
---|---|---|---|---|---|---|
r0001 | SPB | AGIPD | Water | Standard | 2018-04-03 | Commissioning |
r0002 | SPB | AGIPD | Lysozyme (liquid) | Standard | 2018-04-03 | Commissioning |
r0003 | SPB | AGIPD | Lysozyme (liquid) | Standard | 2018-04-03 | Commissioning |
r0004 | SPB | AGIPD | Lysozyme (liquid) | Standard | 2018-04-03 | Commissioning |
r0005 | SPB | AGIPD | Lithium titanate | Standard | 2018-08-18 | Geometry calibration |
r0006 | SPB | AGIPD | Lithium titanate [1] | Standard | 2017-11-20 | Commissioning |
r0007 | FXE | LPD | Aqueous solution of [Fe(bpy)3]2+ | Standard | 2017-09-18 | User Run |
r0008 | SA1_XTD2 | XGM | N/A | Standard | 2019-02-15 | Commissioning (XPD) |
r0009 | SA3_XTD10 | XGM | N/A | Standard | 2019-02-15 | Commissioning (XPD) |
r0010 | SPB | AGIPD | N/A | Calibration - Dark high gain | 2019-08-10 | Commissioning |
r0011 | SPB | AGIPD | N/A | Calibration - Dark medium gain | 2019-08-10 | Commissioning |
r0012 | SPB | AGIPD | N/A | Calibration - Dark low gain | 2019-08-10 | Commissioning |
r0013 | SPB | AGIPD | Lysozyme | Standard | 2019-08-11 | Commissioning |
r0014 | SPB | AGIPD | Lysozyme | Standard | 2019-08-11 | Commissioning |
r0015 | SPB | AGIPD | Lysozyme | Standard | 2019-08-11 | Commissioning |
r0016 | SPB | AGIPD | Lysozyme | Standard | 2019-08-11 | Commissioning |
r0017 | SPB | AGIPD | Lysozyme | Standard | 2019-08-11 | Commissioning |
r0018 | SPB | AGIPD | Lysozyme | Standard | 2019-08-11 | Commissioning |
r0019 | SQS | Digitizer | Xenon | Standard | 2019-10-11 | Commissioning |
r0020 | SQS | Digitizer | Xenon | Standard | 2019-10-11 | Commissioning |
r0021 | SPB | Jungfrau | Lysozyme | Standard | 2019-05-05 | IRDa commissioning |
r0022 | SPB | Jungfrau | Lysozyme | Standard | 2019-05-05 | IRDa commissioning |
r0023 | SCS | DSSC | 2-Co8_pt14_8fold - 30nm Pt cap | Standard | 2019-05-05 | p002212 helicity switching |
r0024 | SCS | DSSC | 1-Co10_Pt_6fold | Standard | 2019-05-05 | p002212 helicity switching |
r0025 | SCS | DSSC | Ni-20 MLs - b | Standard | 2019-05-05 | p002212 helicity switching |
r0026 | SCS | DSSC | Ni75-11 MLs-b | Standard | 2019-05-05 | p002212 helicity switching |
r0027 | MID | AGIPD | Silica 50 nm | Standard | 2019-09-21 | Commissioning |
Note
Mock data can be generated using the extra_data package, e.g.:
>>> from extra_data.tests.make_examples import make_agipd_example_file
>>> make_agipd_example_file('agipd_example.h5')
>>> from extra_data.tests.make_examples import write_file, Motor, ADC, XGM
>>> write_file('test_file.h5', [
...     XGM('SPB_XTD1_XGM/XGM/MAIN'),
...     Motor('SPB_DET_MOT/MOTOR/AGIPD_X'),
...     Motor('SPB_DET_MOT/MOTOR/AGIPD_Y'),
...     Motor('SPB_DET_MOT/MOTOR/AGIPD_Z'),
...     ADC('SA1_XTD2_MPC/ADC/1', nsample=0, channels=(
...         'channel_3.output/data',
...         'channel_4.output/data',
...         'channel_5.output/data'))
... ], ntrains=500, chunksize=50)
This only creates the structure of the files; the data will all be zeros.
Public Data from EuXFEL in CXIDB¶
The following entries at https://cxidb.org/ stem from user experiments done at our facility:
CXIDB ID | Instrument | Authors | Sample | Wavelength | Deposition Date | Publication DOI |
---|---|---|---|---|---|---|
id80 | SPB/SFX | Wiedorn et al. | Lysozyme | 1.33 Å (9.30 keV) | 2018-08-13 | 10.1038/s41467-018-06156-7 |
id83 | SPB/SFX | Wiedorn et al. | β-lactamase | 1.33 Å (9.30 keV) | 2018-08-13 | 10.1038/s41467-018-06156-7 |
id87 | SPB/SFX | Grünbein et al. | Urease, Concanavalin A/B | 1.66 Å (7.47 keV) | 2018-09-12 | 10.1038/s41597-019-0010-0 |
id98 | SPB/SFX | Yefanov et al. | Lysozyme | 1.33 Å (9.30 keV) | 2020-02-07 | 10.1063/1.5124387 |
id100 | SPB/SFX | Pandey et al. | Photoactive Yellow Protein | 1.33 Å (9.30 keV) | 2019-08-12 | 10.11577/1577287 |
id111 | SPB/SFX | Gisriel et al. | Photosystem I | 1.33 Å (9.30 keV) | 2020-11-21 | 10.1038/s41467-019-12955-3 |
id152 | SPB/SFX | Echelmeier et al. | KDO8PS | 1.33 Å (9.30 keV) | 2021-07-22 | 10.1038/s41467-020-18156-7 |
Linking Publications to Data
A Digital Object Identifier (DOI) will be generated for each successful proposal. Each publication should reference the DOI of the data.
Downloading Experiment Data¶
Experiments at European XFEL typically generate large amounts of data - from around 10 TB up to petabytes from one beamtime. Because of this, we recommend that you analyse data on the Maxwell cluster rather than downloading it.
If you do need to download experimental data, there are two options:
- Using Globus (see Globus How To). To log in to the EuXFEL endpoint, choose the organizational login ("Use your existing organizational login"), search for "DESY" and select "Deutsches Elektronen-Synchrotron DESY", click Continue, and enter your EuXFEL credentials followed by your OTP token to authenticate. To view the data, click the "Collection" search box, enter "EuXFEL", and select the "EuXFEL Data Collection".
- Using FTP from ftp.xfel.eu. You can use this with lftp (command line) or FileZilla (GUI). TLS encryption ('explicit FTPS') is required. The FTP server is not considered a critical service, so it may be unavailable at times.
[1] Lithium titanate, spinel; nanopowder, <200 nm particle size (BET), >99%; CAS Number 12031-95-7; Empirical formula Li4Ti5O12; https://www.sigmaaldrich.com/catalog/product/aldrich/702277