
ATLAS and CMS joint bootcamp for analysis preservation

Last month, 30 young graduate students and postdocs gathered at CERN to attend the first joint ATLAS+CMS analysis preservation bootcamp, organised by Sam Meehan, Clemens Lange, Lukas Heinrich (CERN), and Savannah Thais (Princeton). Over the course of three days, the workshop participants learnt, through hands-on tutorials and with the help of a great team of volunteer mentors, how to make their analyses reproducible using state-of-the-art software tools.

Figure 1: Thumbs up from the participants of the first joint bootcamp on analysis preservation (Credits: Samuel Meehan).

Starting from an example analysis distributed as a zip archive, the highly motivated crowd familiarised itself with the concept of continuous integration using CERN’s GitLab installation. This platform allows everyone to run automated tests on their analysis code, from simply making sure that the code compiles to more sophisticated checks, such as running the full analysis on actual data and simulation files to validate that recent code changes did not break the analysis logic.
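To give a flavour of what such a setup looks like, the sketch below shows a minimal GitLab CI configuration (a .gitlab-ci.yml file) of the kind covered in the tutorials. The image, script and file names are illustrative assumptions, not the actual bootcamp material.

```yaml
# Illustrative .gitlab-ci.yml sketch: image, paths and commands are assumed
# for the example and are not taken from the bootcamp exercises.
stages:
  - build
  - test

compile:
  stage: build
  image: gitlab-registry.cern.ch/linuxsupport/cc7-base   # assumed CERN-provided base image
  script:
    - mkdir -p build && cd build
    - cmake .. && make                                    # does the analysis code still compile?
  artifacts:
    paths:
      - build/                                            # keep the binaries for the next stage

validate:
  stage: test
  image: gitlab-registry.cern.ch/linuxsupport/cc7-base
  script:
    - ./build/run_analysis --input test_sample.root --output results.root   # hypothetical full test run
    - python compare_to_reference.py results.root                           # hypothetical check against reference values
```

With a configuration along these lines, every change pushed to the repository triggers the pipeline automatically, so a broken build or a changed result is spotted immediately.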

The focus of the second day was software containers: once the analysis code is under version control on GitLab, the tested and compiled code can be packaged into a so-called software container image, which includes everything needed to run the analysis: code, runtime, system tools, system libraries and settings. Due to the huge amounts of data processed in high-energy physics, the data sets themselves are, however, usually not included. These container images can be versioned, so that one knows exactly which code was run at which time, and can be executed on high-throughput batch processing systems such as CERN’s HTCondor-based batch service. In addition, this makes it possible to leverage on-demand remote compute clouds with effectively unlimited computing power.
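As a rough illustration, such a container image could be described by a Dockerfile along the following lines. The base image, paths and build commands are assumptions made for the sake of the example, and, as noted above, no data sets are copied into the image.

```dockerfile
# Illustrative Dockerfile sketch: base image, paths and commands are assumed.
FROM rootproject/root:6.26.10-ubuntu22.04   # assumed ROOT base image

# Copy only the analysis code into the image; the (large) data sets stay outside
COPY . /analysis
WORKDIR /analysis

# Compile once at build time, so the image ships ready-to-run code
RUN mkdir -p build && cd build && cmake .. && make

# Default command; input files are mounted or streamed in at runtime
ENTRYPOINT ["/analysis/build/run_analysis"]
```

Because the resulting image is tagged and stored in a registry, anyone (or any batch system) can later pull exactly the same environment and re-run the analysis unchanged.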

However, one does not have to look far to be able to make use of cloud computing. REANA, a data analysis platform for reproducible research with a focus on high-energy physics workloads, is primarily developed at CERN. The third day of the bootcamp therefore culminated in a tutorial on using REANA, demonstrating how cloud computing and workflow technologies can work together, effectively allowing full-scale analyses to be executed, and therefore also reproduced, at the push of a button.
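In REANA, an analysis is described by a workflow specification, typically a reana.yaml file listing inputs, steps and outputs. The sketch below is a minimal, hypothetical example; the container image, commands and file names are assumptions, not the workflow used at the bootcamp.

```yaml
# Hypothetical reana.yaml sketch: image, commands and file names are assumed.
inputs:
  files:
    - fit_config.yaml                     # assumed analysis configuration file
workflow:
  type: serial                            # REANA also supports CWL, Yadage and Snakemake workflows
  specification:
    steps:
      - name: skim
        environment: 'gitlab-registry.cern.ch/mygroup/myanalysis:latest'   # assumed container image
        commands:
          - ./build/run_analysis --input /data/opendata_sample.root --output skim.root
      - name: fit
        environment: 'gitlab-registry.cern.ch/mygroup/myanalysis:latest'
        commands:
          - python fit.py --input skim.root --config fit_config.yaml --output fit_results.pdf
outputs:
  files:
    - fit_results.pdf                     # final result retrieved from the cloud
```

Once such a specification exists, the whole chain can be submitted to the REANA cloud with the reana-client command-line tool, which is what "reproducible at the push of a button" means in practice.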

Figure 2: The third day of the workshop was dedicated to the REANA platform, one of the key tools for reproducible research in high-energy physics. (Credits: Samuel Meehan)

In the afternoons, the participants split into two groups to focus on the details of ATLAS- and CMS-specific analysis software and authentication methods. In addition, the ATLAS group learnt about the RECAST project, whereas the CMS participants became acquainted with the CERN Analysis Preservation platform. Thanks to the HEP Software Foundation and the IRIS-HEP project, instructors, mentors, and participants could also exchange their experiences across LHC collaborations over dinner.

CERN has embarked on an initiative to share the data collected by the LHC experiments with the public as open data. The analysis preservation bootcamp made direct use of this by basing its morning tutorials on CMS Open Data and a published example analysis. However, there is no explicit guarantee that analysis code published online still works several years after it was originally written. Furthermore, as the amount of data collected and analysed at the LHC increases, the real "logic" of an analysis is increasingly encoded in software, from the precise way in which collisions are selected to the statistical treatment leading to the final measurement. This points to the need to preserve any analysis and ensure its reproducibility.

It is important to be able to easily re-run previous analyses, ideally while reducing the required resources (time and computing power). Moreover, ensuring the reproducibility of an analysis allows theorists to reinterpret the results of previous searches and to test new theories relatively quickly across a large part of parameter space, a particularly important capability as we enter a data-driven, exploratory era in particle physics.

This workshop is the latest in a long series of initiatives by both the ATLAS and CMS Collaborations (in collaboration with CERN's IT Department and the Research and Computing Sector) to train physicists on the available tools and resources. It should be noted that these efforts complement open data initiatives such as CERN's Open Data portal in a vital way, as also covered last year in a focus issue of the CERN Courier.

The bootcamp also offered the organisers the opportunity to gather feedback from the community and to discuss possible improvements that could help these tools and methodologies become widely adopted in future analyses. With the second run of the LHC finished and the upgrade for Run-3 underway, some time will pass until the four LHC experiments have collected enough new data to significantly increase the sensitivity to Beyond the Standard Model phenomena or the precision of Standard Model measurements. In turn, this implies that the analyses currently being finalised on the full Run-2 dataset will remain the most precise studies of physics at the LHC energy scale for a long time. It is thus crucial that the details of those analyses are fully preserved, and the bootcamp was one further step in this direction.