LHC experiments release data in CERN's Open Data Portal

On 20 November 2014, CERN launched the Open Data Portal, making data from real collision events at the LHC experiments available for the first time to the general public. This project is part of the Organisation's policy of openness, which is enshrined in its founding convention and has contributed to the creation of the open internet, the development of open source, and the dissemination of open access publications. In this framework, the LHC collaborations recently approved Open Data policies and will release collision data over the coming years.

"Data from the LHC programme are among the most precious assets of the LHC experiments, that today we start sharing openly with the world. We hope these open data will support and inspire the global research community, including students and citizen scientists," said Rolf Heuer, CERN Director-General.

The purpose of the ODP is to publish and archive data obtained by the CERN experiments, making them available to everybody for further analysis or for use as educational material. Its development, like most projects at CERN, was a collaborative process, requiring the concerted efforts and hard work of digital-library experts, data curators, metadata experts, researchers, and outreach teams from the four LHC experiments. The portal uses a number of different technologies to distribute and give access to the data, namely Invenio, CernVM and EOS.

The Invenio Digital Library software enables users to run their own digital library on the web, offering a valuable tool in digital-library management. CernVM is a baseline virtual machine already used by the LHC experiments, enabling users around the world to develop and run LHC data analysis locally and on institutional and commercial computer clouds. Finally, EOS is a disk-based service that provides a low-latency storage infrastructure for physics users. Building on these technologies, the new ODP assigns digital object identifiers to the datasets and code, with the aim of organising the content effectively. Besides event datasets, users will also find open-source software with which to analyse the data provided.

All four LHC experiments participate in the ODP, each publicly releasing a number of datasets customised for demonstration and educational purposes. The experiments worked closely with DPHEP and CERN IT/GS on:

- Implementing a common approach and generic solutions for data preservation and open access

- Using the same data preservation principles and experiment policy guidelines

- Developing the open access portal, a common analysis-preservation framework, and the use of virtualisation technology

It should be noted that all four LHC experiments have approved data preservation and access policies stating that they will make part of their data available to the public, with the exception of the raw data (which is in any case not available for direct access even by the collaboration members themselves).

ALICE is currently releasing about 8 TB of reconstructed event data from the 2010 proton-proton and lead-lead runs, which are being staged and indexed on the CERN data preservation portal. The analysis tools available on the portal currently allow only basic transverse-momentum and pseudorapidity distribution plots, but more advanced analyses will be possible in future releases. A set of outreach and educational analysis exercises has also been made available on the portal. They are based on specially selected ALICE data, are widely used in the particle physics masterclasses, and come in the form of analysis packages and small datasets organised as ROOT files. Although the tools are simplified, users get a feel for the real tools employed by physicists for data analysis. Each analysis downloads the required software and data on demand from a common graphical interface.

These exercises highlight some of the ALICE physics. One concerns the search for particles containing strange quarks, identified through their V0 decays; the motivation is to give an insight into how strangeness enhancement, one of the first signatures of the Quark-Gluon Plasma, is observed. Another exercise examines charged-particle tracks, with the aim of calculating the nuclear modification factor by comparing particle yields in lead-lead and proton-proton collisions; the fact that this factor is less than one indicates suppression of charged particles due to interactions of partons with the QGP.
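The nuclear modification factor comparison described above can be sketched in a few lines of Python. The per-bin yields and the number of binary collisions below are illustrative placeholders, not real ALICE measurements:

```python
# Hypothetical sketch of the nuclear modification factor,
#   R_AA = (dN_PbPb/dpT) / (N_coll * dN_pp/dpT),
# computed bin by bin. All numbers here are illustrative only.

def r_aa(yield_pbpb, yield_pp, n_coll):
    """Nuclear modification factor per pT bin."""
    return [y_aa / (n_coll * y_pp) for y_aa, y_pp in zip(yield_pbpb, yield_pp)]

# Illustrative per-pT-bin charged-particle yields (arbitrary normalisation)
yield_pp = [12.0, 5.0, 2.0, 0.8]           # proton-proton reference
yield_pbpb = [3000.0, 900.0, 250.0, 80.0]  # central lead-lead
n_coll = 1500                              # assumed number of binary collisions

for pt_bin, r in zip(["1-2", "2-4", "4-8", "8-16"],
                     r_aa(yield_pbpb, yield_pp, n_coll)):
    tag = "suppressed" if r < 1 else "enhanced"
    print(f"pT {pt_bin} GeV/c: R_AA = {r:.2f} ({tag})")
```

With R_AA below one in every bin, the toy numbers reproduce the qualitative signature of parton energy loss in the QGP that the exercise is designed to show.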

ATLAS has also released a large dataset. Based on these data, the collaboration organised the Higgs Machine Learning Challenge last year. The Challenge, which ran from May to September 2014, was to develop an algorithm that improved the detection of the Higgs boson signal. The sample used simulated Higgs bosons decaying into two tau leptons inside the ATLAS detector. Participants applied and developed cutting-edge machine-learning techniques, which have been shown to outperform existing traditional high-energy-physics tools. The dataset is now housed in the CERN Open Data Portal, where it will be available permanently for educational and outreach purposes. The 60 MB zipped ASCII file can be decoded without special software, and a few scripts are provided to help users get started. Detailed documentation for physicists and data scientists is also available. Thanks to the Digital Object Identifiers (DOIs) in the CERN Open Data Portal, the dataset and accompanying material can be cited like any other paper.

CMS released, simultaneously with the launch of the CERN Open Data Portal, a first large set of reconstructed data for public use. This dataset comprises 27 TB of proton-proton collision data at 7 TeV recorded during 2010, including fourteen primary datasets in AOD format. This is the first public release of such high-level data in high-energy physics. In addition to the primary datasets, CMS provides some examples of further reprocessed data derived from them. These derived datasets are meant to be used with CMS analysis software (where reprocessing reduces the time needed for the final analysis) or with online web applications (where reprocessing reduces the complexity of the data in terms of content and format). Several examples of data usage can already be reported, ranging from dedicated physics studies and purely "data science" uses to outreach and education activities, although it is still too early to present final results of these projects.

High-school students analysing CMS open data as part of the Physics Masterclasses.

The LHCb data archived in the Open Data Portal consist of a sample of about 60,000 events collected during the 2011 data taking. The events contain D0 mesons decaying into kaons and pions, and can be visualised and analysed via dedicated software based on the ROOT framework and developed for the International Masterclass programme. The software comes as a virtual-machine image and allows users first to visualise the events and interactively select the D0 decays, and then to measure the lifetime of the D0 meson, as in a real physics analysis.
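The lifetime measurement in that last step can be sketched with toy data: for an exponentially distributed decay time, the maximum-likelihood estimate of the lifetime is simply the sample mean. The decay times below are simulated, not real LHCb events; the true D0 lifetime of about 0.41 ps is the known PDG-level value:

```python
import random

# Toy sketch of the masterclass lifetime measurement. For an exponential
# decay-time distribution, the maximum-likelihood lifetime estimate is the
# mean of the measured decay times. Data here are simulated, not LHCb events.

TRUE_TAU_PS = 0.41  # approximate D0 lifetime in picoseconds

def estimate_lifetime(decay_times):
    """ML estimator for an exponential distribution: tau_hat = mean(t)."""
    return sum(decay_times) / len(decay_times)

random.seed(42)
# Simulate 60,000 decay times, matching the size of the released sample.
# random.expovariate takes the rate lambda = 1 / tau.
times = [random.expovariate(1.0 / TRUE_TAU_PS) for _ in range(60_000)]

tau_hat = estimate_lifetime(times)
print(f"estimated D0 lifetime: {tau_hat:.3f} ps (true: {TRUE_TAU_PS} ps)")
```

The real exercise fits the distribution of selected candidates rather than taking a plain mean (backgrounds and resolution matter there), but the toy conveys the statistical core of the measurement.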

The LHC experiments plan to release more collision data and analysis tools over the coming years to increase the scope and uses of the ODP. This open access initiative is only the beginning of a large-scale effort to be pursued even beyond the experiments' lifetime.

 

The author would like to thank Mihaela Gheata (ALICE), Roger Jones (ATLAS), Katri Lassila-Perini (CMS) and Silvio Amerio (LHCb) for their contributions to this article.