The CMS experiment at CERN recently announced the release of 13 TeV proton-proton collision data from 2016. Over 70 TB of collision data and 830 TB of corresponding simulations are now available through the CERN Open Data Portal, marking a significant milestone in open science. This dataset, the first substantial release of 13 TeV collisions, augments the 2015 data and simulations made public in 2021.
Alongside the collision data, CMS has released over 20,000 simulations of various physics processes, new software containers, and a new virtual machine for analysis. Notably, the data is provided in the new "NanoAOD" format, a streamlined and condensed storage format that reduces file sizes by about 95% while preserving key physics information. This format is designed for ease of use, allowing analysis without dedicated CMS software. For comprehensive preservation, a subset of the collision data is provided in an expanded NanoAOD format that includes information about particle candidates from the CMS "particle flow" algorithm.
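To give a sense of what analysis without dedicated CMS software can look like, the following is a minimal sketch in plain Python that computes a dimuon invariant mass from flat per-event columns. The branch names (Muon_pt, Muon_eta, Muon_phi, Muon_mass) follow NanoAOD naming, but the event dictionary, its values, and the helper functions are hypothetical and invented for illustration; reading an actual NanoAOD file would require a ROOT-capable reader, which is not shown here.

```python
import math

# Hypothetical event mimicking NanoAOD's flat per-event columns.
# The values are invented for illustration, not taken from CMS data.
event = {
    "Muon_pt": [45.0, 45.0],        # transverse momentum in GeV
    "Muon_eta": [0.0, 0.0],         # pseudorapidity
    "Muon_phi": [0.0, math.pi],     # azimuthal angle in radians
    "Muon_mass": [0.1057, 0.1057],  # muon mass in GeV
}


def four_vector(pt, eta, phi, mass):
    """Convert (pt, eta, phi, mass) to Cartesian (E, px, py, pz)."""
    px = pt * math.cos(phi)
    py = pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    energy = math.sqrt(px * px + py * py + pz * pz + mass * mass)
    return energy, px, py, pz


def dimuon_mass(evt):
    """Invariant mass of the first two muons, or None if fewer than two."""
    if len(evt["Muon_pt"]) < 2:
        return None
    vectors = [
        four_vector(
            evt["Muon_pt"][i],
            evt["Muon_eta"][i],
            evt["Muon_phi"][i],
            evt["Muon_mass"][i],
        )
        for i in range(2)
    ]
    # Sum the two four-vectors component by component.
    energy, px, py, pz = (sum(component) for component in zip(*vectors))
    mass_squared = energy * energy - px * px - py * py - pz * pz
    # Guard against tiny negative values from floating-point rounding.
    return math.sqrt(max(mass_squared, 0.0))


print(f"dimuon mass: {dimuon_mass(event):.2f} GeV")
```

The two invented muons are back to back with equal momentum, so the computed mass is close to twice their transverse momentum, about 90 GeV.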
The 2016 data released this spring, constituting about half of the total collected that year, has been instrumental in producing over 200 CMS publications. These studies span a range of topics including the nature of the Higgs boson, new and rare physics processes, precision measurements of standard model phenomena, and heavy flavor physics. This extensive dataset is now available for scientists, researchers, educators, and students worldwide to explore, offering a rich resource for advancing knowledge in particle physics.
Julie Hogan, a leader in the CMS Data Preservation and Open Access group, highlights the educational potential of the new data, expressing excitement about its use in university courses. To support users, the group will host its 5th annual Open Data Workshop in July 2024 at CERN, featuring hackathon segments to launch new projects using the data. Register now to participate!
Kati Lassila-Perini, a key figure in the CMS Data Preservation and Open Access group, emphasizes the value of creating a community of users nurtured through regular events organized by CMS. Reflecting on the journey since the inception of the CMS open data policy twelve years ago, Lassila-Perini notes that the interest and outcomes generated by these releases have exceeded expectations. The first publications using CMS Open Data began appearing two years after the initial release, highlighting both the scientific potential of these datasets and the complexity of analyzing them.
Lassila-Perini also points out the importance of feedback from users of CMS Open Data. As users work through the detailed analysis object data files, documentation, and example code, they also help verify that all assets needed for research-level use of the data have been made available. Feedback is collected at the CMS Open Data workshop and on the CERN Open Data Forum. Beyond research use, simplified datasets and examples make CMS Open Data reusable in various educational contexts, from schools to university physics courses. Tools like Jupyter notebooks facilitate easy access and usage, promoting broader engagement with the data.
Additionally, CMS has published the COMBINE software, developed during the first LHC run for Higgs boson searches. COMBINE, available as a container image with comprehensive documentation, supports statistical modeling and fitting of data. The statistical models, which can be very complex with hundreds of parameters, and the relevant data are now accessible in electronic format. External physicists can therefore integrate CMS measurements into their studies with greater precision, which makes the data considerably more useful for detailed external analyses. Releasing the statistical models also led the CMS team to establish an internal protocol for collecting models for future publications. This work is part of the Common Analysis Tools group’s effort to harmonize and modernize data-science tools across the collaboration.
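As a rough illustration of what statistical modeling and fitting means in this context, the sketch below fits a signal strength to a single-bin counting experiment by scanning a Poisson likelihood. It is a toy under stated assumptions and does not use or represent the COMBINE interface: the function names, the event counts, and the grid-scan approach are all hypothetical, and real CMS models involve many bins and hundreds of nuisance parameters.

```python
import math


def nll(mu, n_obs, signal, background):
    """Negative log-likelihood of a single-bin Poisson counting model.

    The expected yield is mu * signal + background, where mu is the
    signal strength being fitted.
    """
    expected = mu * signal + background
    return expected - n_obs * math.log(expected) + math.lgamma(n_obs + 1)


def fit_signal_strength(n_obs, signal, background, lo=0.0, hi=5.0, steps=5001):
    """Return the mu on a uniform grid that minimizes the NLL."""
    best_mu, best_nll = lo, float("inf")
    for i in range(steps):
        mu = lo + (hi - lo) * i / (steps - 1)
        value = nll(mu, n_obs, signal, background)
        if value < best_nll:
            best_mu, best_nll = mu, value
    return best_mu


# Invented numbers: 25 observed events, 10 expected signal, 12 expected background.
mu_hat = fit_signal_strength(n_obs=25, signal=10.0, background=12.0)
print(f"best-fit signal strength: {mu_hat:.3f}")
```

For this one-bin model the likelihood is maximized analytically at (n_obs - background) / signal, which is 1.3 for the invented inputs, so the grid scan should land within one grid step of that value. A full statistical model adds constrained nuisance parameters for each systematic effect, which is the level of detail that simplified summaries leave out.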
To further advance open access to science, the CMS collaboration has released the statistical model and data used for early Higgs boson measurements, encompassing all Higgs boson searches leading to the 2012 discovery. The release includes the full statistical model, which is vital for precise measurements and details all relevant systematic effects, offering deeper insight than simplified summaries.
Hogan acknowledges the collaborative effort within CMS to develop innovative algorithms, data formats, and analysis tools, and invites the wider research community to engage with the data and provide feedback via the CERN Open Data Forum. All CMS open data is released into the public domain under the Creative Commons CC0 waiver and is accessible through the CERN Open Data Portal, which is developed openly on GitHub by CERN's Information Technology team.
This release not only exemplifies CMS's dedication to open science but also enhances the tools and resources available to the global scientific community, driving forward our understanding of the universe. Explore the 13 TeV data and simulations on the CERN Open Data Portal.
CMS thanks CERN for their invaluable support in providing the resources and expertise necessary to build and maintain the CERN Open Data portal. This achievement would not have been possible without the dedication and tireless efforts of numerous CMS collaboration members who played a crucial role in preparing and releasing this latest batch of open data.
Acknowledgment: The author would like to acknowledge Julie Hogan (Bethel University) and Kati Lassila-Perini (Helsinki Institute of Physics) for their contributions. Additionally, he extends his gratitude to Clemens Lange (PSI) and Piergiulio Lenzi (Università degli Studi di Firenze & INFN) for their support and fruitful comments, and to Nick Wardle (Imperial College London) for his valuable feedback.