Online use of AI within ALICE

A reconstructed event of particles in the ALICE TPC, with data taken with the neural network cluster finding algorithm. Pink dots are reconstructed clusters.

AI applications have rapidly expanded in everyday life, and ALICE, among other high-energy physics experiments at CERN, is increasingly exploring their use in fundamental physics research. This is particularly relevant for online data taking, where large data volumes, dense collision environments, and heterogeneous computing architectures require physics-aware and computationally efficient algorithms.

Run 3 served as an important testbed for AI and machine-learning methods under realistic operational constraints. Together with recent advances in ML and LLM-based techniques, this opens promising perspectives for Run 4.

AI-assisted cluster finding in online reconstruction

Heavy-ion collisions in ALICE create extremely track-dense environments, requiring a novel approach to cluster finding when charge overlap is high in the ALICE Time Projection Chamber (TPC). Heterogeneous computing architectures, combining CPUs and GPU acceleration, make ALICE online processing particularly well-suited to highly parallelizable algorithms. For cluster finding, two neural networks were trained to reject high-inclination clusters from low-momentum looping tracks before the tracking stage (see Fig. 1) and to improve cluster properties, such as total charge and centre of gravity.

Reducing the number of clusters yields a direct disk space saving of approximately 16% (out of 130 TB of online storage capacity) without observable losses in physics performance. This was verified through Monte Carlo efficiencies, fake rates, χ²/NDF of the track fit, invariant-mass peaks in real data, and dE/dx separation power for electrons and pions [1].

Fully connected networks with four hidden layers and 32 neurons per layer, combined with compute-efficient ReLU activation functions, offered the best compromise between computational cost and physics performance. During Long Shutdown 3, the event-processing nodes of the ALICE online farm will be equipped with newer GPU hardware, allowing this algorithm to become the default TPC cluster-finding algorithm in Run 4.
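
To make the network size concrete, the following is a minimal NumPy sketch of a fully connected network with four hidden layers of 32 neurons and ReLU activations. Only the layer layout comes from the text. The input (a flattened 5x5 window of charges), the single keep/reject output, the He initialisation, and the function names are assumptions made for illustration, and the weights are random rather than trained.

```python
import numpy as np

def init_mlp(n_inputs, n_outputs, hidden_layers=4, width=32, seed=0):
    """Random weights for a fully connected net (4 hidden layers x 32 neurons by default)."""
    rng = np.random.default_rng(seed)
    sizes = [n_inputs] + [width] * hidden_layers + [n_outputs]
    params = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        # He initialisation, a common choice for ReLU layers (assumption).
        weights = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
        params.append((weights, np.zeros(fan_out)))
    return params

def forward(params, x):
    """Forward pass: ReLU after every hidden layer, linear output layer."""
    for weights, bias in params[:-1]:
        x = np.maximum(x @ weights + bias, 0.0)
    weights, bias = params[-1]
    return x @ weights + bias

def count_parameters(params):
    """Total number of trainable weights and biases."""
    return sum(weights.size + bias.size for weights, bias in params)

# Hypothetical input: a flattened 5x5 pad/time window of charges around a peak.
params = init_mlp(n_inputs=25, n_outputs=1)
batch = np.random.default_rng(1).random((8, 25))
logits = forward(params, batch)
keep_probability = 1.0 / (1.0 + np.exp(-logits))  # sigmoid for a keep/reject decision
print(logits.shape, count_parameters(params))
```

With these assumed input and output sizes the network has only a few thousand parameters, and each layer is a small matrix multiplication, which is the kind of workload that parallelises well on GPUs.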

Figure 1: Low-momentum looping tracks in the TPC.

AI for real-time data-quality monitoring

During the online reconstruction, rigorous Quality Control (QC) procedures must be performed in real time to ensure that the data meet the expected standards of quality and reliability. ALICE performs real-time Data Quality Monitoring (DQM) and Quality Assurance (QA) using the Online-Offline (O2) framework, which monitors detector status and performance via rule-based observables. This is complemented by 24/7 monitoring from on-shift personnel in the ALICE Control Room.

While generally effective, the current framework has limitations. Rule-based methods can miss subtle or previously unseen anomalies, and manual inspection is, by construction, inconsistent, time-consuming, and resource-intensive. Consequently, some detector issues have historically gone undetected, resulting in a silent and irreversible degradation of quality in a fraction of the data.

To overcome these limitations, a semi-supervised autoencoder-based strategy was developed to evaluate the occupancy maps of the TPC. Trained solely on healthy detector data, the model learns normal occupancy patterns and identifies anomalous behaviour through increased reconstruction errors (see Fig. 2).
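
The reconstruction-error idea can be sketched with a deliberately simple stand-in. The example below uses a linear autoencoder (PCA computed with an SVD) instead of the neural autoencoder described above, and toy one-dimensional "occupancy maps" instead of real TPC data. The map size, the noise level, the dead-region example, and the 99.9th-percentile threshold are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N_PADS = 64  # hypothetical size of a flattened occupancy map

def healthy_maps(n):
    """Toy healthy occupancy: a smooth profile with varying scale plus small noise."""
    profile = 1.0 + 0.5 * np.sin(np.linspace(0.0, np.pi, N_PADS))
    scale = rng.uniform(0.8, 1.2, size=(n, 1))
    return scale * profile + rng.normal(0.0, 0.02, size=(n, N_PADS))

def fit_linear_autoencoder(x, n_latent=2):
    """PCA as a linear autoencoder: returns the data mean and the encoder basis."""
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:n_latent]

def reconstruction_error(x, mean, basis):
    """Encode, decode, and return the mean squared error per map."""
    latent = (x - mean) @ basis.T
    decoded = latent @ basis + mean
    return ((x - decoded) ** 2).mean(axis=1)

# Train on healthy data only and derive the alert threshold from it.
train = healthy_maps(500)
mean, basis = fit_linear_autoencoder(train)
threshold = np.quantile(reconstruction_error(train, mean, basis), 0.999)

# Unseen healthy maps, and one copy with a simulated dead region.
test_ok = healthy_maps(200)
test_bad = test_ok[:1].copy()
test_bad[0, 20:30] = 0.0

err_ok = reconstruction_error(test_ok, mean, basis)
err_bad = reconstruction_error(test_bad, mean, basis)
print(f"threshold={threshold:.5f} healthy={err_ok.mean():.5f} dead_region={err_bad[0]:.5f}")
```

Because the model only learns to reproduce healthy patterns, the map with the dead region is reconstructed poorly and its error rises well above the threshold, without the model ever having seen an example of that failure.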

Figure 2: AI-driven anomaly detection on TPC occupancy maps.

To enable multi-class classification, the autoencoder was also used as a feature-extraction backbone, with the resulting reconstruction-error maps processed by a supervised convolutional classification head.

The model achieved over 99% precision and recall on test data while distinguishing anomaly classes currently invisible to the existing framework. The framework and inference pipeline have been deployed in production to extend the QC framework during the last heavy-ion period of Run 3. The generated predictions will be available to shifters for real-time monitoring and validation on unseen detector data under operational conditions [2].

Future developments include expanding anomaly detection to a broader range of detector observables and investigating AI agents for online and offline operational support.

AI for online operations and expert support

During data-taking campaigns, ALICE performs a first data reduction and full reconstruction of a fraction of the data using the O2 software stack. The systems involved produce terabytes of raw telemetry per day, including metrics and logs from software and hardware.

In Run 3, ALICE started building the foundations for the systematic exploration of AI and machine-learning techniques in online operations. This work established the first data pipelines, observability practices, and operational datasets needed to analyse the data acquisition system at scale.

Anomaly-detection algorithms were explored to analyse rates, latencies, errors, and resource usage, showing promising results in spotting meaningful deviations from average behaviour and raising alerts accordingly. Figure 3 shows an overview of the anomalies detected in the log severity of the ALICE Experiment Control system. This approach aligns well with unsupervised methodologies, particularly where certain failure patterns remain unidentified or lack historical examples.
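
As an illustration of flagging deviations from average behaviour without labelled failures, the sketch below applies a trailing-window z-score to a stream of counts. It is a generic baseline, not the algorithm used by ALICE. The function name, the window length, the threshold, and the sample data are all hypothetical.

```python
from collections import deque
from statistics import mean, stdev

def rolling_zscore_alerts(values, window=20, threshold=4.0):
    """Return indices of points that deviate strongly from the trailing-window average."""
    history = deque(maxlen=window)
    alerts = []
    for index, value in enumerate(values):
        if len(history) == window:
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                alerts.append(index)
                continue  # keep the outlier out of the baseline
        history.append(value)
    return alerts

# Hypothetical per-minute counts of error-severity log messages, with one burst.
error_counts = [5, 6, 4, 5, 7, 5, 6, 4, 5, 6] * 3 + [40] + [5, 6, 4, 5, 7]
print(rolling_zscore_alerts(error_counts))  # -> [30]
```

The baseline is learned from the stream itself, so no examples of past failures are needed. Excluding flagged points from the window keeps a single burst from inflating the baseline and masking later ones.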

In Run 4, ALICE plans to extend and refine these techniques to automatically cross-correlate anomalies from different sources, including detector hardware, networks, software, and computing infrastructure. The objective is to group related symptoms, suppress noise, estimate severity, and indicate which subsystems are likely involved. This would provide a more context-aware alerting system and accelerate expert diagnosis.
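
One simple way to picture the grouping step is to cluster anomalies by time proximity and summarise each cluster. The sketch below does this under stated assumptions: the `Anomaly` record, the source names, the 30-second gap, and the rule that a group needs at least two distinct sources are hypothetical and do not describe the planned ALICE system.

```python
from dataclasses import dataclass

@dataclass
class Anomaly:
    timestamp: float  # seconds since an arbitrary reference
    source: str       # subsystem that raised the anomaly
    severity: int     # larger means more severe

def group_anomalies(anomalies, max_gap=30.0):
    """Group anomalies whose timestamps lie within max_gap of the previous one."""
    groups = []
    for anomaly in sorted(anomalies, key=lambda a: a.timestamp):
        if groups and anomaly.timestamp - groups[-1][-1].timestamp <= max_gap:
            groups[-1].append(anomaly)
        else:
            groups.append([anomaly])
    return groups

def summarise(group):
    """Report when the group started, which subsystems are involved, and the worst severity."""
    return {
        "start": group[0].timestamp,
        "sources": sorted({a.source for a in group}),
        "severity": max(a.severity for a in group),
    }

def incidents(anomalies, max_gap=30.0, min_sources=2):
    """Keep groups seen by several sources; isolated single-source blips count as noise."""
    summaries = [summarise(g) for g in group_anomalies(anomalies, max_gap)]
    return [s for s in summaries if len(s["sources"]) >= min_sources]

events = [
    Anomaly(100.0, "network", 2),
    Anomaly(112.0, "readout-software", 3),
    Anomaly(125.0, "detector-hardware", 1),
    Anomaly(900.0, "network", 1),
]
print(incidents(events))
```

Here the three anomalies between 100 s and 125 s are reported as one incident involving three subsystems, while the isolated network anomaly at 900 s is suppressed. A production system would likely need more than time proximity, for example knowledge of which subsystems depend on each other.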

The goal is to use AI to narrow the gap between the specialised knowledge embedded in complex acquisition systems and what on-call experts and shifters can readily act on. By making operational context more accessible, searchable, and actionable, AI can help experts intervene faster, with broader understanding and less dependence on undocumented institutional memory.

When failures occur, automated tools could propose likely causes and relevant evidence based on correlated anomalies, producing LLM-generated reports that provide experts with a faster overview while preserving human judgment.

Figure 3: Summary of the detected anomalies based on the log severity of the ALICE Experiment Control system.

References

[1] C. Sonnabend, Neural network cluster finding for the ALICE TPC online GPU processing, PhD thesis, Heidelberg University, Heidelberg, Germany (2026).

[2] Z. Sourpi et al., Towards AI-Driven Automation for Data Quality Monitoring in ALICE, oral presentation at CHEP 2026, Bangkok, Thailand (2026).