Substantial improvements to the current experiments at the LHC are underway, and new experiments are being proposed or discussed at future energy-frontier accelerators to answer fundamental questions in particle physics. At future hadron colliders, complex silicon vertex trackers (3D and 4D) and highly granular calorimeters must operate in an unprecedentedly challenging experimental environment; moreover, the real-time event selection will pose even greater challenges.
Experimental computing infrastructure used to rely on industry to deliver an exponential increase in processor performance per unit cost over time. At this stage, the gain in microprocessor performance came mainly from increasing clock frequencies, along with other improvements in computer architecture.
Applications' performance doubled every 18 months without having to redesign the software or change the source code. To sustain this trend, the size of transistors had to be halved every 18 months. However, in the early 2000s, the layer of silicon dioxide insulating the transistor's gate from the channels through which current flows was just five atoms thick and could not be shrunk any further. Processor evolution therefore shifted towards an increasing number of independent, parallel processing units. Today, scaling performance across processor generations can be achieved only via application-level parallelism and by exploiting dedicated architectures specifically designed for particular tasks.
Heterogeneous computing system: a host, usually with multiple CPU cores and its own memory, connected through a bus to one or more accelerator devices, each with its own memory.
As an example, in 2008 the CERN EP department started a dedicated R&D programme, since failing to adapt would have had severe consequences for the long-term evolution of the LHC programme and future initiatives, driven by unsustainable costs for software and computing [R&DMulticore].
Heterogeneous computing is the strategy of deploying multiple types of processing elements within a single workflow and allowing each to perform the tasks to which it is best suited. This approach extends the scope of conventional microprocessor architectures, taking advantage of their flexibility to run serial algorithms and control flow structures, while leveraging specialized processors to accelerate the most complex operations hundreds of times faster than what general-purpose processors can achieve.
There exist accelerators dedicated to random number generation, compression and decompression, encryption and decryption, matching of regular expressions, and decoding of video and audio streams. The accelerator that, more than any other, has become ubiquitous in High-Performance Computing and industry is the Graphics Processing Unit (GPU).
Traditional CPU and GPU architectures are based on very different design philosophies. CPUs have a latency-oriented design: a small number of very flexible arithmetic logic units provides the user with the result of a flow of execution in a short amount of time.
On the other hand, the design of GPUs has been shaped over the years by the video game industry. GPUs have vector processing capabilities that enable them to perform parallel operations on very large sets of data, with much lower power consumption than the serial processing of similar data sets on CPUs. For this reason, GPU design is referred to as throughput-oriented.
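The contrast between the two philosophies can be sketched in a few lines of C++. The toy code below (illustrative only, not benchmark-quality) expresses the same element-wise operation first as a single serial loop, the style a latency-oriented CPU core excels at, and then as a data-parallel map over independent chunks, the pattern that throughput-oriented hardware accelerates; here ordinary threads stand in for the thousands of GPU lanes.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Serial version: one long loop, executed by a single flow of control.
std::vector<float> scale_serial(const std::vector<float>& in, float k) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = k * in[i];
    return out;
}

// Data-parallel version: every element is independent, so the work can be
// split across workers. On a GPU each element would map to its own thread.
std::vector<float> scale_parallel(const std::vector<float>& in, float k,
                                  unsigned workers = 4) {
    std::vector<float> out(in.size());
    std::vector<std::thread> pool;
    const std::size_t chunk = (in.size() + workers - 1) / workers;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&, w] {
            const std::size_t begin = w * chunk;
            const std::size_t end = std::min(in.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i)
                out[i] = k * in[i];  // no dependency between iterations
        });
    }
    for (auto& t : pool) t.join();
    return out;
}
```

Both functions produce identical results; the difference is only in how the independent work is mapped onto hardware.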
Heterogeneous Computing at CERN
Today the experiments' online and offline computing infrastructures are facing the following challenges:
reducing power consumption and cooling costs;
reconstructing, simulating and managing ever-expanding volumes of data;
exploiting efficiently national supercomputing resources, in which up to 95% of the processing power comes from GPUs.
Heterogeneous computing could help in addressing these challenges, but before adopting this paradigm shift, some preparation is required.
Today, reconstruction and simulation core algorithms are typically written in C++, for performance, and configured in Python, for flexibility. All the core algorithms are plugged together in large frameworks that schedule their execution based on satisfied data dependencies. Some of these frameworks are able to schedule algorithms in parallel, in order to maximize hardware resource utilization and event throughput.
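The dependency-driven scheduling described above can be sketched as follows. This is a hypothetical miniature, loosely inspired by how frameworks such as Gaudi or CMSSW order their modules; the `Algorithm` and `schedule` names are illustrative, not real framework APIs, and a real scheduler would run ready algorithms concurrently rather than one by one.

```cpp
#include <functional>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// An algorithm declares which data products it consumes and produces.
struct Algorithm {
    std::string name;
    std::vector<std::string> consumes;  // inputs required before running
    std::vector<std::string> produces;  // outputs made available afterwards
    std::function<void()> run;
};

// Repeatedly run every algorithm whose inputs are all available,
// until the list is empty or no further progress is possible.
std::vector<std::string> schedule(std::vector<Algorithm> algos) {
    std::set<std::string> available;       // data products produced so far
    std::vector<std::string> executionOrder;
    while (!algos.empty()) {
        bool progressed = false;
        for (auto it = algos.begin(); it != algos.end();) {
            bool ready = true;
            for (const auto& dep : it->consumes)
                if (!available.count(dep)) { ready = false; break; }
            if (ready) {
                it->run();
                executionOrder.push_back(it->name);
                for (const auto& out : it->produces) available.insert(out);
                it = algos.erase(it);
                progressed = true;
            } else {
                ++it;
            }
        }
        if (!progressed)
            throw std::runtime_error("unsatisfiable data dependencies");
    }
    return executionOrder;
}
```

Given, say, a decoding step producing raw data, a clustering step consuming it, and a tracking step consuming the clusters, the scheduler discovers the correct execution order regardless of how the algorithms were declared.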
Since many applications include both algorithms that could benefit from acceleration and code that is better suited for conventional processing, no one type of processor is best for all computations: heterogeneous processing allows exploiting the best processor type for each operation within a given application, provided that the underlying software framework is able to support and schedule them.
In order to program GPUs in C++, a variety of libraries has been developed over the last decade: e.g. CUDA, HIP, OpenCL, SYCL. To harness the maximum throughput that such a device has to offer, algorithms and data structures often have to be redesigned, e.g. by reducing the number of branches in algorithms or by moving from Array-of-Structures to Structure-of-Arrays data formats. On the other hand, once a program has been implemented to execute on massively parallel architectures, reusing the newly designed data structures and algorithms on traditional CPUs could also bring performance benefits. Making software able to execute on both GPUs and CPUs with reasonable performance is often a requirement, especially if the software runs on the WLCG, where some machines might have a GPU installed while others might not. For this reason, R&D efforts are ongoing on performance portability libraries such as Alpaka [Alpaka], Kokkos [Kokkos] and OneAPI [OneAPI]. Provided that the starting code exposes parallelism, these libraries provide an interface that hides the back-end implementation, allowing the same source code to be compiled for multiple architectures and to execute with good performance.
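The Array-of-Structures to Structure-of-Arrays redesign mentioned above can be illustrated with a minimal sketch. The `HitAoS`/`HitsSoA` types below are hypothetical, not taken from any experiment's code: the point is only the memory layout, since in the SoA form each field is contiguous, so consecutive GPU threads reading `x[i]`, `x[i+1]`, ... issue coalesced memory accesses instead of fetching whole interleaved structs.

```cpp
#include <vector>

// Array-of-Structures: natural for object-oriented CPU code, but a thread
// that needs only the x coordinates still drags y and z through the cache.
struct HitAoS { float x, y, z; };
using HitsAoS = std::vector<HitAoS>;

// Structure-of-Arrays: one contiguous array per field, friendly to both
// GPU memory coalescing and CPU vectorization.
struct HitsSoA {
    std::vector<float> x, y, z;

    // AoS -> SoA conversion.
    explicit HitsSoA(const HitsAoS& aos) {
        x.reserve(aos.size());
        y.reserve(aos.size());
        z.reserve(aos.size());
        for (const auto& h : aos) {
            x.push_back(h.x);
            y.push_back(h.y);
            z.push_back(h.z);
        }
    }
};
```

A kernel operating only on one coordinate then touches a single dense array, which is exactly the access pattern massively parallel hardware rewards.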
Experiments’ trigger farms are isolated and controlled environments. This feature makes them a fertile ground for software optimization and adoption of new technologies in order to retain better physics selection in an environment with latency and throughput constraints.
ALICE operated a prototype for GPU tracking in the High Level Trigger during LHC Run 1. In LHC Run 2 it performed the full tracking of the Time Projection Chamber (TPC) on GPUs in the High Level Trigger farm, at around 1000 Hz of Pb-Pb collisions. For LHC Run 3, the increased collision rate (50 kHz) requires a novel approach: instead of selecting data with triggering techniques, the full raw data are processed in the online computing farm in software.
ALICE will record minimum bias Pb-Pb collisions in continuous read out, corresponding to 3.5 TByte/s of raw data entering the computing farm. This increase in event rate cannot be handled by simple scaling of the online computing farm using traditional approaches.
Because of its GPU experience from Runs 1 and 2, ALICE considered GPUs for the backbone of online data processing in Run 3 [ALICE Online Offline computing TDR]. The dominant part of the real-time processing will be the tracking and data compression of the TPC, which will run fully on GPUs. On top of this baseline scenario, ALICE plans to employ the GPUs of the computing farm also for the asynchronous reconstruction, when there is no beam in the LHC. The asynchronous reconstruction produces the final, calibrated reconstruction output and runs many more algorithms. ALICE is working to identify the computational hot spots and port promising candidates to GPUs, in order to use the farm optimally across the different modes of operation and workflows. The ALICE GPU code is written in a generic way: a common source code targets traditional processors as well as GPUs of different vendors, using different APIs such as OpenCL, CUDA and HIP.
The ATLAS collaboration is actively investigating ways of making maximal use of new and heterogeneous computing resources. In the past year, much of the work went into evaluating existing and upcoming options for writing experiment software that can run on non-CPU back-ends. Many significant developments have happened in this area since the last of the big tests that ATLAS made with GPUs in its trigger project during LHC's Long Shutdown 1 (LS1). While investigations into the various ways of writing accelerated software continue, in a collaboration between trigger and offline software developers, ATLAS will revive the code written during LS1, now using the latest programming techniques, to test how the latest hardware handles the calculations needed to quickly reconstruct ATLAS data.
The CMS CERN team set up the Patatrack software R&D incubator in 2016, in order to create fertile ground for disruptive ideas and to maximize the impact that young scientists can have on the experiment's physics reach. From day one, the Patatrack incubator has worked in tight collaboration with CERN openlab, CERN IdeaSquare, industrial partners and universities.
Patatrack helps new ideas go beyond the proof-of-concept stage and enter production. Most of the ideas are explored during hackathons, held three times per year. These create unique opportunities for scientists with different backgrounds and domain knowledge to work together, understand each other's problems and converge quickly on the best possible solution.
Group photo of the 7th Patatrack Hackathon held at CERN in September 2019.
The main R&D line pursued by the Patatrack team is the heterogeneous online reconstruction of charged-particle trajectories in the Pixel Detector, starting from LHC Run 3. In 2018, the team demonstrated that GPUs can be exploited efficiently from within CMSSW, the offline and online CMS reconstruction software. New algorithms were developed, showing that a small GPU like an NVIDIA T4 can deliver almost the same event throughput as two full HLT nodes (dual-socket Intel Xeon Gold 6130), at a fraction of the cost of a single node, while producing equal or better physics performance. The choice of such low-power, low-profile GPUs makes it possible to deploy them on existing nodes as well as on newly acquired machines. Through collaborations with the CERN IT department and CERN openlab, the Patatrack team is investigating the exploitation of High-Performance Computing resources and participating in the definition of benchmarks to ease the procurement of heterogeneous hardware.
During 2020, the CMS experiment will continue investigating performance portability strategies, to avoid code duplication and make the code easier to maintain, test and validate.
LHCb activities in heterogeneous computing have concentrated on the development of the all-software trigger for data taking in Run 3 and beyond. The upgraded LHCb experiment will use a triggerless readout system collecting data at an event rate of 30 MHz. A software-only High Level Trigger will enable unprecedented flexibility for trigger selections. During the first stage (HLT1), a subset of the full offline track reconstruction for charged particles is run to select particles of interest based on single- or two-track selections. Track reconstruction at 30 MHz represents a significant computing challenge, requiring an evaluation of the most suitable hardware as well as algorithms optimized for this hardware.
In this context, the Allen R&D project started in 2018 to explore the approach of executing the full HLT1 on GPUs. This includes decoding the raw data, clustering of hits, pattern recognition and track fitting, as well as ghost-track rejection with machine-learning techniques and, finally, event selections. Algorithms optimized for many-core architectures were developed and integrated in a compact, modular and scalable framework. Both the physics performance and the event throughput of the entire HLT1 application running on GPUs are adequate, such that this architecture is being considered as an alternative to the baseline Run 3 architecture running on x86 processors. Integration tests are currently ongoing to further validate this approach.
In the same context of the software trigger for the LHCb upgrade, R&D studies are being performed on fast pre-processing of data on dedicated FPGAs, namely producing, in real time, sorted collections of hits in the VELO detector. These pre-processed data can then be used as seeds by the High-Level Trigger (HLT) farm to find tracks for the first-level trigger with much lower computational effort than is possible starting from the raw detector data, thus freeing an important fraction of the CPU farm's power for higher-level processing tasks. While the full VELO tracking on FPGA systems, based on the extremely parallelized Retina algorithm, is not yet considered ready for Run 3, the clustering of VELO hits on FPGAs is close to being adopted as a baseline for Run 3 data taking.
At experiments looking for ultra-rare events, like NA62, event selection at trigger level is of paramount importance. A project to employ GPUs at the first level of the event selection was started to study the feasibility of an approach based on high-level programming in places where embedded software and hardware are usually employed for latency reasons.
The NA62 RICH detector is important for defining the time of arrival of particles. The real-time reconstruction of Cherenkov rings on GPUs was the first demonstrator of this heterogeneous pipeline. The event rate is about 10 MHz and the maximum allowed latency at this level of the trigger is 1 ms. The demonstrator was successful thanks to the development of a Network Interface Card, NaNet-10 [Nanet], that enables direct data transfer (RDMA) from the acquisition buffers to GPU memory, thereby decreasing the latency of data transmission. NaNet-10 uses a 10 Gbit link together with a PCI Express 3 connection to the host machine. During the 2018 data taking, the heterogeneous system, featuring an NVIDIA P100, processed event data coming from four RICH readout boards (TEL62) [TEL62 and NA62 Trigger]. It achieved a maximum latency of 260 µs, with an average latency per event of 130 ns. NA62 will employ this heterogeneous system in production for leptonic triggers during the data taking after this long shutdown.
All these efforts demonstrate the recognition that the heterogeneous computing paradigm can increase the physics reach of experiments by improving trigger efficiency, and can harness the computing facilities of High-Performance Computing centres and industry across the world. Heterogeneous computing calls for a paradigm shift, driven by the challenges of the High-Luminosity LHC and future experiments, to help maximize scientific returns.
[ALICE Online Offline computing TDR for Run 3]: https://cds.cern.ch/record/2011297
[TEL62 and NA62 Trigger]: Nucl. Instrum. Meth. A 929 (2019) 1-22
[Nanet]: J. Phys. Conf. Ser. 1085 (2018); PoS TWEPP2018 (2019) 118