How to search for new physics without knowing about it?

Nuno Filipe Castro, Miguel Crispim Romão, Rute Costa Batalha Pedro (LIP) 19th Aug 2020


BSM searches at the LHC increasingly profit from ML techniques, which allow more general searches that minimise the probability of missing evidence of new physics. Very recently, the ATLAS collaboration published a novel approach [1], based on weakly supervised ML techniques, to search for BSM signatures in fully hadronic final states. CMS has also presented first results from the Model Unspecific Search in CMS (MUSiC), a generalized model-independent approach (see box below).

A new tool for BSM physics searches at colliders, based on anomaly detection with machine-learning techniques.

While the Standard Model of Particle Physics (SM) has been extremely successful in describing the experimental data accumulated so far, a significant number of open questions remain, and the search for new phenomena is therefore a key aspect of the physics programme of present and future colliders. Given the practical difficulty of performing dedicated searches for all possible models and event topologies, inclusive searches and model-independent approaches are popular strategies for balancing the sensitivity and the generality of the experimental analyses. Nonetheless, there is always the concern that a possible signal beyond the SM (BSM) is missed simply because the adopted strategy is not sensitive to it.

The experimental collaborations have always tried to find generic approaches for searches, in order to leave no stone unturned when looking for new phenomena beyond the SM. Comprehensive generic searches have been reported regularly since the beginning of the LHC and, more recently, machine learning (ML) has become very popular in an attempt to balance sensitivity and generality.

In our work, posted on arXiv earlier this month [2], we explore anomaly detection (AD) methods as a tool to search for new physics phenomena at colliders. The AD approach relies on identifying abnormal events in a data sample consisting mostly or entirely of normal events belonging to the same class. The problem is usually addressed with unsupervised learning, using classical shallow algorithms to identify the outlier events. In deep learning, artificial neural networks such as autoencoders (AE) have found use as anomaly detectors, since the error on the reconstruction of the inputs, given by a model trained exclusively on normal events, can be interpreted as an anomaly score. A known drawback of typical shallow methods, such as One-Class Support Vector Machines (OC-SVM), is that they fail on high-dimensional data with many entries. This leads to a need for substantial feature engineering and dimensionality reduction before they can be applied. On the other hand, the deep learning architectures of the AE family deal well with high-dimensional data and perform well in anomaly detection despite not being trained specifically to discern outlier events in the data.
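
The reconstruction-error idea described above can be written compactly. The expression below is only a sketch: the symbols (an encoder E with parameters theta, a decoder D with parameters phi, N normal training events x_i) and the choice of a squared-error loss are assumptions made for illustration, not notation or details taken from the paper.

```latex
% Assumed notation: E_\theta = encoder, D_\phi = decoder, x_i = normal (SM-like) training events
(\theta^{*}, \phi^{*}) = \arg\min_{\theta,\phi} \frac{1}{N} \sum_{i=1}^{N}
    \left\lVert x_i - D_\phi\!\left(E_\theta(x_i)\right) \right\rVert^{2},
\qquad
s(x) = \left\lVert x - D_{\phi^{*}}\!\left(E_{\theta^{*}}(x)\right) \right\rVert^{2}
```

Events resembling the training sample are reconstructed well and receive a small score s(x), while events unlike anything seen in training tend to be reconstructed poorly and receive a large one.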

In our paper, we present three new unsupervised ML models for AD in the context of HEP collisions, in addition to an AE. To test their sensitivity to different BSM signals, a few BSM classes of events are used as benchmarks to assess the performance of the proposed approach, by comparing it with supervised deep neural network (DNN) classifiers trained on the same benchmarks. As such, we provide for the first time a comparison of different unsupervised AD methods in searches for new physics.

Figure 1: Two-dimensional distributions of the anomaly scores for the different AD methods, per SM class of processes. The distributions of the anomaly score per SM process are shown on the diagonal.

We observe a better performance of the deep models - AE and deep support vector data description (Deep SVDD) - when compared to the shallow ones - histogram-based outlier detection (HBOS) and isolation forest (iForest). It should be noted that only very weak correlations were observed between the outputs of the different methods, which points to some complementarity between them. Furthermore, the deep AE has a sensitivity similar to that of the supervised DNN for BSM signals with vector-like quarks. Deep SVDD yielded similar discriminating power for all BSM signals, including those more similar to the SM events. This result suggests that different AD algorithms are suitable to isolate different types of BSM events and are complementary to each other in unsupervised generic searches for new phenomena at colliders.
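
To give a flavour of how one of the shallow methods mentioned above works, here is a minimal sketch of a histogram-based outlier score in plain Python. It is not the implementation used in the paper: the function names `fit_hbos` and `hbos_score`, the number of bins, the density floor and the toy Gaussian data are all assumptions made for illustration.

```python
import math
import random


def fit_hbos(reference, n_bins=10, floor=1e-3):
    """Build one equal-width histogram per feature from reference events."""
    model = []
    for j in range(len(reference[0])):
        column = [event[j] for event in reference]
        lo, hi = min(column), max(column)
        width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant features
        counts = [0] * n_bins
        for value in column:
            counts[min(int((value - lo) / width), n_bins - 1)] += 1
        top = max(counts)
        # Scale so the tallest bin has height 1; the floor (an assumption)
        # keeps empty bins from producing log(1/0) later on.
        heights = [max(c / top, floor) for c in counts]
        model.append((lo, hi, width, heights))
    return model


def hbos_score(model, event, floor=1e-3):
    """Sum over features of log(1 / bin height); larger means more unusual."""
    score = 0.0
    for (lo, hi, width, heights), value in zip(model, event):
        if lo <= value <= hi:
            height = heights[min(int((value - lo) / width), len(heights) - 1)]
        else:
            height = floor  # values outside the reference range get the floor
        score += math.log(1.0 / height)
    return score


# Toy data: two Gaussian features standing in for "normal" events.
random.seed(0)
reference = [[random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)] for _ in range(2000)]
model = fit_hbos(reference)
typical = hbos_score(model, [0.0, 0.0])
unusual = hbos_score(model, [6.0, 6.0])
```

Because each feature is histogrammed independently, a score of this kind ignores correlations between features, which is one reason such shallow methods may behave differently from the deep ones.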

Figure 2: Distribution of the deep autoencoder output (gray shaded area). For comparison, the distributions of the expected output of the method for different BSM signals (coloured lines) are also shown.

The presented results, even if based on fast simulation and thus still to be confirmed with detailed simulation and experimental data, show that these unsupervised AD algorithms are reasonably sensitive to new signals. Relative to the supervised DNN, the upper limits on the signal strength degrade by around an order of magnitude at most in the worst cases, with no significant impact in the best ones. Interestingly, in previous work where DNNs trained on different models were used to discriminate between the background and other signals [3], we observed similar trends when training deep neural networks on signals different from those used for the classification.

CMS tunes in searches for new physics

Several theories of physics beyond the standard model (BSM) have been developed to address the inadequacies of the SM, and a wide range of parameter and phase space regions of such theoretical models are accessible to direct searches for the first time at the LHC. A large number of searches for a range of BSM signatures have been conducted by the LHC experiments.

Dedicated searches targeting specific BSM theories are often restricted in scope to a few final states that are sensitive to the particular models probed. Practical constraints on the number of such analyses mean that some models and experimental final states remain unexplored, and BSM signatures could be hidden there. Furthermore, new phenomena may exist that are not described by any of the existing models. Hence, complementary to the existing searches for specific BSM scenarios, CMS employs a generalized model-independent approach: the Model Unspecific Search in CMS (MUSiC).

MUSiC starts by counting which (known) high energy particles are produced in one collision event, for example muons, high energy photons, or bundles of hadrons, which are called jets. Figure 1 shows a CMS event display resulting from a single proton-proton collision. In this collision, two muons can be observed, shown as long red lines originating in the centre, at the proton-proton collision point. The muons traverse several muon detectors, the outer red boxes in the image. The green and blue symbols indicate low energy particles, which are not considered. This approach is appropriate in high energy collisions, where likely signatures originate from decays of new undiscovered particles, and those are mostly predicted to be massive. The collision shown in Figure 1 is called a 'two muon' event as it contains two muons. Similarly, it is possible to define hundreds of classes, like 'two electrons plus one photon' or 'one muon plus three jets'. The number of possible combinations is huge, since any number of muons, electrons, photons, or jets can be considered. This is why the MUSiC method is complementary to the searches for specific new particles that are also part of the CMS programme.
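
The classification step described above can be illustrated with a short Python sketch. It is a toy illustration rather than CMS code: the event format (a list of `(kind, energy)` pairs), the function name `event_class` and the 25 GeV threshold used to discard low energy objects are all hypothetical.

```python
from collections import Counter

# Object types considered when building a class label (order fixes the label).
OBJECT_ORDER = ("muon", "electron", "photon", "jet")


def event_class(objects, min_energy=25.0):
    """Label an event by how many selected objects of each kind it contains.

    `objects` is a list of (kind, energy) pairs; `min_energy` is a
    hypothetical threshold below which objects are ignored.
    """
    counts = Counter(
        kind for kind, energy in objects
        if kind in OBJECT_ORDER and energy >= min_energy
    )
    parts = [f"{counts[kind]} {kind}" for kind in OBJECT_ORDER if counts[kind] > 0]
    return " + ".join(parts) if parts else "empty"


# Toy events, invented for illustration only.
events = [
    [("muon", 60.0), ("muon", 45.0), ("jet", 5.0)],
    [("electron", 40.0), ("electron", 35.0), ("photon", 30.0)],
    [("muon", 50.0), ("jet", 80.0), ("jet", 60.0), ("jet", 40.0)],
    [("muon", 70.0), ("muon", 30.0)],
]

# Number of events observed in each class.
class_counts = Counter(event_class(event) for event in events)
```

In this toy sample the first and last events both end up in the same two-muon class, because the 5 GeV jet in the first event falls below the assumed threshold.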

The first step in the analysis is to compare the number of events in each class to the Standard Model prediction. This comparison is shown in graphical form in Figure 2 for some of the particle combinations, in this case those with at least two electrons plus other particles. The colours indicate the different Standard Model contributions; for example, yellow shows top pair production with a subsequent decay yielding two electrons. The agreement between measurement and theory is excellent within the uncertainties! This procedure is repeated by adding other objects, for example jets, which can also come from b quarks. In each of the selections, the event yield is predicted by the Standard Model and compared to the data. As the predictions agree within the uncertainties, there is no hint of new physics.

Figure 2: The number of expected and observed events in collisions with two electrons, with or without additional jets, including the possibility that the jets come from b quarks. The multi-coloured distribution contains all Standard Model production mechanisms that can create two electrons. The data (black points) agree with the Standard Model prediction, including its uncertainties.

But counting events with leptons and jets only scratches the surface; it is also possible to look at all these event classes in detail by studying important kinematical distributions. To do so, the energy and flight direction of each identified object are measured. An example of one of the kinematical variables is the total invariant mass, calculated from all the particles in one event. This invariant mass variable is a measure of how energetic the collision was. Figure 3 shows the mass distribution for one event class, 'two muons'. The event in Figure 1 is part of this sample. In the invariant mass distribution, the overall agreement between theory (histogram) and data (points) is excellent. To confirm this, a search algorithm automatically scans the distribution for the region with the most significant deviation between CMS data and theoretical prediction. This 'region of interest' is marked by two vertical dashed red lines. Note that adding new physics to the Standard Model can make the predicted number of events go either up or down. In this case, the Standard Model predicts slightly more events there than the data show, as can best be seen in the ratio between data and Standard Model Monte Carlo prediction at the bottom of Figure 3. Further statistical analysis also reveals that even this most substantial difference is not statistically significant.
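
Both ingredients of this step, the invariant mass and the scan for a region of interest, can be sketched in a few lines of Python. This is a simplified illustration, not the MUSiC algorithm: the four-momentum format `(E, px, py, pz)`, the function names, the assumed 10% relative systematic uncertainty and the Gaussian-like deviation measure are all assumptions, standing in for whatever statistical treatment the actual analysis uses.

```python
import math


def invariant_mass(particles):
    """Invariant mass of a set of (E, px, py, pz) four-momenta (natural units)."""
    e = sum(p[0] for p in particles)
    px = sum(p[1] for p in particles)
    py = sum(p[2] for p in particles)
    pz = sum(p[3] for p in particles)
    m2 = e * e - (px * px + py * py + pz * pz)
    return math.sqrt(max(m2, 0.0))  # guard against tiny negative rounding errors


def most_deviating_region(observed, expected, rel_syst=0.1):
    """Scan all contiguous bin ranges and return (start, stop, z) with largest |z|.

    z is a simplified deviation measure: (observed - expected) divided by the
    quadrature sum of the Poisson and an assumed relative systematic uncertainty.
    """
    best = (0, 0, 0.0)
    for start in range(len(observed)):
        obs = exp = 0.0
        for stop in range(start, len(observed)):
            obs += observed[stop]
            exp += expected[stop]
            if exp <= 0.0:
                continue
            sigma = math.sqrt(exp + (rel_syst * exp) ** 2)
            z = (obs - exp) / sigma
            if abs(z) > abs(best[2]):
                best = (start, stop, z)
    return best


# Two back-to-back massless muons with 45 GeV each (toy values).
mass = invariant_mass([(45.0, 0.0, 0.0, 45.0), (45.0, 0.0, 0.0, -45.0)])

# Toy binned mass distribution with a small deficit in bins 2 and 3.
observed = [100, 98, 80, 82, 101]
expected = [100.0, 100.0, 100.0, 100.0, 100.0]
start, stop, z = most_deviating_region(observed, expected)
```

A negative z corresponds to the situation described in the text, where the prediction lies above the data in the selected region. Since the scan picks the largest deviation among many candidate regions, its significance has to be judged with that selection in mind, which is why a further statistical analysis is needed.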

Figure 3: The two-muon invariant mass distribution observed in the 2016 CMS dataset. The multi-coloured distribution contains all Standard Model production mechanisms that can create two muons. The data (black points) agree with the Standard Model prediction. The bottom part of the plot shows the data divided by the Standard Model prediction. In the area between the red lines, the data lie slightly below the Standard Model prediction, but not significantly so.

In this manner, the CMS collaboration has investigated numerous distributions from the 2016 dataset with the MUSiC method. In some cases, the difference between theory and experiment is a bit bigger than in Figure 3. Still, altogether the MUSiC analysis found no substantial deviations, and so far CMS physicists do not see a clear signal of new physics.

This new result uses only the data collected in 2016; much more data is available from LHC Run 2, which ran until 2018. In the coming years, the LHC will produce many more proton-proton collisions, and the MUSiC algorithm is ready to find whatever nature provides.

Further Reading:

"MUSiC, a model unspecific search for new physics, in pp collisions at sqrt(s)=13 TeV" (CMS Physics Analysis S
