Event Tokenization and Next-Token Prediction for Anomaly Detection at the LHC

Visive, Ambre; Ruiz de Austri, Roberto; Moskvitina, Polina; Nellist, Clara; Caron, Sascha

Abstract

Advances in Machine Learning, particularly Large Language Models (LLMs), enable more efficient interaction with complex datasets through tokenization and next- or masked-token prediction, providing a novel framework for analysing high-energy physics datasets. We explore strategies for representing particle physics data as token sequences, enabling LLM-inspired models to learn event distributions and detect anomalies in proton-proton collisions at the Large Hadron Collider (LHC). By training solely on background events, the model reconstructs expected physics processes, learning properties of the given Standard Model (SM) processes. Deviations in reconstruction scores during inference flag anomalous events, providing a data-driven approach to identify rare signatures or physics beyond the Standard Model (BSM). The method is tested using simulated LHC Run 2 ($\sqrt{s} = 13~\text{TeV}$) proton-proton collision data from the Dark Machines Collaboration, replicating ATLAS conditions, focusing on SM and BSM four-top-quark final states. These tokenization strategies enable anomaly detection and suggest a path toward foundation models for the LHC and beyond, integrating state-of-the-art ML with physics principles to advance adaptive, data-driven searches for new physics.
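
The idea of tokenizing events and scoring them by next-token prediction can be illustrated with a minimal sketch. The abstract does not specify the tokenization scheme or the model architecture, so everything below is an assumption made for illustration: the bin edges, the token format, the toy events, and the names `tokenize_event` and `BigramModel` are hypothetical, and a smoothed bigram counter stands in for the LLM-inspired network. The anomaly score is the mean negative log-likelihood per predicted token, computed by a model fitted on background events only.

```python
import math
from collections import Counter

import numpy as np

# Hypothetical bin edges (pT in GeV, pseudorapidity); not taken from the paper.
PT_EDGES = [50, 100, 200]
ETA_EDGES = [-1.5, 0.0, 1.5]


def tokenize_event(objects):
    """Turn a list of (type, pt, eta) objects into a sequence of discrete tokens."""
    tokens = ["<bos>"]
    for obj_type, pt, eta in objects:
        pt_bin = int(np.digitize(pt, PT_EDGES))
        eta_bin = int(np.digitize(eta, ETA_EDGES))
        tokens.append(f"{obj_type}_pt{pt_bin}_eta{eta_bin}")
    tokens.append("<eos>")
    return tokens


class BigramModel:
    """Toy next-token predictor with additive smoothing (stand-in for a transformer)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.pair_counts = Counter()
        self.context_counts = Counter()
        self.vocab = set()

    def fit(self, sequences):
        for seq in sequences:
            self.vocab.update(seq)
            for prev, nxt in zip(seq, seq[1:]):
                self.pair_counts[(prev, nxt)] += 1
                self.context_counts[prev] += 1
        return self

    def anomaly_score(self, seq):
        """Mean negative log-likelihood per predicted token; higher is more anomalous."""
        v = len(self.vocab) + 1  # one extra slot for a generic unseen token
        nll = 0.0
        for prev, nxt in zip(seq, seq[1:]):
            num = self.pair_counts[(prev, nxt)] + self.alpha
            den = self.context_counts[prev] + self.alpha * v
            nll -= math.log(num / den)
        return nll / (len(seq) - 1)


# Toy "background" events: two jets plus one lepton (illustrative values only).
background = [
    [("j", 60, 0.3), ("j", 40, -0.5), ("e", 30, 1.0)],
    [("j", 70, 0.8), ("j", 45, -1.0), ("e", 25, 0.2)],
    [("j", 55, 0.1), ("j", 30, -0.2), ("m", 35, 0.5)],
]
# Toy "signal-like" event: four hard b-jets, never seen during training.
signal = [("b", 250, 0.1), ("b", 220, -0.3), ("b", 150, 2.0), ("b", 120, -2.0)]

model = BigramModel().fit([tokenize_event(ev) for ev in background])
background_score = model.anomaly_score(tokenize_event(background[0]))
signal_score = model.anomaly_score(tokenize_event(signal))
print(f"background: {background_score:.3f}  signal: {signal_score:.3f}")
```

In this toy setup the signal-like event receives the higher score because none of its tokens or transitions were seen during training. A realistic study would replace the bigram counter with a trained sequence model and choose the binning and object ordering with care.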