Computing operations at the Large Hadron Collider (LHC) at CERN rely on the Worldwide LHC Computing Grid (WLCG) infrastructure, designed to allow efficient storage, access, and processing of data at the pre-exascale level. A close and detailed study of the computing systems exploited for the LHC physics mission is an increasingly crucial step on the roadmap of High Energy Physics (HEP) towards the exascale regime.
In this context, over the last few years the Compact Muon Solenoid (CMS) experiment has been collecting and storing a large set of heterogeneous non-collision data, e.g. metadata about replica placement, transfer operations, and actual user access to physics datasets. This wealth of data currently resides on a distributed Hadoop cluster and is organized so that fast, arbitrary queries with the Spark analytics framework are a viable approach to Big Data mining efforts. Following a data-driven approach to the analysis of this metadata, which originates from several CMS computing services such as DBS (Data Bookkeeping Service) and MCM (Monte Carlo Management system), we started to focus on data storage and data access over the WLCG infrastructure, and we drafted an embryonic software toolkit to investigate recurrent patterns and provide indicators of physics dataset popularity. In the long term, this work aims to contribute to the overall design of a predictive/adaptive system that would eventually reduce the cost and complexity of CMS computing operations, while taking into account the stringent requirements of the physics analysis community.
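To make the kind of query involved more concrete, the sketch below shows a minimal PySpark aggregation of dataset access records into simple popularity indicators (accesses and distinct users per dataset per week). It is only an illustration of the approach: the HDFS path and the column names (access_ts, dataset_name, user_dn) are hypothetical placeholders, not the actual schema exposed by the CMS monitoring services.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataset-popularity-sketch").getOrCreate()

# Hypothetical table of per-access records exported to HDFS
# (one row per user access to a physics dataset replica).
accesses = spark.read.parquet("hdfs:///path/to/access_records")  # placeholder path

# Aggregate accesses per dataset and per week to obtain simple
# popularity indicators: total accesses and distinct users.
popularity = (
    accesses
    .withColumn("week", F.date_trunc("week", F.col("access_ts")))
    .groupBy("dataset_name", "week")
    .agg(
        F.count("*").alias("n_accesses"),
        F.countDistinct("user_dn").alias("n_users"),
    )
    .orderBy(F.desc("n_accesses"))
)

popularity.show(20, truncate=False)
```

Indicators of this kind, computed over sliding time windows, are the basic ingredients that a predictive/adaptive replica-placement system could consume.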