Progress in Machine Learning Studies for the CMS Computing Infrastructure

Bonacorsi, Daniele; Kuznetsov, Valentin; Magini, Nicolo; Diotalevi, Tommaso; Repečka, Aurimas; Matonis, Žygimantas; Kančis, Kipras

doi:10.22323/1.293.0023

Abstract

Tens of petabytes of collision and simulated data have been collected and distributed across WLCG sites during Run-1 and Run-2 at the LHC. Low latency in transfers among dozens of computing centres is crucial for efficient use of the computing resources. Although the desired level of throughput has, on average, been achieved to serve the LHC physics programs, it is not uncommon to observe transfer latencies arising from a wide variety of causes, from file corruption to site issues, most of which require operator intervention. To improve on this front, the CMS experiment equipped the PhEDEx dataset replication system with a system to collect latency data and a mechanism to categorise and analyse them promptly, so that they can be matched to quick and focussed operator intervention. The transfer latency data have also been the target of Machine Learning techniques, already used in CMS to study and predict dataset popularity; preliminary results from this work in progress, in terms of the predictive potential of the approach for both applications, will be presented and discussed.