Volume 351 - International Symposium on Grids & Clouds 2019 (ISGC2019) - Data Management & Big Data
Towards Predictive Maintenance with Machine Learning at the INFN-CNAF computing centre
L. Giommi,* D. Bonacorsi, T. Diotalevi, L. Rinaldi, L. Morganti, A. Falabella, E. Ronchieri, A. Ceccanti, B. Martelli, S. Tisbeni
*corresponding author
Full text: pdf
Published on: 2019 November 21
Abstract
The INFN-CNAF computing center, one of the Worldwide LHC Computing Grid Tier-1 sites, is serving a large set of scientific communities, in High Energy Physics and beyond. In order to increase efficiency and to remain competitive in the long run, CNAF is launching various activities aiming at implementing a global predictive maintenance solution for the site.

This requires a site-wide effort in collecting, cleaning and structuring all possibly useful data coming from log files of the various Tier-1 services and systems, as a necessary step prior to designing machine learning based approaches for predictive maintenance.

Efficient storage systems are one of the key ingredients of Tier-1 operations. CNAF uses the StoRM service as a Grid Storage Resource Manager solution. Its operations are logged in a complex manner: the log content is deeply unstructured and hard to exploit for analytics purposes. Despite this difficulty, the StoRM logs are a precious source of information for operators (e.g. real-time monitoring and anomaly detection), for developers (e.g. debugging, service stability, code improvements) and for site managers (e.g. service optimization, storage usage efficiency, time- and money-saving ways to spot and prevent unwanted behaviors).

Building on previous experience with Big Data Analytics and Machine/Deep Learning in the CMS experiment, this work describes how the StoRM logs can be handled and parsed to extract the relevant information, how such log handling can be designed to work automatically, how to define and implement metrics to tag critical states of the service, how to correlate StoRM events with events from external services, and ultimately how to contribute to the future CNAF-wide predictive maintenance system.
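As an illustration of the parse-then-tag idea, the following minimal Python sketch turns raw log lines into structured records, computes a per-minute error fraction, and flags the minutes in which that fraction is high. It is not the method used in this work: the line layout, the sample lines, the function names (parse_line, error_fraction_per_minute, tag_critical) and the 0.75 threshold are all hypothetical, and the real StoRM log format is considerably less regular.

```python
import re
from collections import defaultdict
from datetime import datetime

# ASSUMPTION: a simplified, hypothetical line layout
#   "<YYYY-MM-DD> <HH:MM:SS> <LEVEL> <free-text message>"
# This is NOT the real StoRM log format.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<level>[A-Z]+) (?P<msg>.*)$"
)


def parse_line(line):
    """Return a structured record, or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "ts": datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S"),
        "level": m.group("level"),
        "msg": m.group("msg"),
    }


def error_fraction_per_minute(records):
    """Map each minute to the fraction of its records logged at ERROR level."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for rec in records:
        minute = rec["ts"].replace(second=0)
        totals[minute] += 1
        if rec["level"] == "ERROR":
            errors[minute] += 1
    return {minute: errors[minute] / totals[minute] for minute in totals}


def tag_critical(fractions, threshold=0.75):
    """Return, in time order, the minutes whose error fraction reaches the threshold."""
    return sorted(m for m, frac in fractions.items() if frac >= threshold)


# Made-up sample lines, for illustration only.
SAMPLE = [
    "2019-03-01 10:00:05 INFO request accepted",
    "2019-03-01 10:00:40 ERROR request failed",
    "2019-03-01 10:01:02 ERROR request failed",
    "2019-03-01 10:01:30 ERROR timeout",
    "line without any recognisable structure",
    "2019-03-01 10:02:10 INFO request accepted",
]

# Unparseable lines are skipped rather than raising an error.
records = [r for r in map(parse_line, SAMPLE) if r is not None]
fractions = error_fraction_per_minute(records)
critical_minutes = tag_critical(fractions)
```

Skipping unparseable lines keeps the sketch robust to unstructured content; a production pipeline would more likely count or store them, since they may themselves signal anomalies.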

Initial results of this activity are presented and discussed. Ongoing complementary work at the CNAF centre is also mentioned.
DOI: https://doi.org/10.22323/1.351.0003
Open Access
Copyright owned by the author(s) under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.