Data centers house the IT and physical infrastructure that allows researchers to transmit, process, and exchange data, and they provide resources and services with a high level of reliability. Infrastructure observability platforms give access to data describing the status of the data center, and analysing these data makes it possible to predict events of interest.
Over the last few years, at the main data processing and computing technology research center of the Italian Institute for Nuclear Physics, we have carried out a set of studies based on service log files and machine metrics to identify anomalies and define alarm signals. In the present work we aim to validate those studies by considering critical scenarios and by extending the range and type of monitoring data. Using principal component analysis, clustering techniques, and statistical anomaly detection, we have achieved faster, almost real-time detection of anomalies that takes the record of past events into account.
In addition, the relationship between the identified anomalies and the threshold-risk values will be assessed and presented as a dynamic risk level to be used for predictive maintenance management.
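
As a rough illustration of the kind of statistical anomaly detection on machine metrics mentioned above, the following minimal sketch flags points that deviate strongly from a rolling window of past values. It is not the method used in this work: the function name `rolling_zscore_anomalies`, the window size, the z-score threshold, and the sample CPU-load values are all assumptions chosen for the example.

```python
from collections import deque
from statistics import mean, stdev


def rolling_zscore_anomalies(values, window=5, threshold=3.0):
    """Return indices of points whose z-score, computed against the
    preceding `window` observations, exceeds `threshold`.

    Hypothetical helper for illustration only; window and threshold
    are assumed values, not parameters taken from the study.
    """
    history = deque(maxlen=window)  # holds only the most recent points
    anomalies = []
    for i, x in enumerate(values):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)  # sample standard deviation
            # Skip the test when the window is constant (sigma == 0).
            if sigma > 0 and abs(x - mu) / sigma > threshold:
                anomalies.append(i)
        history.append(x)
    return anomalies


# Made-up CPU load samples with one obvious spike at index 6.
cpu_load = [0.50, 0.52, 0.49, 0.51, 0.50, 0.53, 0.95, 0.51, 0.50]
print(rolling_zscore_anomalies(cpu_load))  # [6]
```

Because each point is compared only with earlier observations, a check of this kind can run as metrics arrive. Note that in this simple version a flagged value still enters the window and inflates the standard deviation for the next few points; a production detector would likely exclude or down-weight it.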