Data Quality Assurance plays an important role in all high-energy physics experiments. The methods currently in use rely heavily on manual labour and the judgement of human experts. Hence, multiple attempts are being made to develop automatic solutions, especially ones based on machine learning techniques, as the core part of Data Quality Monitoring systems.
However, anomalies caused by detector malfunctions or sub-optimal data processing are difficult to enumerate a priori and occur rarely, which makes supervised classification difficult to apply. Therefore, researchers from different experiments, including ALICE and CMS, work extensively on semi-supervised and unsupervised algorithms that distinguish potential outliers without manually assigned labels.
In this contribution, we discuss several projects that aim to solve this task. Machine-learning-based solutions may provide fast and reliable data quality assurance while reducing manpower requirements. A good example of this approach is a model based on a deep autoencoder employed in the CMS experiment, which has been successfully qualified on CMS data collected during the 2016 LHC run. Tests indicate that this solution detects anomalies with high accuracy and a low fake rate when compared against the outcome of manual labelling by experts.
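The comparison against expert labels can be illustrated with a minimal sketch. It assumes that the model assigns each data unit an anomaly score (for an autoencoder, typically a reconstruction error) and that a unit is flagged when the score exceeds a threshold. The function name `evaluate`, the score values, the labels and the threshold are all hypothetical, and "fake rate" is taken here to mean the fraction of expert-labelled good units that are flagged; the CMS study may define these quantities differently.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    detection_rate: float  # fraction of expert-labelled anomalies that were flagged
    fake_rate: float       # fraction of expert-labelled good units that were flagged


def evaluate(scores, expert_is_anomalous, threshold):
    """Compare thresholded anomaly scores with expert labels (illustrative only)."""
    if len(scores) != len(expert_is_anomalous):
        raise ValueError("scores and labels must have the same length")
    flagged = [s > threshold for s in scores]
    n_bad = sum(expert_is_anomalous)
    n_good = len(expert_is_anomalous) - n_bad
    if n_bad == 0 or n_good == 0:
        raise ValueError("need at least one good and one anomalous unit")
    true_pos = sum(f and a for f, a in zip(flagged, expert_is_anomalous))
    false_pos = sum(f and not a for f, a in zip(flagged, expert_is_anomalous))
    return Rates(true_pos / n_bad, false_pos / n_good)


# Made-up scores and expert labels, for illustration only.
scores = [0.02, 0.03, 0.01, 0.40, 0.05, 0.55, 0.04, 0.30]
labels = [False, False, False, True, False, True, False, False]
rates = evaluate(scores, labels, threshold=0.25)
print(rates)
```

Scanning the threshold trades detection rate against fake rate, so a working point has to be chosen according to how costly a missed anomaly is compared with a false alarm.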
Researchers from the ALICE experiment are currently working on a similar task. They intend to perform data quality checks at much finer granularity. The current approach is limited to classifying whole runs based on manually set cut-offs on descriptive data statistics. More sophisticated machine-learning-based methods may enable more accurate data selection, at the granularity of 15-minute data acquisition periods.
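To illustrate why granularity matters, the sketch below applies the same cut-off-based check once to a whole run and once to each acquisition period within it. This is not the ALICE implementation and not a machine learning method: the monitored quantity, the cut values in `CUTS`, the helper names and the numbers are hypothetical, chosen only to show how a single bad period can cause an entire run to be rejected.

```python
from statistics import mean, pstdev

# Hypothetical manually set cut-offs on descriptive statistics: (low, high).
CUTS = {"mean": (9.0, 11.0), "std": (0.0, 2.0)}


def passes_cuts(values, cuts=CUTS):
    """Return True if the mean and standard deviation fall inside the cuts."""
    stats = {"mean": mean(values), "std": pstdev(values)}
    return all(lo <= stats[name] <= hi for name, (lo, hi) in cuts.items())


def classify_periods(periods, cuts=CUTS):
    """Apply the same cuts to each acquisition period separately."""
    return [passes_cuts(p, cuts) for p in periods]


# Made-up values of one monitored quantity, grouped by acquisition period.
periods = [
    [10.1, 9.8, 10.3, 9.9],
    [10.0, 10.2, 9.7, 10.1],
    [14.9, 15.2, 15.1, 14.8],  # period with a shifted distribution
    [10.2, 9.9, 10.0, 10.1],
]

run_values = [v for p in periods for v in p]
run_ok = passes_cuts(run_values)        # run-level decision
period_ok = classify_periods(periods)   # period-level decisions
print(run_ok, period_ok)
```

In this toy case the run-level check rejects all of the data, whereas the period-level check keeps three of the four periods.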