The need for effective handling and management of heterogeneous and possibly confidential data is growing steadily across multiple scientific domains.
PLANET (Pollution Lake ANalysis for Effective Therapy) is an INFN-funded research initiative that aims to carry out an observational study assessing a possible statistical association between environmental pollution and COVID-19 infection, symptoms and course. PLANET is built on a data-centric approach that takes into account clinical components and environmental and pollution conditions, complementing the primary data with many possible confounding factors such as population density, commuter density, socio-economic metrics and more. Beyond the scientific challenge, the main technical challenge of the project is collecting, indexing, storing and managing many types of datasets while guaranteeing FAIRness as well as compliance with the applicable regulatory frameworks, such as the General Data Protection Regulation (GDPR).
In this contribution we describe the open-source DataLake platform we developed, detailing its key features: the event-based storage system provided by MinIO, which allows automatic metadata processing; the data-ingestion pipeline implemented via Argo Workflows; the GraphQL interface for querying object metadata; and the seamless integration of the platform within a multi-user compute environment. We also show how all these frameworks are integrated in the Enhanced PrIvacy and Compliance (EPIC) Cloud partition of the INFN Cloud federation.
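
As an illustration of the event-based metadata processing mentioned above, the following minimal Python sketch shows how an ingestion step might pull basic object metadata out of an S3-style bucket notification, the format MinIO emits when an object is created. It is not the PLANET implementation: the function name `extract_metadata`, the bucket and object names, and the user-metadata keys are hypothetical, and the exact set of fields available in a real notification depends on the MinIO configuration.

```python
from urllib.parse import unquote_plus


def extract_metadata(event):
    """Collect basic object metadata from an S3-style bucket notification.

    Hypothetical helper: only records for newly created objects are kept.
    """
    entries = []
    for record in event.get("Records", []):
        # Skip anything that is not an object-creation event (e.g. removals).
        if not record.get("eventName", "").startswith("s3:ObjectCreated:"):
            continue
        obj = record["s3"]["object"]
        entries.append({
            "bucket": record["s3"]["bucket"]["name"],
            # Object keys arrive URL-encoded in S3-style notifications.
            "key": unquote_plus(obj["key"]),
            "size": obj.get("size"),
            "etag": obj.get("eTag"),
            "content_type": obj.get("contentType"),
            "user_metadata": obj.get("userMetadata", {}),
        })
    return entries


# Hand-written sample notification (all names and values are made up).
sample_event = {
    "Records": [
        {
            "eventName": "s3:ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "pollution-raw"},
                "object": {
                    "key": "air-quality/pm10+2020.csv",
                    "size": 2048,
                    "eTag": "abc123",
                    "contentType": "text/csv",
                    "userMetadata": {"X-Amz-Meta-Source": "regional-agency"},
                },
            },
        },
        {
            "eventName": "s3:ObjectRemoved:Delete",
            "s3": {
                "bucket": {"name": "pollution-raw"},
                "object": {"key": "air-quality/old.csv"},
            },
        },
    ]
}

if __name__ == "__main__":
    for entry in extract_metadata(sample_event):
        print(entry)
```

In a deployment along the lines described here, a function of this kind would run inside a workflow step triggered by the storage event, and the resulting entries would be written to the metadata index that the query interface exposes.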