PoS - Proceedings of Science
Volume 314 - The European Physical Society Conference on High Energy Physics (EPS-HEP2017) - Detector R&D and Data Handling (Parallel Session). Conveners: Paula Collins; Katja Krüger. Scientific Secretary: Enrico Conti.
CosmoHub and SciPIC: Massive cosmological data analysis, distribution and generation using a Big Data platform
J. Carretero*, P. Tallada, J. Casals, M. Caubet, F. Castander, L. Blot, A. Alarcón, S. Serrano, P. Fosalba, C. Acosta-Silva, N. Tonello, F. Torradeflot, M. Eriksen, C. Neissner and M. Delfino
Full text: pdf
Pre-published on: November 03, 2017
Published on: March 20, 2018
Abstract
Galaxy surveys require support from massive datasets in order to achieve precise estimations of cosmological parameters. The CosmoHub platform (https://cosmohub.pic.es), a web portal for interactive analysis of massive cosmological data, and the SciPIC pipeline have been developed at the Port d'Informació Científica (PIC) to provide this support, achieving near-interactive performance in the processing of multi-terabyte datasets. Cosmology projects currently supported include the European Space Agency's Euclid space mission, the Dark Energy Survey (DES), the Physics of the Accelerating Universe (PAU) survey and the Marenostrum Institut de Ciències de l'Espai Simulations (MICE). Support for additional projects can be added as needed.

CosmoHub enables users to interactively explore and distribute data without any SQL knowledge. It is built on top of Apache Hive, part of the Apache Hadoop ecosystem, which facilitates reading, writing and managing large datasets. More than 50 billion objects are available, drawn from public and private data and from both observations and simulations. Over 500 registered scientists have produced about 2000 custom catalogs, occupying 10 TiB in compressed format, over the last three years. All these datasets can be interactively explored using an integrated visualization tool. The current implementation completes an interactive analysis of a 1.1 billion object dataset in 45 seconds.

The SciPIC scientific pipeline has been developed to efficiently generate mock galaxy catalogs using a dark matter halo population as input. It runs on top of the Hadoop platform using Apache Spark, an open-source cluster-computing framework. The pipeline is currently being calibrated to populate the full-sky Flagship dark matter halo catalog produced by the University of Zürich, which contains about 44 billion dark matter haloes in a box of side 3.78 Gpc/h. The resulting mock galaxy catalog is stored directly in the CosmoHub platform.
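To make the Hive-backed workflow concrete, the following is a minimal sketch of the kind of catalog subsetting CosmoHub automates behind its web interface, written with PySpark against a Hive metastore. The table and column names (cosmohub.mice_galaxies, mag_i, z_obs) are hypothetical, since the abstract does not give the actual schema.

# A minimal sketch of Hive-backed catalog subsetting, similar in spirit to the
# queries CosmoHub generates for its users. Table/column names are invented.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cosmohub-style-query")
         .enableHiveSupport()   # read tables registered in the Hive metastore
         .getOrCreate())

# Build a magnitude-limited custom catalog, the kind of subsample CosmoHub
# users create interactively without writing SQL themselves.
subsample = spark.sql("""
    SELECT ra, dec, z_obs, mag_i
    FROM cosmohub.mice_galaxies   -- hypothetical table name
    WHERE mag_i < 22.5
""")
subsample.write.mode("overwrite").parquet("mice_subsample.parquet")

Likewise, the core task of SciPIC, populating a dark matter halo catalog with galaxies on Spark, can be sketched as below. The halo schema and the mass-to-luminosity recipe are placeholders for illustration, not the calibrated relations used for the Flagship catalog.

# A toy sketch of SciPIC-style halo population with Apache Spark. The halo
# schema and the luminosity power law are assumptions made for illustration.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scipic-style-population").getOrCreate()

# A tiny stand-in for a dark matter halo catalog: id, mass (Msun/h), position.
halos = spark.createDataFrame(
    [(1, 1.0e13, 10.0, 20.0, 5.0),
     (2, 5.0e14, 11.0, 21.0, 8.0)],
    ["halo_id", "m_halo", "x", "y", "z"],
)

# Assign a central galaxy per halo, with a luminosity that scales with halo
# mass (placeholder power law); the galaxy inherits the halo position.
galaxies = halos.withColumn("log_lum", 10.0 + 0.3 * (F.log10("m_halo") - 13.0))
galaxies.show()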
DOI: https://doi.org/10.22323/1.314.0488
Open Access
Copyright owned by the author(s) under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.