SPARK & PARQUET for large data volumes

Spark has become an essential tool for processing large volumes of data.

For several years now, CASD has been making Spark available to users to facilitate distributed processing on large volumes of data. The first Spark cluster was set up at CASD in 2015, bringing together 30 nodes as part of the Teralab project.

Today, most Spark projects at CASD use Python (via PySpark) or R (via Sparklyr or SparkR) together with the Parquet data format. By organizing data in columns, Parquet makes the compression it applies by default far more effective. In concrete terms, for a dataset of 730 million rows, the storage space required varies markedly with the file format (a conversion sketch follows the list below):

• CSV: 417 Gigabytes
• SAS: 311 Gigabytes
• PARQUET: 24 Gigabytes
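
To illustrate how such a conversion is typically carried out, here is a minimal PySpark sketch that reads a CSV file and rewrites it as Parquet. The file paths and application name are placeholders, not references to the dataset above, and the gains observed will depend on the data itself.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the source CSV. Schema inference scans the file; for very large
# inputs, supplying an explicit schema is noticeably faster.
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Write as Parquet. Spark applies Snappy compression by default, and the
# columnar layout is what makes that compression so effective.
df.write.mode("overwrite").parquet("data.parquet")
```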

This compression can be combined with appropriate partitioning of the data to speed up subsequent processing, especially when that processing operates on a subset of columns, while keeping disk and memory usage to a minimum.
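
A sketch of what this can look like in PySpark is given below. The partition key and column names (year, id, amount) and the paths are hypothetical and stand in for whatever partitioning scheme suits the data; the point is that queries filtering on the partition column and selecting only a few columns read just the relevant files and column chunks.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-partitioning").getOrCreate()

df = spark.read.parquet("data.parquet")

# Partition the files on disk by a chosen column (here "year");
# each distinct value becomes its own subdirectory of Parquet files.
df.write.mode("overwrite").partitionBy("year").parquet("data_partitioned.parquet")

# A later query that filters on the partition column and selects only the
# columns it needs touches only the matching partition and column chunks.
result = (
    spark.read.parquet("data_partitioned.parquet")
    .filter("year = 2020")
    .select("id", "amount")
    .groupBy("id")
    .sum("amount")
)
result.show()
```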