Tuesday, March 3, 2020

How can I tune PySpark to run quickly on small datasets (~10k rows) for unit/integration testing?

I have a huge ETL pipeline that uses Spark. For each transformation task I would like to write a simple unit test: essentially a baked input and an expected output. This will allow me to profile and refactor my codebase without fear.

I managed to do it and it works fine, but it is really slow. Even with a very small input, Spark takes a long time to run it (45 s with 16 cores, standalone Spark running inside a Docker container). I would be happy if I could get it under 10 seconds with 4 cores (laptop).

The transformation task is relatively simple; equivalent code in pandas would run in milliseconds. I managed to optimize it a little by reducing spark.default.parallelism and spark.sql.shuffle.partitions to 32 instead of the default 200, but I am wondering whether there are other configuration options or ideas to make Spark faster here. Ideally I would like to force Spark to never touch disk and to skip the overhead of fault tolerance, data consistency, and other things that are irrelevant in my scenario.
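For reference, this is roughly the kind of session configuration I am experimenting with: a minimal sketch of a SparkSession tuned for tiny local test datasets. The specific values (4 local threads, 1 shuffle partition) are assumptions for a laptop, not recommendations.

from pyspark.sql import SparkSession

def build_test_spark_session():
    # Everything runs inside the driver JVM; no cluster, no executors to launch.
    return (
        SparkSession.builder
        .master("local[4]")                                       # 4 threads on the laptop
        .appName("unit-tests")
        .config("spark.sql.shuffle.partitions", "1")              # ~10k rows never need 200 partitions
        .config("spark.default.parallelism", "4")
        .config("spark.ui.enabled", "false")                      # skip the web UI startup cost
        .config("spark.sql.catalogImplementation", "in-memory")   # no Hive metastore
        .getOrCreate()
    )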

Other suggestions related to automated testing of Spark jobs are also welcome, as material online about this subject is scarce.
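To make the question concrete, this is the shape of test I have in mind: a pytest session-scoped fixture that reuses one SparkSession across all tests (so the JVM startup cost is paid once), plus a baked-input/expected-output assertion. The function transform(df) is a hypothetical stand-in for one of my transformation tasks.

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # One shared session for the whole test run; stopped at the end.
    session = (
        SparkSession.builder
        .master("local[4]")
        .config("spark.sql.shuffle.partitions", "1")
        .getOrCreate()
    )
    yield session
    session.stop()

def test_transform(spark):
    input_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    expected = [(1, "A"), (2, "B")]

    result_df = transform(input_df)  # hypothetical transformation under test

    # Collect and compare as plain Python tuples; fine at this data size.
    result = sorted(tuple(row) for row in result_df.collect())
    assert result == sorted(expected)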
