Using Apache Spark, I was wondering whether it is really valuable to produce tests and, if so, at which level.
Reading Spark: The Definitive Guide, the authors suggest:
The business logic in your pipelines will likely change as well as the input data. Even more importantly, you want to be sure that what you’re deducing from the raw data is what you actually think that you’re deducing. This means that you’ll need to do robust logical testing with realistic data to ensure that you’re actually getting what you want out of it.
Which suggests introducing some sort of testing.
But what strikes me is:
One thing to be wary of here is trying to write a bunch of “Spark Unit Tests” that just test Spark’s functionality. You don’t want to be doing that; instead, you want to be testing your business logic and ensuring that the complex business pipeline that you set up is actually doing what you think it should be doing.
Which suggests that unit testing is discouraged by the authors of this book (correct me if I have misinterpreted them).
What is probably worth testing instead is the logic of the data transformations applied through Spark, for instance something like the sketch below.
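To make the question concrete, here is a minimal sketch of the kind of test I have in mind (pytest-style; `add_total_column` is a hypothetical transformation standing in for real business logic, not something from the book):

```python
import pytest
from pyspark.sql import SparkSession
import pyspark.sql.functions as F


def add_total_column(df):
    # Hypothetical business transformation: derive "total" from two input columns.
    return df.withColumn("total", F.col("price") + F.col("tax"))


@pytest.fixture(scope="session")
def spark():
    # Small local SparkSession used only by the tests.
    return (
        SparkSession.builder.master("local[2]").appName("transform-tests").getOrCreate()
    )


def test_add_total_column(spark):
    # Build a tiny, realistic input DataFrame and check the transformation's output.
    input_df = spark.createDataFrame([(10.0, 2.0), (5.0, 1.0)], ["price", "tax"])
    result = add_total_column(input_df).select("total").collect()
    assert [row.total for row in result] == [12.0, 6.0]
```

This tests my transformation logic, not Spark itself, which I understand to be the distinction the authors are making.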
Again from the book:
First, you might maintain a scratch space, such as an interactive notebook or some equivalent thereof, and then as you build key components and algorithms, you move them to a more permanent location like a library or package. The notebook experience is one that we often recommend (and are using to write this book) because of its simplicity in experimentation
Which suggests testing your data transformation logic in an interactive environment such as notebooks (e.g. Jupyter notebooks for PySpark): you directly see what the transformations produce, as in the example below.
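In a notebook I imagine this amounts to something as simple as the following (reusing the hypothetical `add_total_column` and SparkSession from the sketch above; the sample path is made up):

```python
# Run the transformation on a small, realistic sample and inspect the output directly.
sample_df = spark.read.parquet("/path/to/sample/data")  # hypothetical sample dataset

add_total_column(sample_df).show(5)        # eyeball a few transformed rows
add_total_column(sample_df).printSchema()  # confirm the output schema looks right
```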
So I am asking people with more experience than me: do you agree with the cited points from the book (or am I misinterpreting them)? Can they be used as a sort of best practice in this domain (for example, avoiding unit tests and instead promoting higher-level testing such as logic/integration tests)?