testing: Not waterfall testing of Big Data platform

Thursday, 3 September 2020

For simplicity, let's talk about ETL only. Let's also focus on dimensional modeling (star schema), because it is widely understood and has a clear tendency toward waterfall-style testing.

The issue

Agile sprint-based testing works well for a set of small, independent features. With a star schema, however, many of your tables depend on each other, because nearly everything is joined with everything (maybe I'm exaggerating, but it's close to the truth). Nobody attempts a deep data analysis upfront. As a result, you have to somehow explain to the business that, after everything has been implemented and closed in JIRA, the team will spend two to three months on bugs nobody knew about.

Example

Suppose your "Some Business Report" requires 10 dimensions and 2 facts to be implemented. Your team(s) implement them all and test them separately over half a year. But when everything is put together, the precision of the data is awful. Sometimes you have to spend months fixing the precision of the output.
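To make the example concrete, here is a minimal sketch of the kind of check that could be run as soon as one fact and one dimension exist, instead of waiting for the whole report. It looks for two common ways a join can distort totals: duplicate keys in the dimension (fan-out) and fact keys with no matching dimension row (orphans). The table names, the columns and the `join_health` helper are all hypothetical, and this is only one possible check, not a complete approach.

```python
from collections import Counter

def join_health(fact_rows, dim_rows, key):
    """Report two common join problems between a fact and a dimension."""
    dim_key_counts = Counter(row[key] for row in dim_rows)
    # Dimension keys that appear more than once multiply fact rows (fan-out).
    duplicate_dim_keys = sorted(k for k, n in dim_key_counts.items() if n > 1)
    # Fact keys with no matching dimension row vanish in an inner join.
    orphan_fact_keys = sorted({row[key] for row in fact_rows} - set(dim_key_counts))
    # Row count an inner join would produce, to compare with the fact row count.
    joined_rows = sum(dim_key_counts.get(row[key], 0) for row in fact_rows)
    return {
        "duplicate_dim_keys": duplicate_dim_keys,
        "orphan_fact_keys": orphan_fact_keys,
        "fact_rows": len(fact_rows),
        "joined_rows": joined_rows,
    }

# Hypothetical sample data, invented for illustration only.
dim_customer = [
    {"customer_id": 1, "segment": "retail"},
    {"customer_id": 2, "segment": "corporate"},
    {"customer_id": 2, "segment": "sme"},  # duplicate key in the dimension
]
fact_sales = [
    {"customer_id": 1, "amount": 100},
    {"customer_id": 2, "amount": 50},
    {"customer_id": 3, "amount": 70},  # no matching dimension row
]

report = join_health(fact_sales, dim_customer, "customer_id")
print(report)
```

In this sample the joined row count equals the fact row count (3 and 3) even though one sale is duplicated and another is dropped, so a row-count comparison alone would not catch the problem.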

Questions

What are the best practices for finding data-related edge cases upfront? Is it worth the effort?

Have you ever worked with a proactive approach (systematically making BAs/devs/QAs analyse data upfront) instead of a reactive one (debugging an issue when it appears)?

Additional considerations (Optional)

The standard SDLC looks like this: BA requirements => Dev unit testing => QA end-to-end testing. But it doesn't work well, because:

BAs don't like to think in terms of edge cases (and often don't have time). BAs are usually unwilling to take on QA responsibilities.
Developers don't have enough data knowledge or time to do deep edge-case analysis. They work with manually created inputs and expectations.
QAs sometimes don't have enough data knowledge. Even when they do, the approach is waterfall-like (see "The issue" section).

Mitigations I see:

Push the BA team to review requirements, document data properties at column level, and provide some validation rules. Still, it won't be very effective, IMHO.
If you are rewriting an existing system and the logic remains the same, you can replace inputs (i.e. dimensions) with those from the old system and try to reconcile outputs step by step. But re-implementing a system without any change to logic or data model is rare. Usually you can compare final results but can't replace inputs, so it is still waterfall.
Property-based testing does not seem to help with complex issues that appear after multiple joins.
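For illustration, the first two mitigations could be sketched roughly as below: column-level rules expressed as small predicates, and a per-key comparison of a measure between the old and new system outputs. The rule set, the column names, the `validate` and `reconcile` helpers and the tolerance value are all assumptions made up for this sketch. In practice such checks would more likely run in SQL or in the ETL engine itself.

```python
# Hypothetical column-level rules a BA might document for one table.
RULES = {
    "order_id": lambda v: v is not None,
    "currency": lambda v: v in {"EUR", "USD"},
    "amount": lambda v: isinstance(v, (int, float)) and v >= 0,
}

def validate(rows, rules):
    """Return (row_index, column, value) for every rule violation."""
    violations = []
    for i, row in enumerate(rows):
        for column, is_valid in rules.items():
            value = row.get(column)
            if not is_valid(value):
                violations.append((i, column, value))
    return violations

def reconcile(old_rows, new_rows, key, measure, tolerance=0.01):
    """Compare one measure per key between old and new system outputs."""
    old = {r[key]: r[measure] for r in old_rows}
    new = {r[key]: r[measure] for r in new_rows}
    diffs = {}
    for k in sorted(old.keys() | new.keys()):
        a, b = old.get(k), new.get(k)
        # A key missing on one side, or a gap above the tolerance, is a diff.
        if a is None or b is None or abs(a - b) > tolerance:
            diffs[k] = (a, b)
    return diffs

# Hypothetical sample data, invented for illustration only.
orders = [
    {"order_id": 1, "currency": "EUR", "amount": 10.0},
    {"order_id": None, "currency": "GBP", "amount": -5},
]
old_report = [
    {"region": "EU", "revenue": 100.0},
    {"region": "US", "revenue": 200.0},
]
new_report = [
    {"region": "EU", "revenue": 100.0},
    {"region": "US", "revenue": 230.0},
    {"region": "APAC", "revenue": 5.0},
]

violations = validate(orders, RULES)
diffs = reconcile(old_report, new_report, "region", "revenue")
print(violations)
print(diffs)
```

Note that `reconcile` only compares final outputs, so it tells you where the results differ but not which upstream join or dimension caused the difference.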
