stddev smoketest on product metrics + small changes #31
andrew-sha wants to merge 7 commits into main
Conversation
thecartercodes left a comment:
From a design standpoint, I've started to have some ideas.
Let's go ahead and calculate the stddev and 30d lookbacks in the model itself, i.e. `product_timeseries_metrics`. Let's also calculate the number of listings above or below the stddev on each `calendar_date`. Then we can configure a native test in the `.yml` with severity conditions, e.g.:
https://docs.getdbt.com/reference/resource-configs/severity
This would solve the issue you mentioned earlier where you can't partition by the combination of groups. What's more, we can then make these test assertions without windowed clauses in tests, and these useful statistics become part of our analytics, which I was interested in releasing anyways.
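As a rough sketch, the severity conditions from that doc could look something like this in the `.yml` (the `dbt_utils.accepted_range` test and the thresholds here are illustrative, not part of this PR; `sum_listings_outside_one_stddev` is the column name suggested later in this thread):

```yaml
models:
  - name: product_timeseries_metrics
    columns:
      - name: sum_listings_outside_one_stddev
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              config:
                severity: error   # baseline severity when error_if matches
                warn_if: ">5"     # more than 5 failing rows -> warning
                error_if: ">20"   # more than 20 failing rows -> error
```

`warn_if`/`error_if` grade the outcome by the count of failing rows, which is the severity behavior discussed below.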
To solve your issue with more of a hack, you could change the param `partition_column_name` to `partition_clause` and generate a Jinja expression in the `.yml` config that concatenates the columns together, delimited by `,`.
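For illustration only, that hack might be wired up like this — the generic test name `outside_stddev` is hypothetical, and the column list is just an example; `lookback` and `stddev_coef` are the params described in the PR summary below:

```yaml
      - name: price
        tests:
          - outside_stddev:   # hypothetical name for this PR's generic test
              # whole clause is passed through into the window's PARTITION BY
              partition_clause: "product_id, region_code, currency, unit"
              lookback: 30
              stddev_coef: 1
```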
I'm not sure how a severity test would work in this case. A severity test allows you to grade the severity of the failure based on the number of failed rows — i.e. if five rows fail the test you can throw a warning, but if more than five fail then the entire test fails. Importantly, I think you still need to define a custom test which will retrieve those failing rows in the first place. If you want each row in
Yeah, it would be a custom test with a severity condition.
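A minimal sketch of what such a custom generic test could look like, assuming the `partition_clause`, `lookback`, and `stddev_coef` params discussed above (the test name and SQL body are illustrative, not the PR's actual code):

```sql
-- Hypothetical generic test: flag rows whose value falls outside
-- stddev_coef standard deviations of the previous `lookback` records.
-- Rows returned here count as failures; severity/warn_if/error_if in the
-- .yml config then decide whether the run warns or errors.
{% test outside_stddev(model, column_name, partition_clause, lookback, stddev_coef) %}

with stats as (
    select
        *
        , avg({{ column_name }}) over (
              partition by {{ partition_clause }}
              order by calendar_date
              rows between {{ lookback }} preceding and 1 preceding
          ) as lookback_avg
        , stddev({{ column_name }}) over (
              partition by {{ partition_clause }}
              order by calendar_date
              rows between {{ lookback }} preceding and 1 preceding
          ) as lookback_stddev
    from {{ model }}
)

select *
from stats
where abs({{ column_name }} - lookback_avg) > {{ stddev_coef }} * lookback_stddev

{% endtest %}
```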
It'll be useful in a test. To your point though, the word "test" is probably better substituted with "audit". Specifically, the audit should be written to check the % of outliers relative to total product listings, which should also be in the metrics table. Importantly, lines 46-50 in
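The audit described above might be sketched like this — the column names are assumptions based on this thread's suggestions, not the PR's actual schema:

```sql
-- Sketch of an audit: % of listings outside one stddev relative to total
-- listings per calendar_date.  total_listings is a hypothetical column.
select
    calendar_date
    , sum_listings_outside_one_stddev
    , total_listings
    , 100.0 * sum_listings_outside_one_stddev / nullif(total_listings, 0)
        as pct_outside_one_stddev
from {{ ref('product_timeseries_metrics') }}
```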
```sql
, plh.price
, abs(price - avg(price) over (partition by acd.val, product_id, region_code, census_division_id, currency, unit)) as residue_cd
, abs(price - avg(price) over (partition by acd.val, product_id, region_code, currency, unit)) as residue_re
, abs(price - avg(price) over (partition by acd.val, product_id, currency, unit)) as residue_ca
```
What's the reason for directly calculating the residuals here?
The residuals have to be calculated using those windows, but then the average prices need to be calculated using grouping sets. My understanding is that using windows and grouping sets in a single query requires a bit of finesse, so I think it was logically simplest to have the windows in a subquery that the next subquery can easily refer to via those case expressions. I suspect there are a few other ways to do it though.
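A minimal sketch of the pattern being described — window functions in an inner subquery, grouping sets in the outer aggregate. The column names mirror the diff above, but the source table name and exact select list are placeholders:

```sql
with residuals as (
    select
        calendar_date
        , product_id
        , region_code
        , currency
        , unit
        , price
        -- window computed once in the subquery so the outer aggregate
        -- can refer to it without mixing windows and grouping sets
        , abs(price - avg(price) over (
              partition by calendar_date, product_id, region_code, currency, unit
          )) as residue_re
    from price_listings_history   -- placeholder source name
)

select
    calendar_date
    , product_id
    , region_code
    , currency
    , unit
    , avg(price)      as avg_price
    , avg(residue_re) as avg_residual
from residuals
group by grouping sets (
      (calendar_date, product_id, region_code, currency, unit)   -- region level
    , (calendar_date, product_id, currency, unit)                -- country level
)
```

Columns absent from a given grouping set (here, `region_code` at the country level) come back as NULL, which is what the case conditions on lines 46-50 key off.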
What exactly is the issue with lines 46-50? The query seemed to execute fine, and I believe a SQL query performs grouping before the select, so shouldn't those case conditions accurately handle the three different levels of geographic granularity?
Yeah, you're right, I didn't think the grouping sets' returned results were what lines 46-50 get called with. This also proves the behavior -
ganymede/models/wolfram/wolfram.yml (Outdated)
```yaml
- accepted_values:
    values: ['NL', 'PE', 'NS', 'NB', 'QC', 'ON', 'MB', 'SK', 'AB', 'BC', 'YT', 'NT', 'NU']
- name: sum_outside_one_stddev
```
```diff
- - name: sum_outside_one_stddev
+ - name: sum_listings_outside_one_stddev
```
Contains a `tests/` directory for custom tests. Inside are two things: a singular test and a generic test, which flag values in the `column_name` column that fall outside `stddev_coef` standard deviations of the previous `lookback` records in the `product_timeseries_metrics` table.

NOTE: The `product_timeseries_metrics` records have some quirks — they can be grouped according to different geographic levels (country, province, census division) as well as different units (lbs, /1kg). To perform an actually meaningful test of this kind, the records need to be partitioned by all of these columns. Otherwise, we would end up doing weird things like comparing the daily price from a specific census division, such as Wellington, against the past month's prices in that specific division but also in all of Ontario and all of Canada. I didn't know how to design the generic test to accommodate all these extra dimensions, so I just made a singular test. But there is also the generic test in case we ever want to use it.

Also, my linter isn't working, so I will eventually recommit this when I fix it :(