Builtin Linting Rule: cronintervalalignment#5091
Conversation
sqlmesh/core/linter/rules/builtin.py
Outdated
| continue | ||
|
|
||
| # Skip model kinds since they don't run on cron schedules | ||
| skip_kinds = ["EXTERNAL", "EMBEDDED", "SEED"] |
There was a problem hiding this comment.
I suggest using enum values instead of hardcoded strings (ModelKindName)
sqlmesh/core/linter/rules/builtin.py
Outdated
| ) | ||
|
|
||
|
|
||
| class CronValidator(Rule): |
There was a problem hiding this comment.
Should the name be more specific? I feel like there can be multiple approaches to validating cron
There was a problem hiding this comment.
What about this? CronAlignmentValidator
There was a problem hiding this comment.
Or CronIntervalAlignment
There was a problem hiding this comment.
the last one is good, thanks
tests/core/linter/test_builtin.py
Outdated
|
|
||
| downstream_model = load_sql_based_model( | ||
| d.parse( | ||
| "MODEL (name memory.sushi.step_c, cron '@daily', depends_on ['memory.sushi.step_1', 'memory.sushi.step_b']); SELECT * FROM (SELECT 1)" |
There was a problem hiding this comment.
ah shoot, I need to update step_1 to step_a
sqlmesh/core/linter/rules/builtin.py
Outdated
|
|
||
| upstream_model_cron_next = upstream_model.cron_next(placeholder_start_date) | ||
|
|
||
| if upstream_model_cron_next > this_model_cron_next: |
There was a problem hiding this comment.
I thought about this more and I don't think this is the right way to handle this. Imagine graph A (upstream) <- B (downstream) with crons
- A:
5 * * * *(run at 5th minute of every hour) - B:
0 * * * *(run every hour)
Both models are hourly, so technically the upstream's cron interval is not "longer" as the violation suggests. However, B will run before A which might not be the desired behavior.
But if it was this instead:
- A:
15 10 * * *(run at 10:15AM of every day) - B:
0 * * * *(run every hour)
Your check will return false given the current placeholder start date, because cron_next for A will be 10:15 but for B it will be 11:00. In this case upstream is daily, and its interval is indeed longer than Bs (hourly). You probably want your check to fail but it won't.
So if the length of the interval is what you're trying to capture here then you want to do the following instead:
upstream_cron_interval_unit = IntervalUnit.from_cron(upstream_model.cron)
this_cron_interval_unit = IntervalUnit.from_cron(model.cron)
if upstream_cron_interval_unit.seconds > this_cron_interval_unit.seconds:
# violation
and move away from the placehoder date.
There was a problem hiding this comment.
This was actually my first approach, but it won't work in isolation because the IntervalUnit is limited: https://github.com/TobikoData/sqlmesh/blob/5861dbaac291113b504380456cc1b38a897611cc/sqlmesh/core/node.py#L39-L45
For example, @weekly and @daily will be interpreted as being equal in interval unit seconds because they are both categorized as DAY.
upstream_cron_interval_unit = IntervalUnit.from_cron(upstream_model.cron)
this_cron_interval_unit = IntervalUnit.from_cron(model.cron)
if upstream_cron_interval_unit.seconds > this_cron_interval_unit.seconds:
# violationNew Approach: Use an OR condition where if upstream model's next cron is greater OR its interval unit in seconds is greater, return a violation.
Example in action for your first scenario: You'll notice the placeholder date implementation works as expected. This is NOT a rule violation because model B runs AFTER model A.
A: 5 * * * * (run at 5th minute of every hour)
B: 0 * * * * (run every hour)
(Pdb) upstream_cron_interval_unit.seconds
3600
(Pdb) this_cron_interval_unit.seconds
3600
(Pdb) upstream_model_cron_next
datetime.datetime(2020, 1, 1, 10, 5, tzinfo=datetime.timezone.utc)
(Pdb) this_model_cron_next
datetime.datetime(2020, 1, 1, 11, 0, tzinfo=datetime.timezone.utc)There was a problem hiding this comment.
Note: it's important WHEN models are linted because when it's to deployed to prod determines the next run. For nuanced cron schedules, a model applied at a specific time can emit violation vs. not.
…g/lint-cron-validator
|
Closing this to prioritize other work |
This checks whether cron schedules will behave as expected. This prevents misalignment in scheduling expectations.
For example, if I have a DAG like this: step_1(
@weekly) -> step_2(@daily) -> step_3(*/5 * * * *)step_2 and step_3 will only process empty intervals because they're anchored to the intervals processed in step_1. This rule prevents confusion on why step_2 and step_3 are wastefully backfilling empty intervals. The fix is to align the schedules where a downstream model's cron is the same or has a longer interval than an upstream model's.
This rule allows the below DAG for step_2 as it's a valid use case. step_2 should still run daily and not emit a rule violation because it's processing backfilled intervals from step_1b. step_3 will still violate this linting rule.
[step_1a(
@weekly). step_1b(@hourly)] -> step_2(@daily) -> step_3(*/5 * * * *)In use:
