Skip to content

Builtin Linting Rule: cronintervalalignment#5091

Closed
sungchun12 wants to merge 17 commits intomainfrom
sung/lint-cron-validator
Closed

Builtin Linting Rule: cronintervalalignment#5091
sungchun12 wants to merge 17 commits intomainfrom
sung/lint-cron-validator

Conversation

@sungchun12
Copy link
Contributor

@sungchun12 sungchun12 commented Aug 4, 2025

This checks whether cron schedules will behave as expected. This prevents misalignment in scheduling expectations.

For example, if I have a DAG like this: step_1(@weekly) -> step_2(@daily) -> step_3(*/5 * * * *)

step_2 and step_3 will only process empty intervals because they're anchored to the intervals processed in step_1. This rule prevents confusion on why step_2 and step_3 are wastefully backfilling empty intervals. The fix is to align the schedules where a downstream model's cron is the same or has a longer interval than an upstream model's.

This rule allows the below DAG for step_2 as it's a valid use case. step_2 should still run daily and not emit a rule violation because it's processing backfilled intervals from step_1b. step_3 will still violate this linting rule.

[step_1a(@weekly). step_1b(@hourly)] -> step_2(@daily) -> step_3(*/5 * * * *)

In use:
image

image

@sungchun12 sungchun12 requested a review from a team August 4, 2025 19:22
continue

# Skip model kinds since they don't run on cron schedules
skip_kinds = ["EXTERNAL", "EMBEDDED", "SEED"]
Copy link
Collaborator

@izeigerman izeigerman Aug 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest using enum values instead of hardcoded strings (ModelKindName)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

)


class CronValidator(Rule):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the name be more specific? I feel like there can be multiple approaches to validating cron

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about this? CronAlignmentValidator

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or CronIntervalAlignment

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the last one is good, thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@sungchun12 sungchun12 requested a review from izeigerman August 4, 2025 19:53
@sungchun12 sungchun12 changed the title Builtin Linting Rule: cronvalidator Builtin Linting Rule: cronintervalalignment Aug 4, 2025

downstream_model = load_sql_based_model(
d.parse(
"MODEL (name memory.sushi.step_c, cron '@daily', depends_on ['memory.sushi.step_1', 'memory.sushi.step_b']); SELECT * FROM (SELECT 1)"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah shoot, I need to update step_1 to step_a


upstream_model_cron_next = upstream_model.cron_next(placeholder_start_date)

if upstream_model_cron_next > this_model_cron_next:
Copy link
Collaborator

@izeigerman izeigerman Aug 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this more and I don't think this is the right way to handle this. Imagine graph A (upstream) <- B (downstream) with crons

  • A: 5 * * * * (run at 5th minute of every hour)
  • B: 0 * * * * (run every hour)

Both models are hourly, so technically the upstream's cron interval is not "longer" as the violation suggests. However, B will run before A which might not be the desired behavior.

But if it was this instead:

  • A: 15 10 * * * (run at 10:15AM of every day)
  • B: 0 * * * * (run every hour)

Your check will return false given the current placeholder start date, because cron_next for A will be 10:15 but for B it will be 11:00. In this case upstream is daily, and its interval is indeed longer than Bs (hourly). You probably want your check to fail but it won't.

So if the length of the interval is what you're trying to capture here then you want to do the following instead:

upstream_cron_interval_unit = IntervalUnit.from_cron(upstream_model.cron)
this_cron_interval_unit = IntervalUnit.from_cron(model.cron)

if upstream_cron_interval_unit.seconds > this_cron_interval_unit.seconds:
    # violation

and move away from the placehoder date.

Copy link
Contributor Author

@sungchun12 sungchun12 Aug 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was actually my first approach, but it won't work in isolation because the IntervalUnit is limited: https://github.com/TobikoData/sqlmesh/blob/5861dbaac291113b504380456cc1b38a897611cc/sqlmesh/core/node.py#L39-L45

For example, @weekly and @daily will be interpreted as being equal in interval unit seconds because they are both categorized as DAY.

upstream_cron_interval_unit = IntervalUnit.from_cron(upstream_model.cron)
this_cron_interval_unit = IntervalUnit.from_cron(model.cron)

if upstream_cron_interval_unit.seconds > this_cron_interval_unit.seconds:
    # violation

New Approach: Use an OR condition where if upstream model's next cron is greater OR its interval unit in seconds is greater, return a violation.

Example in action for your first scenario: You'll notice the placeholder date implementation works as expected. This is NOT a rule violation because model B runs AFTER model A.

A: 5 * * * * (run at 5th minute of every hour)
B: 0 * * * * (run every hour)

(Pdb) upstream_cron_interval_unit.seconds
3600
(Pdb) this_cron_interval_unit.seconds
3600
(Pdb) upstream_model_cron_next
datetime.datetime(2020, 1, 1, 10, 5, tzinfo=datetime.timezone.utc)
(Pdb) this_model_cron_next
datetime.datetime(2020, 1, 1, 11, 0, tzinfo=datetime.timezone.utc)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: it's important WHEN models are linted because when it's to deployed to prod determines the next run. For nuanced cron schedules, a model applied at a specific time can emit violation vs. not.

@sungchun12 sungchun12 requested a review from izeigerman August 6, 2025 17:27
@sungchun12 sungchun12 closed this Aug 7, 2025
@sungchun12
Copy link
Contributor Author

Closing this to prioritize other work

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants