@junngo junngo commented Dec 26, 2025

Hi there :)
I worked on an improvement for the cycling data job. (https://bugzilla.mozilla.org/show_bug.cgi?id=1944375)

Issue & Background:

The cycling data job currently takes about 4-12 hours to complete. The main cause is the PerformanceDatumReplicate table, which has a foreign key to the PerformanceDatum table[0].
When PerformanceDatum rows are deleted, the corresponding PerformanceDatumReplicate rows are automatically removed by the Django ORM's cascade behavior.

During the cycling job:

  • The job selects up to 10,000 rows from the leading table (PerformanceDatum) per chunk.
  • For each chunk, the related rows in PerformanceDatumReplicate are resolved as part of the deletion process, with a query like:
    DELETE FROM performance_datum_replicate
    WHERE performance_datum_id IN ($1, $2, $3, $4, $5, ..., $10000)
    
  • PerformanceDatumReplicate currently contains over 500 million rows.
  • In certain query plans, this results in a sequential (full) table scan on PerformanceDatumReplicate.

Although this query runs only once per chunk, a single execution can take 3-4 minutes.
Since the cycling job repeats this logic across many chunks, the total execution time grows to several hours.
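
As a rough back-of-envelope check of the figures above, the sketch below works out how many chunks would account for the reported runtime. It assumes the per-chunk replicate delete dominates the job's total time, which is a simplification; the chunk counts it prints are derived estimates, not measured values. The function names are hypothetical.

```python
# Back-of-envelope estimate using only figures quoted in this description.
# Assumption: the per-chunk delete on the replicate table dominates runtime.


def implied_chunks(total_hours: float, minutes_per_chunk: float) -> float:
    """Number of chunks that would account for the given total runtime."""
    return total_hours * 60 / minutes_per_chunk


def projected_minutes(chunks: float, seconds_per_chunk: float) -> float:
    """Total runtime in minutes for a given per-chunk cost."""
    return chunks * seconds_per_chunk / 60


low = implied_chunks(4, 4)     # shortest run, slowest chunk  -> 60.0
high = implied_chunks(12, 3)   # longest run, fastest chunk   -> 240.0
print(f"implied chunks per run: {low:.0f} to {high:.0f}")

# Using the upper bound of 10 seconds per chunk reported in the test results:
print(f"projected worst case: {projected_minutes(high, 10):.0f} minutes")  # 40
```

Under these assumptions, even the largest implied run would drop from hours to well under an hour if every chunk stays within the observed 10-second bound.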

This PR focuses on improving that behavior by avoiding repeated full table scans and reducing the overall runtime of the cycling data job.

Notes

  1. Reason for using a raw query instead of the Django ORM
    • PerformanceDatumReplicate rows are automatically removed via Django ORM cascade behavior when PerformanceDatum rows are deleted.
    • In order to control the deletion process and influence the query plan at the SQL level, this logic needs to be implemented using a raw query rather than the ORM.
  2. Reason for keeping duplicate condition checks in the del_replicate CTE
    • Although target_datum already enforces these conditions, the same filters are kept in del_replicate to guide the PostgreSQL planner.
    • In practice, this helps the planner choose a more stable execution plan and consistently use index-based access paths for performance_datum_replicate.
  3. Reason for using EXISTS in the del_replicate CTE
    • The EXISTS clause is used to check whether replicate rows exist before touching performance_datum_replicate.
    • This avoids unnecessary access to the replicate table and is generally cheaper than performing a join or scan.
  4. Reason for using ORDER BY in the target_datum CTE of the MainRemovalStrategy
    • Using ORDER BY allows the planner to more consistently leverage index scans.
  5. Test results
    • The query plan was checked in Redash using EXPLAIN.
    • The resulting plans consistently used index scans instead of sequential table scans.
    • Execution time was approximately 0-10 seconds per chunk.
    • Actual performance may vary depending on the current state of the database.
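
To make notes 1-4 more concrete, here is a minimal sketch of what a chunked raw query with target_datum and del_replicate CTEs could look like. It is not the query from this PR: the performance_datum table name, the id and push_timestamp columns, the cutoff filter, the ORDER BY column, and the function name cycle_old_data are all assumptions for illustration. Only performance_datum_replicate, performance_datum_id, the CTE names, and the 10,000-row chunk size come from the description above.

```python
# Hypothetical sketch only. Names other than performance_datum_replicate,
# performance_datum_id, target_datum and del_replicate are assumptions.
CYCLE_CHUNK_SQL = """
WITH target_datum AS (
    -- One chunk of expired parent rows (note 4: ORDER BY on an indexed column).
    SELECT id
    FROM performance_datum
    WHERE push_timestamp < %(cutoff)s
    ORDER BY id
    LIMIT %(chunk_size)s
),
del_replicate AS (
    -- Delete child rows explicitly instead of relying on the ORM cascade.
    DELETE FROM performance_datum_replicate
    WHERE performance_datum_id IN (
        SELECT d.id
        FROM performance_datum AS d
        JOIN target_datum AS t ON t.id = d.id
        WHERE d.push_timestamp < %(cutoff)s  -- note 2: filter repeated on purpose
          AND EXISTS (                       -- note 3: only parents with replicates
              SELECT 1
              FROM performance_datum_replicate AS r
              WHERE r.performance_datum_id = d.id
          )
    )
)
DELETE FROM performance_datum
WHERE id IN (SELECT id FROM target_datum)
"""


def cycle_old_data(cursor, cutoff, chunk_size=10_000):
    """Run the chunked delete until no parent rows remain; return rows deleted.

    `cursor` is any DB-API cursor that accepts pyformat parameters.
    The row count is expected to reflect the outer DELETE on the parent table.
    """
    total = 0
    while True:
        cursor.execute(CYCLE_CHUNK_SQL, {"cutoff": cutoff, "chunk_size": chunk_size})
        deleted = cursor.rowcount
        if deleted <= 0:  # nothing left to cycle (or count unavailable)
            break
        total += deleted
    return total
```

With Django, the cursor could come from `django.db.connection.cursor()`. Data-modifying CTEs are PostgreSQL-specific, so a sketch like this would not run on other backends.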

[0]

performance_datum = models.ForeignKey(PerformanceDatum, on_delete=models.CASCADE)

[1] Only as a reference: #9107

@junngo junngo marked this pull request as ready for review December 30, 2025 08:25