Bug 1944375 - Improve data cycling/deletion for performance datum replicates #9136
Hi there :)
I worked on an improvement for the cycling data job. (https://bugzilla.mozilla.org/show_bug.cgi?id=1944375)
Issue & Background:
The cycling data job currently takes about 4-12 hours to complete. The main cause is related to the `PerformanceDatumReplicate` table. `PerformanceDatumReplicate` has a foreign key relationship to the `PerformanceDatum` table [0]. When `PerformanceDatum` rows are deleted, the corresponding `PerformanceDatumReplicate` rows are automatically removed via Django ORM cascade behavior.

During the cycling job:
- […] `PerformanceDatum`) per chunk.
- Related rows in `PerformanceDatumReplicate` are resolved as part of the deletion process.
- `PerformanceDatumReplicate` currently contains over 500 million rows.
- This resolution triggers a sequential (full) table scan on `PerformanceDatumReplicate`. Even when executed only once per chunk, it can take 3-4 minutes.
Since the cycling job runs this logic repeatedly across many chunks, the total execution time grows to several hours.
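To make the per-chunk cost concrete, here is a minimal sketch of a chunked delete with an emulated application-level cascade. It uses an in-memory SQLite database rather than Treeherder's real schema: the table names mirror the models discussed above, but the columns (`push_timestamp`, `performance_datum_id`, `value`) and both helper functions are assumptions made for illustration, not the project's code.

```python
import sqlite3


def build_demo_db(n_datums=10, replicates_per_datum=2):
    """Create an in-memory database with hypothetical, simplified tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE performance_datum ("
        "id INTEGER PRIMARY KEY, push_timestamp INTEGER)"
    )
    conn.execute(
        "CREATE TABLE performance_datum_replicate ("
        "id INTEGER PRIMARY KEY, performance_datum_id INTEGER, value REAL)"
    )
    for ts in range(n_datums):
        cur = conn.execute(
            "INSERT INTO performance_datum (push_timestamp) VALUES (?)", (ts,)
        )
        for _ in range(replicates_per_datum):
            conn.execute(
                "INSERT INTO performance_datum_replicate "
                "(performance_datum_id, value) VALUES (?, ?)",
                (cur.lastrowid, 1.0),
            )
    return conn


def cycle_in_chunks(conn, cutoff, chunk_size):
    """Delete expired datum rows chunk by chunk; return the number of chunks."""
    chunks = 0
    while True:
        ids = [
            row[0]
            for row in conn.execute(
                "SELECT id FROM performance_datum "
                "WHERE push_timestamp < ? ORDER BY id LIMIT ?",
                (cutoff, chunk_size),
            )
        ]
        if not ids:
            return chunks
        marks = ",".join("?" * len(ids))
        # Emulated cascade: the child table is touched once per chunk.
        conn.execute(
            "DELETE FROM performance_datum_replicate "
            f"WHERE performance_datum_id IN ({marks})",
            ids,
        )
        # Only then are the parent rows themselves removed.
        conn.execute(f"DELETE FROM performance_datum WHERE id IN ({marks})", ids)
        chunks += 1
```

Every loop iteration issues one statement against the replicate table, so whatever that statement costs is multiplied by the number of chunks. On the toy data this is instant; the problem described above arises when that one statement turns into a multi-minute scan.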
This PR focuses on improving that behavior by avoiding repeated full table scans and reducing the overall runtime of the cycling data job.
Notes
- `PerformanceDatumReplicate` rows are automatically removed via Django ORM cascade behavior when `PerformanceDatum` rows are deleted.
- `del_replicate` CTE: although `target_datum` already enforces these conditions, keeping the same filters in `del_replicate` helps the Postgres planner choose a more stable execution plan. […] `performance_datum_replicate`.
- `EXISTS` in the `del_replicate` CTE: the `EXISTS` clause is used to check whether replicate rows exist before touching `performance_datum_replicate`.
- `ORDER BY` in the `target_datum` CTE of the `MainRemovalStrategy`: `ORDER BY` allows the planner to more consistently leverage index scans.

[0] treeherder/treeherder/perf/models.py, line 276 in a82c683
[1] Only as a reference: #9107
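
For readers who want a picture of the CTE layout the notes describe, the following is a rough sketch and not the SQL from this PR. Only the names `target_datum`, `del_replicate`, and `performance_datum_replicate` come from the description. The `performance_datum` table name, the `push_timestamp` and `performance_datum_id` columns, the single cutoff filter, the exact placement of `EXISTS`, and the `build_cycle_sql` helper are all assumptions. It targets PostgreSQL, which allows data-modifying statements inside `WITH`.

```python
def build_cycle_sql(cutoff, chunk_size):
    """Return (sql, params) for one chunk; hypothetical helper, assumed schema."""
    sql = """
    WITH target_datum AS (
        -- Pick one chunk of expired rows; ORDER BY keeps the choice stable.
        SELECT id
        FROM performance_datum
        WHERE push_timestamp < %(cutoff)s
        ORDER BY id
        LIMIT %(chunk_size)s
    ),
    del_replicate AS (
        -- Remove replicates explicitly instead of relying on the ORM cascade.
        DELETE FROM performance_datum_replicate AS r
        WHERE EXISTS (
            SELECT 1
            FROM target_datum AS t
            JOIN performance_datum AS d ON d.id = t.id
            WHERE t.id = r.performance_datum_id
              -- Same filter as target_datum, repeated on purpose.
              AND d.push_timestamp < %(cutoff)s
        )
    )
    DELETE FROM performance_datum AS d
    USING target_datum AS t
    WHERE d.id = t.id
    """
    params = {"cutoff": cutoff, "chunk_size": chunk_size}
    return sql, params
```

The returned pair uses named `%(name)s` placeholders, so it could be passed to a DB-API cursor such as Django's `connection.cursor().execute(sql, params)`. Whether the planner actually prefers an index scan for this shape would need to be confirmed with `EXPLAIN` against production-sized data.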