fix: wait for cluster to stabilize before creating podmonitor #47
ardentperf wants to merge 1 commit into cloudnative-pg:main
Conversation
# cf. https://github.com/cloudnative-pg/cnpg-playground/issues/46
if check_crd_existence podmonitors.monitoring.coreos.com
then
  sleep 5
Probably, to make sure that the podmonitors are going to be created, all that is required is to make sure that the CRDs exist? Why would waiting 5 seconds here fix an issue that isn't related to the cnpg cluster? The podmonitor object will be created by the prometheus operator; it is weird that you need to wait here.
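The reviewer's suggestion could be made deterministic with a polling helper instead of a fixed sleep. A minimal sketch; `wait_for` is a hypothetical helper, not part of the playground scripts:

```shell
# wait_for: poll a command until it succeeds or a timeout (in seconds) elapses.
# Hypothetical helper; names and timeouts are illustrative.
wait_for() {
  local timeout=$1
  shift
  local elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1          # gave up waiting
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
}
```

In the setup script this could replace the fixed sleep, e.g. `wait_for 60 check_crd_existence podmonitors.monitoring.coreos.com`.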
FWIW, this cleanly reproduces on GH runners with the CI/CD test framework from #48, which is very close to being a repro on main (#48 doesn't change relevant existing code; it just adds new code and tests). https://github.com/ardentperf/cnpg-playground/actions/runs/20797196550/job/59733660471
Note that the only difference between PR #48 having a successful run of the test and this failed run is that branch repro-issue-46 doesn't include the commit from this PR.
From lines 682 and 687 in the GH action test output, we can see that the podmonitors disappear after cnpg finishes creating the clusters.
Yes, it's weird that the 5 second sleep and ordering change work as a remediation. If you have cycles to debug, that would be great; I wasn't able to figure it out yet, and this PR seems harmless enough until someone gets an RCA and the underlying issue is fixed. I wonder if there's a race condition where the CNPG operator itself is removing the podmonitor.
Isn't the condition above enough? Can we try to remove the 5 second sleep here?
IIRC I needed the sleep, but since this reproduces easily I can run a test in a GitHub action that switches the order without the 5 second pause, and post a link to the test results here.
Force-pushed from cfb175b to 38f93b6
Rebased and resolved conflicts with #53
Force-pushed from 38f93b6 to 96ad03e
This might actually be an instance of cloudnative-pg/cloudnative-pg#6109. I'm using the nix devshell, which pins the version of the kubectl cnpg plugin; I wonder if this might also be pinning the version of the cnpg operator to 1.24.0 based on the version of the plugin. Will check if the issue reproduces with trunk.
I think it was in fact the podmonitor naming. Closing this PR in favor of #57.
There appears to be a timing-related bug or race condition where the
first kubectl apply for PodMonitors during the setup script execution
reports success (HTTP 201 Created from the API server) but the resource
doesn't persist. The exact root cause is unclear, but moving podmonitor
creation to the end and giving the cluster a few seconds to stabilize
before creating podmonitors seems to work around the issue.
Closes #46
Signed-off-by: Jeremy Schneider <schneider@ardentperf.com>
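The failure described above, where `kubectl apply` reports success but the object doesn't persist, could also be worked around with a verify-and-retry step. A minimal sketch, assuming `kubectl` is on the PATH and the prometheus-operator CRDs are installed; the manifest path, object name, and retry count are illustrative, not taken from the playground scripts:

```shell
# apply_and_verify: apply a PodMonitor manifest, then re-check after a short
# delay and retry the apply if the object vanished. Hypothetical helper.
apply_and_verify() {
  local manifest=$1 name=$2 namespace=$3
  local attempt
  for attempt in 1 2 3; do
    kubectl apply -f "$manifest" || return 1
    sleep 1   # give the API server and operators a moment
    # verify the object still exists even though the apply reported success
    if kubectl get podmonitor "$name" -n "$namespace" >/dev/null 2>&1; then
      return 0
    fi
  done
  return 1
}
```

This keeps the fixed-order workaround from the commit but turns the silent disappearance into a detectable, retried failure rather than relying on a sleep alone.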