fix: wait for cluster to stabilize before creating podmonitor #47
ardentperf wants to merge 1 commit into cloudnative-pg:main
Conversation
# cf. https://github.com/cloudnative-pg/cnpg-playground/issues/46
if check_crd_existence podmonitors.monitoring.coreos.com
then
  sleep 5
Probably, to make sure that the podmonitors are going to be created, all that is required is to make sure that the CRDs exist? Why would waiting 5 seconds here fix an issue that isn't related to the cnpg cluster? The podmonitor object will be created by the prometheus operator; it is weird that you need to wait here.
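The reviewer's suggestion could be made deterministic with a polling helper instead of a fixed sleep. A minimal sketch; `wait_for` is a hypothetical helper, not part of the playground scripts:

```shell
# wait_for: poll a command until it succeeds or a timeout (in seconds) elapses.
# Hypothetical helper; names and timeouts are illustrative.
wait_for() {
  local timeout=$1
  shift
  local elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1          # gave up waiting
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
}
```

In the setup script this could replace the fixed sleep, e.g. `wait_for 60 check_crd_existence podmonitors.monitoring.coreos.com`.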
FWIW, this cleanly reproduces on GH runners with the CI/CD test framework from #48, which is very close to being a repro on main (#48 doesn't change relevant existing code; it just adds new code and tests). https://github.com/ardentperf/cnpg-playground/actions/runs/20797196550/job/59733660471
Note that the only difference between PR #48 having a successful run of the test and this failed run is that branch repro-issue-46 doesn't include the commit from this PR.
From lines 682 and 687 in the GH action test output, we can see that the podmonitors disappear after cnpg finishes creating the clusters.
Yes, it's weird that the 5 second sleep and ordering change work as a remediation. If you have cycles to debug, that would be great; I wasn't able to figure it out yet, and this PR seems harmless enough until someone gets an RCA and the underlying issue is fixed. I wonder if there's a race condition where the CNPG operator itself is removing the podmonitor.
Isn't the condition above enough? Can we try to remove the 5 second sleep here?
IIRC I needed the sleep, but since this reproduces easily I can run a test in a GitHub action that switches the order without the 5 second pause, and post a link to the test results here.
Force-pushed from cfb175b to 38f93b6
Rebased and resolved conflicts with #53
Force-pushed from 38f93b6 to 96ad03e
This might actually be an instance of cloudnative-pg/cloudnative-pg#6109. I'm using the nix devshell, which pins the version of the kubectl cnpg plugin; I wonder if this might also be pinning the version of the cnpg operator to 1.24.0 based on the version of the plugin. Will check if the issue reproduces with trunk.
I think it was in fact the podmonitor naming. Closing this PR in favor of #57.
There appears to be a timing-related bug or race condition where the
first kubectl apply for PodMonitors during the setup script execution
reports success (HTTP 201 Created from the API server) but the resource
doesn't persist. The exact root cause is unclear, but moving podmonitor
creation to the end and giving the cluster a few seconds to stabilize
before creating podmonitors seems to work around the issue.
Closes #46
Signed-off-by: Jeremy Schneider <schneider@ardentperf.com>
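The failure described above, where `kubectl apply` reports success but the object doesn't persist, could also be worked around with a verify-and-retry step. A minimal sketch, assuming `kubectl` is on the PATH and the prometheus-operator CRDs are installed; the manifest path, object name, and retry count are illustrative, not taken from the playground scripts:

```shell
# apply_and_verify: apply a PodMonitor manifest, then re-check after a short
# delay and retry the apply if the object vanished. Hypothetical helper.
apply_and_verify() {
  local manifest=$1 name=$2 namespace=$3
  local attempt
  for attempt in 1 2 3; do
    kubectl apply -f "$manifest" || return 1
    sleep 1   # give the API server and operators a moment
    # verify the object still exists even though the apply reported success
    if kubectl get podmonitor "$name" -n "$namespace" >/dev/null 2>&1; then
      return 0
    fi
  done
  return 1
}
```

This keeps the fixed-order workaround from the commit but turns the silent disappearance into a detectable, retried failure rather than relying on a sleep alone.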