fix: override default fleet-agent metrics and health bind addresses #2173
Merged
salasberryfin merged 1 commit into rancher:main on Mar 3, 2026
Signed-off-by: Andrea Mazzotti <andrea.mazzotti@suse.com>
yiannistri
approved these changes
Mar 3, 2026
salasberryfin
approved these changes
Mar 3, 2026
What this PR does / why we need it:
Original finding: #2154 (comment)
I found out that on GKE the `fleet-agent` fails to initialize correctly since the metrics port `8080` is already bound (also see the reserved port list).
{"level":"error","ts":"2026-02-26T15:32:50Z","logger":"setup","msg":"problem running manager","error":"failed to start metrics server: failed to create listener: listen tcp :8080: bind: address already in use","stacktrace":"github.com/rancher/fleet/internal/cmd/agent.start\n\t/home/runner/_work/fleet/fleet/internal/cmd/agent/operator.go:181\ngithub.com/rancher/fleet/internal/cmd/agent.(*FleetAgent).Run.func1\n\t/home/runner/_work/fleet/fleet/internal/cmd/agent/root.go:143"}
{"level":"error","ts":"2026-02-26T15:32:50Z","logger":"setup","msg":"failed to start agent","error":"failed to start metrics server: failed to create listener: listen tcp :8080: bind: address already in use","stacktrace":"github.com/rancher/fleet/internal/cmd/agent.(*FleetAgent).Run.func1\n\t/home/runner/_work/fleet/fleet/internal/cmd/agent/root.go:144"}
{"level":"error","ts":"2026-02-26T15:32:50Z","logger":"controller-runtime.source.Kind","msg":"failed to get informer from cache","error":"Timeout: failed waiting for *v1alpha1.BundleDeployment Informer to sync","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1.1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.22.4/pkg/internal/source/kind.go:80\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1\n\t/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.35.0/pkg/util/wait/loop.go:53\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext\n\t/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.35.0/pkg/util/wait/loop.go:54\nk8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel\n\t/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.35.0/pkg/util/wait/poll.go:33\nsigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.22.4/pkg/internal/source/kind.go:68"}
{"level":"error","ts":"2026-02-26T15:33:05Z","logger":"clusterstatus","msg":"failed to report initial cluster status","cluster":"cluster-gke-pq8v6b","interval":900,"error":"client rate limiter Wait returned an error: context canceled","stacktrace":"github.com/rancher/fleet/internal/cmd/agent/clusterstatus.Ticker.func1\n\t/home/runner/_work/fleet/fleet/internal/cmd/agent/clusterstatus/ticker.go:42"}
This can be configured using fleet-agent environment variables, however there are two issues:
So I see no other way than changing this for all Clusters and for all rancher-turtles-providers chart users.
This is an opinionated choice; however, since we also use the `hostNetwork` setting, trying to bind to `18080` and `18081` is probably safer in most cases.
This has the consequence of rolling out the `fleet-agent` again on already provisioned Clusters so that it binds to the newly set ports, which is surely going to be an unexpected change for current users.
Chart configuration values have been added so that users can default back to `8080` and `8081` if they wish to.
Test run that includes this change: https://github.com/rancher/turtles/actions/runs/22565380279/job/65360516383
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):
Fixes #
Special notes for your reviewer:
Checklist: