fix: override default fleet-agent metrics and health bind addresses#2173

Merged
salasberryfin merged 1 commit intorancher:mainfrom
anmazzotti:update_fleet_agent_default_ports
Mar 3, 2026

@anmazzotti anmazzotti commented Mar 2, 2026

What this PR does / why we need it:

Original finding: #2154 (comment)

On GKE, the fleet-agent fails to initialize because the metrics port 8080 is already bound (see also the reserved port list).

{"level":"error","ts":"2026-02-26T15:32:50Z","logger":"setup","msg":"problem running manager","error":"failed to start metrics server: failed to create listener: listen tcp :8080: bind: address already in use","stacktrace":"github.com/rancher/fleet/internal/cmd/agent.start\n\t/home/runner/_work/fleet/fleet/internal/cmd/agent/operator.go:181\ngithub.com/rancher/fleet/internal/cmd/agent.(*FleetAgent).Run.func1\n\t/home/runner/_work/fleet/fleet/internal/cmd/agent/root.go:143"}
{"level":"error","ts":"2026-02-26T15:32:50Z","logger":"setup","msg":"failed to start agent","error":"failed to start metrics server: failed to create listener: listen tcp :8080: bind: address already in use","stacktrace":"github.com/rancher/fleet/internal/cmd/agent.(*FleetAgent).Run.func1\n\t/home/runner/_work/fleet/fleet/internal/cmd/agent/root.go:144"}
{"level":"error","ts":"2026-02-26T15:32:50Z","logger":"controller-runtime.source.Kind","msg":"failed to get informer from cache","error":"Timeout: failed waiting for *v1alpha1.BundleDeployment Informer to sync","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1.1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.22.4/pkg/internal/source/kind.go:80\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1\n\t/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.35.0/pkg/util/wait/loop.go:53\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext\n\t/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.35.0/pkg/util/wait/loop.go:54\nk8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel\n\t/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.35.0/pkg/util/wait/poll.go:33\nsigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.22.4/pkg/internal/source/kind.go:68"}
{"level":"error","ts":"2026-02-26T15:33:05Z","logger":"clusterstatus","msg":"failed to report initial cluster status","cluster":"cluster-gke-pq8v6b","interval":900,"error":"client rate limiter Wait returned an error: context canceled","stacktrace":"github.com/rancher/fleet/internal/cmd/agent/clusterstatus.Ticker.func1\n\t/home/runner/_work/fleet/fleet/internal/cmd/agent/clusterstatus/ticker.go:42"}
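For reviewers who want to see the same class of error locally, here is a minimal Go sketch. It is not fleet-agent code: `bindTwice` is a hypothetical helper that opens a listener on an ephemeral loopback port and then tries to open a second one on the same address. On Linux the second attempt is expected to fail with an "address already in use" error, the same failure reported for `:8080` in the log above.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
)

// bindTwice listens on an ephemeral loopback port, then tries to listen on
// the same address again while the first listener is still open.
// It returns the error from the second attempt (nil if it succeeded).
func bindTwice() error {
	first, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return fmt.Errorf("first listen failed unexpectedly: %w", err)
	}
	defer first.Close()

	second, err := net.Listen("tcp", first.Addr().String())
	if err == nil {
		second.Close()
		return nil
	}
	return err
}

func main() {
	err := bindTwice()
	fmt.Println("second listen error:", err)
	// errors.Is walks the wrapped error chain down to the syscall errno.
	fmt.Println("is EADDRINUSE:", errors.Is(err, syscall.EADDRINUSE))
}
```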

The bind addresses can be configured through fleet-agent environment variables. However, there are two issues:

  1. The FleetAddonConfig is embedded in the rancher-turtles-providers chart (and previously it was embedded in the turtles one)
  2. CAAPF does not allow changing the configuration per Cluster (see "Allow different agent configurations per Cluster", cluster-api-addon-provider-fleet#428)

So I see no other way than changing this for all Clusters and all rancher-turtles-providers chart users.
This is an opinionated choice. However, since we also use the hostNetwork setting, binding to 18080 and 18081 is probably safer in most cases.
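Because a hostNetwork pod shares the node's port space, a quick way to check which of these ports are already taken is to try binding them on the node. The following is a rough Go sketch with hypothetical helpers (`portFree`, `busyPorts`); it is not part of this change, and the result is only a snapshot of the moment it runs.

```go
package main

import (
	"fmt"
	"net"
)

// portFree reports whether a TCP listener can currently be opened on the
// given port on all interfaces. The listener is closed immediately, so
// another process may still take the port afterwards.
func portFree(port int) bool {
	l, err := net.Listen("tcp", fmt.Sprintf(":%d", port))
	if err != nil {
		return false
	}
	l.Close()
	return true
}

// busyPorts returns the subset of candidates that cannot be bound right now.
func busyPorts(candidates []int) []int {
	var busy []int
	for _, p := range candidates {
		if !portFree(p) {
			busy = append(busy, p)
		}
	}
	return busy
}

func main() {
	// Old (8080, 8081) and new (18080, 18081) ports mentioned in this PR.
	fmt.Println("busy ports:", busyPorts([]int{8080, 8081, 18080, 18081}))
}
```

Run on a node (or in a hostNetwork debug pod), this lists the ports that would make the agent fail at startup.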

As a consequence, the fleet-agent will be rolled out again on already provisioned Clusters so that it binds to the new ports, which will be an unexpected change for current users.
Chart configuration values have been added so that users can revert to 8080 and 8081 if they wish.
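The intended precedence (a user-supplied value wins, otherwise the new default applies) can be modeled with a short Go sketch. This only illustrates the idea and is neither the chart's nor fleet-agent's implementation. The environment variable names and the `pickAddress` helper are hypothetical; only the port numbers come from this PR.

```go
package main

import (
	"fmt"
	"os"
)

// Hypothetical variable names, used only for this illustration. They are not
// the names read by fleet-agent or set by the chart.
const (
	metricsEnv = "EXAMPLE_METRICS_BIND_ADDRESS"
	healthEnv  = "EXAMPLE_HEALTH_BIND_ADDRESS"
)

// New defaults proposed in this PR.
const (
	defaultMetricsAddr = ":18080"
	defaultHealthAddr  = ":18081"
)

// pickAddress returns the user-supplied override when it is non-empty and
// falls back to the default otherwise.
func pickAddress(override, fallback string) string {
	if override != "" {
		return override
	}
	return fallback
}

func main() {
	metrics := pickAddress(os.Getenv(metricsEnv), defaultMetricsAddr)
	health := pickAddress(os.Getenv(healthEnv), defaultHealthAddr)
	fmt.Println("metrics bind address:", metrics)
	fmt.Println("health bind address:", health)
}
```

With neither variable set, the sketch resolves to the new ports; setting the hypothetical variables to `:8080` and `:8081` restores the previous behavior.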

Test run that includes this change: https://github.com/rancher/turtles/actions/runs/22565380279/job/65360516383

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Checklist:

  • squashed commits into logical changes
  • includes documentation
  • adds unit tests
  • adds or updates e2e tests

Signed-off-by: Andrea Mazzotti <andrea.mazzotti@suse.com>
@anmazzotti anmazzotti requested a review from a team as a code owner March 2, 2026 08:26
@anmazzotti anmazzotti marked this pull request as draft March 2, 2026 08:27
@anmazzotti anmazzotti self-assigned this Mar 2, 2026
@anmazzotti anmazzotti added kind/bug Something isn't working area/fleet labels Mar 2, 2026
@anmazzotti anmazzotti moved this to In Progress (8 max) in CAPI / Turtles Mar 2, 2026
@anmazzotti anmazzotti moved this from In Progress (8 max) to PR to be reviewed in CAPI / Turtles Mar 2, 2026
@anmazzotti anmazzotti marked this pull request as ready for review March 2, 2026 11:13

@yiannistri yiannistri left a comment

Tested with

 helm template rancher-turtles-providers

and

 helm template rancher-turtles-providers --set extras.addonFleet.config.enabled=false

Thank you!

@salasberryfin salasberryfin merged commit 23f6794 into rancher:main Mar 3, 2026
21 of 38 checks passed
@github-project-automation github-project-automation bot moved this from PR to be reviewed to Done in CAPI / Turtles Mar 3, 2026
Labels

area/fleet kind/bug Something isn't working

3 participants