
Conversation

@balakumarpg commented Nov 10, 2025

Why:
Kubernetes ConfigMaps have a hard size limit of 1MiB. The rule-generator writes all GlobalRules, ClusterRules, and Rules into a single ConfigMap, which is then mounted as a volume into the rule-evaluator. As a result, the total size of all alert definitions in a GMP-enabled GKE cluster cannot exceed 1MB.

This change overcomes that hard limit.

… config maps are created with a max size of 950KB, and the rule evaluator loads all ConfigMaps into files for the rules generator
google-cla bot commented Nov 10, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up-to-date status, view the checks section at the bottom of the pull request.

@bwplotka (Collaborator)

Thanks!

So this makes sense on the operator side, but how should rule-eval consume this? rule-eval currently only reads one ConfigMap, and there's no easy way (or is there?) to tell the rule-eval pod to load a dynamic number of those files? 🤔

We can't change the Deployment dynamically in practice within the security constraints we need to work with for the managed GMP solution at the moment. That would be the only solution, right?

Create one ConfigMap per rule type (rules, clusterrules, globalrules) to work
around the 1MB Kubernetes ConfigMap size limit. Each type stores all resources
of that type in a single ConfigMap with retry logic and error tracking.

Changes:
- Implement one ConfigMap per type approach (rules, clusterrules, globalrules)
- Add retry logic with exponential backoff for ConfigMap operations
- Update deployment to mount 3 ConfigMaps via projected volumes
- Fix all Dockerfiles to use awk instead of yq for version extraction
- Add comprehensive tests for ConfigMap creation and recovery
- Update documentation with architecture details

Total capacity increased from 1MB to 3MB (1MB per type).
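The projected-volume mount described above can be sketched roughly as follows. This is an illustration only; the volume and ConfigMap names (`rules-generated`, `clusterrules-generated`, `globalrules-generated`) are assumptions, not necessarily those used in the PR:

```yaml
# Sketch: mount the three per-type ConfigMaps as one projected volume,
# so rule-evaluator still sees a single directory of rule files.
# ConfigMap names below are illustrative assumptions.
volumes:
  - name: rules
    projected:
      sources:
        - configMap:
            name: rules-generated
        - configMap:
            name: clusterrules-generated
        - configMap:
            name: globalrules-generated
containers:
  - name: evaluator
    volumeMounts:
      - name: rules
        mountPath: /etc/rules
        readOnly: true
```

With a projected volume, the number of mounted files can grow per ConfigMap without changing how the evaluator discovers them, since all keys land in one mount path.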
@balakumarpg force-pushed the fix-rule-evaluator-config-map-size-issue branch from 9dfd927 to e873885 on November 23, 2025 at 22:06
@balakumarpg (Author)

> Thanks!
>
> So this makes sense on the operator side, but how should rule-eval consume this? rule-eval currently only reads one ConfigMap, and there's no easy way (or is there?) to tell the rule-eval pod to load a dynamic number of those files? 🤔
>
> We can't change the Deployment dynamically in practice within the security constraints we need to work with for the managed GMP solution at the moment. That would be the only solution, right?

Thanks for the valuable input. Please check now; I have made some changes based on your comments.

@bwplotka (Collaborator)

Nice, it sounds like the plan would be to do projections and split by at least 3. Does that solve your use case? Do you have a more or less equal distribution of rules across those types?

We could then do a projection of 10 and split up to 10? Would that be reasonable?

@bwplotka (Collaborator)

Also before we add some complexity, have you tried compression option? https://github.com/GoogleCloudPlatform/prometheus-engine/blob/main/doc/api.md#monitoring.googleapis.com/v1.ConfigSpec
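For context, the compression option is configured on the OperatorConfig. Based on the linked ConfigSpec documentation, enabling it should look roughly like this (field layout as I understand the linked API doc; treat this as a sketch rather than a verified manifest):

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: OperatorConfig
metadata:
  namespace: gmp-public
  name: config
features:
  config:
    # gzip-compress the generated configuration contents
    compression: gzip
```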

@balakumarpg (Author)

> Also before we add some complexity, have you tried compression option? https://github.com/GoogleCloudPlatform/prometheus-engine/blob/main/doc/api.md#monitoring.googleapis.com/v1.ConfigSpec

Yes; even after compression it is 1.3 MB, and that is only the GlobalRules.

@balakumarpg (Author)

> Nice, it sounds like the plan would be to do projections and split by at least 3. Does that solve your use case? Do you have a more or less equal distribution of rules across those types?
>
> We could then do a projection of 10 and split up to 10? Would that be reasonable?

They are not equally distributed. We currently use only GlobalRules, but we can use Rules and ClusterRules as well. If the 1MB problem is solved, or the limit is extended to 3MB, we can live with that for a while and distribute our alerts across these three types.

Fix all 8 golangci-lint violations and manifest comment format:
- Add periods to comments (godot)
- Use integer range for Go 1.22+ (intrange)
- Use t.Context() in tests instead of context.Background() (usetesting)
- Update manifest comments to match regeneration format

Files changed:
- pkg/operator/rules.go: Add periods to function comments, use range loop
- pkg/operator/rules_test.go: Add period to type comment, use t.Context()
- manifests/rule-evaluator.yaml: Update ConfigMap mount comments
Restore status update logic that was accidentally removed during the
ConfigMap-per-type refactoring. This ensures Rule/ClusterRules/GlobalRules
objects have their MonitoringStatus properly updated with success/failure
conditions.

Also fix golangci-lint errors and update tests to use newFakeClientBuilder
for proper status subresource support.

Changes:
- pkg/operator/rules.go: Restore status tracking and patchMonitoringStatus calls
- pkg/operator/rules.go: Fix linter errors (godot, intrange, usetesting)
- pkg/operator/rules_test.go: Use newFakeClientBuilder in new tests
- pkg/operator/rules_test.go: Fix linter errors, remove unused imports
- manifests/rule-evaluator.yaml: Update ConfigMap mount comments

Fixes failing TestRulesStatus and TestEnsureRuleConfigs tests.
All tests now pass successfully.
Update e2e tests to work with the new ConfigMap-per-type architecture.
The rules are now split into three ConfigMaps (rules, clusterrules,
globalrules) instead of a single rules-generated ConfigMap.

Changes:
- e2e/ruler_test.go: Aggregate data from all three ConfigMaps in test
- pkg/operator/rules.go: Update controller to watch all three ConfigMaps
- pkg/operator/rules.go: Rename constants and update predicates

This fixes the TestAlertmanager/rules-create timeout failure.

Also includes previous fixes for status updates, linter errors, and
manifest formatting.
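Aggregating data from all three ConfigMaps, as the updated e2e test does, amounts to merging their `data` sections into one view. A hypothetical sketch (plain maps standing in for the ConfigMap objects; not the PR's actual test code):

```go
package main

import (
	"fmt"
	"sort"
)

// mergeRuleFiles combines the data sections of the per-type ConfigMaps
// (rules, clusterrules, globalrules) into a single map, with entries
// from later maps overwriting earlier ones on key collisions.
func mergeRuleFiles(maps ...map[string]string) map[string]string {
	out := map[string]string{}
	for _, m := range maps {
		for k, v := range m {
			out[k] = v
		}
	}
	return out
}

func main() {
	rules := map[string]string{"rules.yaml": "groups: []"}
	cluster := map[string]string{"clusterrules.yaml": "groups: []"}
	global := map[string]string{"globalrules.yaml": "groups: []"}

	merged := mergeRuleFiles(rules, cluster, global)
	keys := make([]string, 0, len(merged))
	for k := range merged {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	fmt.Println(keys) // [clusterrules.yaml globalrules.yaml rules.yaml]
}
```

Because the projected volume already places all three ConfigMaps' keys in one directory, the test only needs this merged view to assert on the complete set of generated rule files.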