Split ConfigMaps by size at creation, so that multiple ConfigMaps are created with a max size of 950KB each and the rule-evaluator loads all ConfigMaps to file for the rules generator #1801
base: main
Conversation
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Thanks! This makes sense on the operator side, but how should rule-eval consume this? rule-eval currently only reads one ConfigMap, and there's no easy way (or is there?) to tell the rule-eval pod to load a dynamic number of those files 🤔 We can't change the deployment dynamically in practice within the security constraints we need to work with for the managed GMP solution at the moment. That would be the only solution, right?
Create one ConfigMap per rule type (rules, clusterrules, globalrules) to work around the 1MB Kubernetes ConfigMap size limit. Each type stores all resources of that type in a single ConfigMap with retry logic and error tracking.

Changes:
- Implement one ConfigMap per type approach (rules, clusterrules, globalrules)
- Add retry logic with exponential backoff for ConfigMap operations
- Update deployment to mount 3 ConfigMaps via projected volumes
- Fix all Dockerfiles to use awk instead of yq for version extraction
- Add comprehensive tests for ConfigMap creation and recovery
- Update documentation with architecture details

Total capacity increased from 1MB to 3MB (1MB per type).
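For illustration only, here is a minimal Go sketch of the retry-with-exponential-backoff behavior that commit message describes, using client-go. The helper name, ConfigMap layout, and backoff values are assumptions, not the PR's actual code:

```go
// Hypothetical sketch, not the actual operator code: create-or-update one
// per-type rules ConfigMap, retrying transient API errors with exponential
// backoff. All names below are assumptions for illustration.
package rules

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// ensureRulesConfigMap writes the generated rule files for one rule type
// (e.g. rules, clusterrules, globalrules) into its own ConfigMap.
func ensureRulesConfigMap(ctx context.Context, c kubernetes.Interface, namespace, name string, data map[string]string) error {
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		Data:       data,
	}
	backoff := wait.Backoff{Duration: 200 * time.Millisecond, Factor: 2, Jitter: 0.1, Steps: 5}
	return retry.OnError(backoff, func(err error) bool {
		// Retry conflicts and transient server-side errors; fail fast otherwise.
		return apierrors.IsConflict(err) || apierrors.IsServerTimeout(err) || apierrors.IsTooManyRequests(err)
	}, func() error {
		_, err := c.CoreV1().ConfigMaps(namespace).Update(ctx, cm, metav1.UpdateOptions{})
		if apierrors.IsNotFound(err) {
			_, err = c.CoreV1().ConfigMaps(namespace).Create(ctx, cm, metav1.CreateOptions{})
		}
		return err
	})
}
```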
Force-pushed from 9dfd927 to e873885
Thanks for the valuable input. Please check now; I have made some changes based on your comments.
Nice, sounds like the plan would be to do projections and split by 3 at least. Does that solve your use case? Do you have a more or less equal distribution of rules across those types? We could do a projection of 10 and split up to 10 then? Would that be reasonable?
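A minimal sketch, assuming hypothetical ConfigMap names, of how such a projection can be built when constructing the rule-evaluator pod spec in Go: however many ConfigMaps sit behind it, the evaluator keeps mounting a single directory, so splitting by 3 or by 10 does not change the deployment shape.

```go
// Hypothetical sketch: one projected volume merging several generated rules
// ConfigMaps into a single directory for the rule-evaluator. The ConfigMap
// names below are placeholders, not the manifest's actual values.
package rules

import corev1 "k8s.io/api/core/v1"

func rulesVolume(configMapNames []string) corev1.Volume {
	var sources []corev1.VolumeProjection
	for _, name := range configMapNames {
		sources = append(sources, corev1.VolumeProjection{
			ConfigMap: &corev1.ConfigMapProjection{
				LocalObjectReference: corev1.LocalObjectReference{Name: name},
			},
		})
	}
	return corev1.Volume{
		Name: "rules",
		VolumeSource: corev1.VolumeSource{
			Projected: &corev1.ProjectedVolumeSource{Sources: sources},
		},
	}
}

// Example: rulesVolume([]string{"rules-generated", "clusterrules-generated", "globalrules-generated"})
```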
Also, before we add some complexity, have you tried the compression option? https://github.com/GoogleCloudPlatform/prometheus-engine/blob/main/doc/api.md#monitoring.googleapis.com/v1.ConfigSpec
Yes, after compression it is 1.3 MB for the GlobalRules alone.
Not equally distributed. We are only using GlobalRules, but we can use Rules and ClusterRules as well. If the 1MB limit is extended to 3MB, we can live with that for a while and design our alerts to be distributed across these three types.
# |<---- Max 80 chars ---->|
#
# Types: build, chore, ci, docs, perf, refactor, revert, style, test
# Scopes: configs, deps, e2e, export, main, operator, prometheus,
# frontend, datasource-syncer, config-reloader, rule-evaluator
#
# Rules:
# - Use lowercase
# - Use imperative mood ("add" not "adding")
# - No period at end of header
# - Body wraps at 72 chars
#
# Example:
# refactor(operator): split rules into separate configmaps per type
#
# Create one ConfigMap per rule type (rules, clusterrules, globalrules)
# to work around the 1MB Kubernetes ConfigMap size limit.
#
# - Implement one ConfigMap per type approach
# - Add retry logic with exponential backoff
# - Update deployment to mount ConfigMaps
# - Fix the linting issues
Fix all 8 golangci-lint violations and manifest comment format:
- Add periods to comments (godot)
- Use integer range for Go 1.22+ (intrange)
- Use t.Context() in tests instead of context.Background() (usetesting)
- Update manifest comments to match regeneration format

Files changed:
- pkg/operator/rules.go: Add periods to function comments, use range loop
- pkg/operator/rules_test.go: Add period to type comment, use t.Context()
- manifests/rule-evaluator.yaml: Update ConfigMap mount comments
Restore status update logic that was accidentally removed during the ConfigMap-per-type refactoring. This ensures Rule/ClusterRules/GlobalRules objects have their MonitoringStatus properly updated with success/failure conditions. Also fix golangci-lint errors and update tests to use newFakeClientBuilder for proper status subresource support.

Changes:
- pkg/operator/rules.go: Restore status tracking and patchMonitoringStatus calls
- pkg/operator/rules.go: Fix linter errors (godot, intrange, usetesting)
- pkg/operator/rules_test.go: Use newFakeClientBuilder in new tests
- pkg/operator/rules_test.go: Fix linter errors, remove unused imports
- manifests/rule-evaluator.yaml: Update ConfigMap mount comments

Fixes failing TestRulesStatus and TestEnsureRuleConfigs tests. All tests now pass successfully.
Update e2e tests to work with the new ConfigMap-per-type architecture. The rules are now split into three ConfigMaps (rules, clusterrules, globalrules) instead of a single rules-generated ConfigMap.

Changes:
- e2e/ruler_test.go: Aggregate data from all three ConfigMaps in test
- pkg/operator/rules.go: Update controller to watch all three ConfigMaps
- pkg/operator/rules.go: Rename constants and update predicates

This fixes the TestAlertmanager/rules-create timeout failure. Also includes previous fixes for status updates, linter errors, and manifest formatting.
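A rough sketch of the aggregation that e2e change describes, with assumed ConfigMap names and helper signature (the repo's actual test uses its own helpers): merge the Data of the three generated ConfigMaps so existing assertions can keep working against one combined map.

```go
// Hypothetical sketch: combine the Data of all generated rules ConfigMaps so a
// test can assert on the full set of rule files, regardless of which ConfigMap
// each file landed in.
package e2e

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func aggregateRuleData(ctx context.Context, c kubernetes.Interface, namespace string, names []string) (map[string]string, error) {
	combined := map[string]string{}
	for _, name := range names {
		cm, err := c.CoreV1().ConfigMaps(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return nil, err
		}
		for k, v := range cm.Data {
			combined[k] = v
		}
	}
	return combined, nil
}
```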
Why:
Kubernetes ConfigMaps have a hard size limit of 1MB. When more GlobalRules, ClusterRules, or Rules need to be created, the rules generator writes them all into a single ConfigMap, which is then mounted as a volume into the rule-evaluator. As a result, the alert definitions in a GMP-enabled GKE cluster cannot exceed 1MB in total.
This change will overcome that hard limit.
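For context, the limit referred to here is the roughly 1MiB cap on a ConfigMap's total payload. The sketch below is purely illustrative (none of these names exist in the repo) and shows where a generator could detect that the serialized rule files are approaching a safety threshold such as the 950KB mentioned in the PR title.

```go
// Illustrative only: estimate the payload size of generated rule files and
// flag when it approaches the ConfigMap size limit.
package rules

import "fmt"

// maxConfigMapPayload is an assumed safety threshold below the ~1MiB hard limit.
const maxConfigMapPayload = 950 * 1024

func payloadSize(data map[string]string) int {
	total := 0
	for k, v := range data {
		total += len(k) + len(v)
	}
	return total
}

func checkRulePayload(data map[string]string) error {
	if size := payloadSize(data); size > maxConfigMapPayload {
		return fmt.Errorf("generated rule files are %d bytes, over the %d byte threshold: split across more ConfigMaps or enable compression", size, maxConfigMapPayload)
	}
	return nil
}
```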