# Dynamic Node Manager
A Python service that watches Kubernetes namespaces, estimates YuniKorn queue demand (running + pending pods), and dynamically converts Slurm batch nodes into Kubernetes nodes when demand exceeds configured queue headroom or when bin-pack deadlocks are detected. When nodes go idle long enough, it reverts them back to Slurm and reduces YuniKorn queue capacity accordingly.
This tool is designed for environments where:
- Slurm/batch nodes can be switched into a Kubernetes pool via an external `node-convert` script.
- YuniKorn is used for queue management and its capacities are controlled through a `queues.yaml` stored in a ConfigMap.
You configure a set of `monitored_namespaces`. For each namespace, you also configure a corresponding YuniKorn `queue_path`.
For each namespace’s queue path, it:
- Lists pods in that namespace.
- Filters pods to those assigned to the queue path (by annotation/label).
- Sums effective CPU/memory for:
- Running pods (counted as “used”)
- Pending pods (counted as “pending demand”)
- Applies default requests for Pending pods that declare no resources (so “invisible” pending work still triggers scaling).
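A minimal sketch of that per-namespace estimate, assuming the `kubernetes` Python client and the default queue annotation key; the helper below is illustrative, and the default-request fallback for requestless Pending pods is shown separately further down.

```python
from kubernetes import client
from kubernetes.utils.quantity import parse_quantity

QUEUE_KEY = "yunikorn.apache.org/queue"  # configurable via yunikorn_queue_annotation

def estimate_queue_demand(v1: client.CoreV1Api, namespace: str, queue_path: str):
    """Return (used_cpu_m, used_mem, pend_cpu_m, pend_mem) for one queue path."""
    used_cpu_m = used_mem = pend_cpu_m = pend_mem = 0
    for pod in v1.list_namespaced_pod(namespace).items:
        meta = pod.metadata
        assigned = (meta.annotations or {}).get(QUEUE_KEY) or (meta.labels or {}).get(QUEUE_KEY)
        if assigned != queue_path:
            continue
        cpu_m = mem = 0
        for c in pod.spec.containers:
            req = (c.resources.requests or {}) if c.resources else {}
            cpu_m += int(parse_quantity(req.get("cpu", "0")) * 1000)   # milliCPU
            mem += int(parse_quantity(req.get("memory", "0")))         # bytes
        if pod.status.phase == "Running":
            used_cpu_m, used_mem = used_cpu_m + cpu_m, used_mem + mem
        elif pod.status.phase == "Pending":
            pend_cpu_m, pend_mem = pend_cpu_m + cpu_m, pend_mem + mem
    return used_cpu_m, used_mem, pend_cpu_m, pend_mem
```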
Conversion is triggered when any of these is true:
- Total demand (running + pending) exceeds queue headroom for CPU or memory.
- Pending demand exceeds available capacity (after subtracting running usage).
- A “bin-pack blocked” condition is detected (see below).
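A minimal sketch of that decision, assuming the queue limits and the demand totals above are already computed (milliCPU/bytes); names are illustrative, and in this simplified form the first two checks coincide even though the real tool may compare against different baselines.

```python
def needs_conversion(max_cpu_m, max_mem,          # queue limits from queues.yaml
                     used_cpu_m, used_mem,        # running usage
                     pend_cpu_m, pend_mem,        # pending demand
                     binpack_blocked=False) -> bool:
    # 1) Total demand (running + pending) exceeds the queue headroom.
    if used_cpu_m + pend_cpu_m > max_cpu_m or used_mem + pend_mem > max_mem:
        return True
    # 2) Pending demand exceeds what is left after running usage.
    if pend_cpu_m > max_cpu_m - used_cpu_m or pend_mem > max_mem - used_mem:
        return True
    # 3) A bin-pack blocked condition was detected (see below).
    return binpack_blocked
```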
When scaling out is required, the actuator:
- Runs `/usr/site/rcac/sbin/node-convert --set k8s ... --namespace <ns>`
- Waits for new node(s) to appear in the Kubernetes API
- Records converted nodes in a local JSON file
- Increases YuniKorn queue capacity by `(nodes_added * node_cpu_capacity, nodes_added * node_memory_capacity)`, only after the nodes actually join the cluster
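A rough sketch of that sequence, assuming shell access to the converter; the extra `node-convert` arguments elided above (`...`) are site-specific and omitted here, and the registry/YuniKorn updates are left to the caller.

```python
import subprocess
import time

from kubernetes import client

NODE_CONVERT = "/usr/site/rcac/sbin/node-convert"

def convert_node_to_k8s(v1: client.CoreV1Api, namespace: str, timeout_s: int = 600):
    """Run the converter for one node and wait for it to join the cluster."""
    before = {n.metadata.name for n in v1.list_node().items}
    subprocess.run([NODE_CONVERT, "--set", "k8s", "--namespace", namespace], check=True)
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        new_nodes = {n.metadata.name for n in v1.list_node().items} - before
        if new_nodes:
            # Caller records these in the converted-node registry and only then
            # bumps the YuniKorn queue by nodes_added * node_{cpu,memory}_capacity.
            return sorted(new_nodes)
        time.sleep(10)
    raise TimeoutError("converted node did not appear in the Kubernetes API")
```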
Periodically checks converted nodes:
- If a converted node has no active pods in its namespace, and it has been idle longer than `reversion_idle_seconds`, the tool:
  - runs `/usr/site/rcac/sbin/node-convert --set batch --node-name <node>`
  - waits for the node to disappear from the K8s API
  - decreases YuniKorn queue capacity accordingly
  - removes it from the converted node registry
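A rough sketch of the idle check and revert, assuming the caller tracks an idle-since timestamp per converted node; the function name and structure are illustrative.

```python
import subprocess
import time

from kubernetes import client

def maybe_revert(v1: client.CoreV1Api, node_name: str, namespace: str,
                 idle_since: float, reversion_idle_seconds: int) -> bool:
    """Revert one converted node to batch if it has been idle long enough."""
    pods = v1.list_namespaced_pod(
        namespace, field_selector=f"spec.nodeName={node_name}").items
    active = [p for p in pods if p.status.phase in ("Running", "Pending")]
    if active or time.time() - idle_since < reversion_idle_seconds:
        return False
    subprocess.run(["/usr/site/rcac/sbin/node-convert",
                    "--set", "batch", "--node-name", node_name], check=True)
    # Caller then waits for the node to leave the API, shrinks the YuniKorn
    # queue, and drops the node from the converted-node registry.
    return True
```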
This tool treats YuniKorn as the source of truth for “how much capacity a queue is allowed to consume”.
It:
- Reads `queues.yaml` (ConfigMap: `yunikorn-configs` by default)
- Locates the queue by `queue_path` (a dot-separated path like `root.platform.services.dev`)
- Adjusts:
  - `resources.max.{vcore,memory}`
  - `resources.guaranteed.{vcore,memory}`
- Ensures `guaranteed <= max` after changes
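A rough sketch of that ConfigMap edit, assuming the standard YuniKorn `partitions`/`queues` layout inside `queues.yaml`; unit handling and the `guaranteed <= max` clamp are simplified here.

```python
import yaml
from kubernetes import client
from kubernetes.utils.quantity import parse_quantity

def bump_queue_capacity(v1: client.CoreV1Api, queue_path: str,
                        add_vcore: int, add_memory: str = "256G",
                        cm_name: str = "yunikorn-configs", cm_ns: str = "yunikorn"):
    cm = v1.read_namespaced_config_map(cm_name, cm_ns)
    conf = yaml.safe_load(cm.data["queues.yaml"])
    # Walk the dot-separated path, e.g. root.platform.services.dev.
    parts = queue_path.split(".")
    node = next(q for q in conf["partitions"][0]["queues"] if q["name"] == parts[0])
    for name in parts[1:]:
        node = next(q for q in node["queues"] if q["name"] == name)
    res = node.setdefault("resources", {})
    for section in ("max", "guaranteed"):
        sec = res.setdefault(section, {})
        sec["vcore"] = int(sec.get("vcore", 0)) + add_vcore
        new_mem = parse_quantity(str(sec.get("memory", "0"))) + parse_quantity(add_memory)
        sec["memory"] = str(int(new_mem))  # real tool keeps units consistent and guaranteed <= max
    cm.data["queues.yaml"] = yaml.safe_dump(conf)
    v1.patch_namespaced_config_map(cm_name, cm_ns, cm)
```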
If a pod is Pending and has no declared requests/limits, this tool assigns defaults:
- CPU: `default_cpu_request_m` (milliCPU)
- Memory: `default_mem_request` (parsed via `kubernetes.utils.quantity.parse_quantity`)
This is critical to avoid false “no demand” conditions.
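A minimal sketch of that fallback; the helper is illustrative, but the config keys match the ones listed below.

```python
from kubernetes.utils.quantity import parse_quantity

def effective_request(declared_cpu_m: int, declared_mem_bytes: int,
                      default_cpu_request_m: int = 1000,
                      default_mem_request: str = "2Gi"):
    """Substitute configured defaults when a Pending pod declares no resources."""
    cpu_m = declared_cpu_m or default_cpu_request_m
    mem = declared_mem_bytes or int(parse_quantity(default_mem_request))
    return cpu_m, mem
```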
A common failure mode: you have available queue headroom on paper, but the cluster cannot schedule the next pending pod because it needs a fresh node (fragmentation/binpack).
This tool estimates:
- `max_pend_cpu`, `max_pend_mem`: largest pending pod “shape”
- `pods_per_node`: how many of that shape fit on a node
- `running_pods_est`: rough estimate of how many pods are already consuming the queue
If it detects a “perfectly filled” pattern with pending work waiting, it triggers conversion for at least one node.
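A minimal sketch of one such heuristic, assuming per-node capacity and the largest pending pod shape are known; the exact rule in the tool may differ.

```python
def binpack_blocked(max_pend_cpu_m, max_pend_mem,
                    node_cpu_m, node_mem,
                    running_pods_est, pending_pod_count) -> bool:
    """Detect a 'perfectly filled' pattern that leaves no room for pending work."""
    if not pending_pod_count or max_pend_cpu_m <= 0 or max_pend_mem <= 0:
        return False
    # How many pods of the largest pending shape fit on one node.
    pods_per_node = int(min(node_cpu_m // max_pend_cpu_m, node_mem // max_pend_mem))
    if pods_per_node == 0:
        return True  # the largest pending pod needs a fresh (empty) node
    # Running pods pack exactly into whole nodes, leaving no slack for the
    # next pending pod, so at least one node conversion is triggered.
    return running_pods_est > 0 and running_pods_est % pods_per_node == 0
```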
The service runs two daemon threads:
Evaluate loop, runs every `check_interval_seconds`:
- `trigger()`: computes needs per namespace and enqueues conversion tasks into a global queue.
- `check_converted_nodes()`: finds idle converted nodes and reverts them.
Actuator loop, runs continuously (see the sketch after this list):
- Checks global conversion cap
- Enforces cooldown
- Pops tasks from the global queue
- Performs node conversion (1 node per task)
- Updates YuniKorn capacity after nodes join
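A minimal sketch of this thread layout; `trigger()` and `check_converted_nodes()` are the functions named above, while `handle_task()` stands in for the actuator body (cap, cooldown, conversion, capacity bump).

```python
import queue
import threading
import time

conversion_queue: "queue.Queue[dict]" = queue.Queue()

def evaluate_loop(trigger, check_converted_nodes, check_interval_seconds: int):
    while True:
        trigger()                 # compute needs and enqueue conversion tasks
        check_converted_nodes()   # revert idle converted nodes
        time.sleep(check_interval_seconds)

def actuator_loop(handle_task):
    while True:
        task = conversion_queue.get()   # blocks until a conversion task arrives
        handle_task(task)               # cap + cooldown checks, convert one node, bump queue

def start_monitor(trigger, check_converted_nodes, handle_task, interval: int = 60):
    threading.Thread(target=evaluate_loop,
                     args=(trigger, check_converted_nodes, interval),
                     daemon=True).start()
    threading.Thread(target=actuator_loop, args=(handle_task,), daemon=True).start()
```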
Dependencies:
- `kubernetes` Python client
- `pyyaml`

Files:
- `/etc/dynamic-node/dynamic_node_config.ini`: main configuration.
- `/etc/dynamic-node/namespace_queue_paths.json`: map of namespace → YuniKorn queue path.
- `/usr/site/rcac/sbin/node-convert`: external script used to convert nodes between Slurm batch and Kubernetes.
Expected keys (section [settings]):
- `kubeconfig_path` (optional): if set, loads kubeconfig from this path; otherwise uses default kubeconfig loading behavior.
- `monitored_namespaces`: comma-separated namespaces to watch.
- `reversion_idle_seconds`: how long a converted node must be idle (no active pods) before revert.
- `check_interval_seconds`: evaluate loop interval.
- `max_converted_nodes`: global cap on how many nodes may be converted at any time.
- `yunikorn_cm_namespace` (default: `yunikorn`): namespace containing the YuniKorn ConfigMap.
- `yunikorn_cm_name` (default: `yunikorn-configs`): name of the YuniKorn ConfigMap containing `queues.yaml`.
- `yunikorn_queue_annotation` (default: `yunikorn.apache.org/queue`): key used to identify queue assignment on pods (annotation or label).
- `node_cpu_capacity` (default: `128`): vcores to add/subtract per converted node.
- `node_memory_capacity` (default: `256G`): memory to add/subtract per converted node.
- `global_cooldown_seconds` (default: `30`): cooldown between conversions (note: present but not currently wired to advance `next_conversion_allowed_at`).
- `default_cpu_request_m` (default: `1000`): used only when pending pods have no resource requests/limits.
- `default_mem_request` (default: `2Gi`): used only when pending pods have no resource requests/limits.
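An illustrative `/etc/dynamic-node/dynamic_node_config.ini`, using the documented defaults where they exist and made-up values elsewhere:

```ini
[settings]
kubeconfig_path = /etc/dynamic-node/kubeconfig
monitored_namespaces = workloads-dev,workloads-prod
reversion_idle_seconds = 900
check_interval_seconds = 60
max_converted_nodes = 4
yunikorn_cm_namespace = yunikorn
yunikorn_cm_name = yunikorn-configs
yunikorn_queue_annotation = yunikorn.apache.org/queue
node_cpu_capacity = 128
node_memory_capacity = 256G
global_cooldown_seconds = 30
default_cpu_request_m = 1000
default_mem_request = 2Gi
```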
A JSON object mapping each monitored namespace to a YuniKorn queue path:
```json
{
  "workloads-dev": "root.platform.services.dev",
  "workloads-prod": "root.platform.services.prod"
}
```

This service supports two modes:
- monitor: run the continuous evaluate + actuator loops (normal operation)
- test: create a synthetic load Deployment in a namespace/queue (validation)
- A valid kubeconfig accessible to the process (either:
  - set `kubeconfig_path` in `/etc/dynamic-node/dynamic_node_config.ini`, or
  - rely on default kubeconfig loading via `KUBECONFIG` / `~/.kube/config`)
- `/etc/dynamic-node/dynamic_node_config.ini` is present and readable
- `/etc/dynamic-node/namespace_queue_paths.json` is present and contains queue mappings for all monitored namespaces
- The process identity can:
  - list nodes and pods
  - read/patch the YuniKorn ConfigMap containing `queues.yaml`
  - (test mode) create/patch/delete Deployments in the target namespace
- The external converter exists and is executable: `/usr/site/rcac/sbin/node-convert`
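As a rough guide (an assumption, not something shipped with the tool), the permissions above map to RBAC rules along these lines:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dynamic-node-manager
rules:
  - apiGroups: [""]
    resources: ["nodes", "pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["configmaps"]          # the YuniKorn queues.yaml ConfigMap
    verbs: ["get", "patch"]
  - apiGroups: ["apps"]
    resources: ["deployments"]         # test mode only
    verbs: ["get", "list", "create", "patch", "delete"]
```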
Monitor mode starts two background threads:
- Evaluate loop: periodically computes demand and enqueues conversions; checks for idle nodes to revert
- Actuator loop: drains the conversion queue and performs `node-convert` actions

```bash
python3 dynamic-node-manager.py --mode monitor
```

Test mode creates a Deployment that schedules pods annotated to a specific YuniKorn queue. Each pod runs a long sleep to hold resources.
```bash
python3 dynamic-node-manager.py \
  --mode test \
  --namespace workloads-dev \
  --queue root.platform.services.dev \
  --replicas 200
```

- Verify the manager is logging demand decisions and conversion actions via syslog.
- Confirm YuniKorn `queues.yaml` is being updated (capacity changes show as MAX/GUAR bumps in logs).
- Confirm nodes join/leave the Kubernetes API after `node-convert` runs.
- Confirm `/etc/dynamic-node/converted_nodes.json` is being updated as nodes are converted and reverted.