diff --git a/README.md b/README.md index bb4b11e7..526c8f37 100644 --- a/README.md +++ b/README.md @@ -15,12 +15,14 @@ Slurm is a highly configurable open source workload manager. See the [Slurm proj 7. [Slurm Job Accounting](#slurm-job-accounting) 1. [Cost Reporting](#cost-reporting) 8. [Topology](#topology) - 9. [GB200/GB300 IMEX Support](#gb200gb300-imex-support) + 9. [GB200/GB300 IMEX Support](#gb200gb300-imex-support) 10. [Setting KeepAlive in CycleCloud](#setting-keepalive) 11. [Slurmrestd](#slurmrestd) 12. [Node Health Checks](#node-health-checks) 13. [Monitoring](#monitoring) - 1. [Example Dashboards](#example-dashboards) + 1. [AzSlurm Exporter](#azslurm-exporter) + 1. [Exported Metrics](#exported-metrics) + 2. [Example Dashboards](#example-dashboards) 2. [Supported Slurm and PMIX versions](#supported-slurm-and-pmix-versions) 3. [Packaging](#packaging) 1. [Supported OS and PMC Repos](#supported-os-and-pmc-repos) @@ -40,27 +42,27 @@ Slurm is a highly configurable open source workload manager. See the [Slurm proj ### Making Cluster Changes In CycleCloud, cluster changes can be made using the "Edit" dialog from the cluster page in the GUI or from the CycleCloud CLI. Cluster topology changes, such as new partitions, generally require editing and re-importing the cluster template. This can be applied to live, running clusters as well as terminated clusters. It is also possible to import changes as a new Template for future cluster creation via the GUI. - + When updating a running cluster, some changes may need to be applied directly on the running nodes. Slurm clusters deployed by CycleCloud include a cli, available on the scheduler node, called `azslurm` which facilitates applying cluster configuration and scaling changes for running clusters. 
- + After making any changes to the running cluster, run the following command as root on the Slurm scheduler node to rebuild the `azure.conf` and update the nodes in the cluster: - + ``` $ sudo -i # azslurm scale ``` This should create the partitions with the correct number of nodes, the proper `gres.conf` and restart the `slurmctld`. - + For changes that are not available via the cluster's "Edit" dialog in the GUI, the cluster template must be customized. First, download a copy of the [Slurm cluster template](#templates/slurm.txt), if you do not have it. Then, to make template changes for a cluster you can perform the following commands using the cyclecloud cli. ``` # First update a copy of the slurm template (shown as ./MODIFIED_SLURM.txt below) - + cyclecloud export_parameters MY_CLUSTERNAME > ./MY_CLUSTERNAME.json cyclecloud import_cluster MY_CLUSTERNAME -c slurm -f ./MODIFIED_slurm.txt -p ./MY_CLUSTERNAME.json --force ``` For a terminated cluster you can go ahead and start the cluster with all changes in effect. - + **IMPORTANT: There is no need to terminate the cluster or scale down to apply changes.** To apply changes to a running/started cluster perform the following steps after you have completed the previous steps: @@ -112,7 +114,7 @@ PartitionName=mydynamic Nodes=mydynamicns ``` #### Using Dynamic Partitions to Autoscale -By default, we define no nodes in the dynamic partition. +By default, we define no nodes in the dynamic partition. You can pre-create node records like so, which allows Slurm to autoscale them up. ```bash @@ -170,7 +172,7 @@ To shutdown nodes, run `/opt/azurehpc/slurm/suspend_program.sh node_list` (e.g. To start a cluster in this mode, simply add `SuspendTime=-1` to the additional slurm config in the template. -To switch a cluster to this mode, add `SuspendTime=-1` to the slurm.conf and run `scontrol reconfigure`. Then run `azslurm remove_nodes && azslurm scale`. 
+To switch a cluster to this mode, add `SuspendTime=-1` to the slurm.conf and run `scontrol reconfigure`. Then run `azslurm remove_nodes && azslurm scale`. ### Slurm Job Accounting @@ -185,7 +187,7 @@ To setup job accounting, following fields are defined in the slurm cluster creat - *Database URL* - What this refers to is the "Database" URL, a DNS resolvable address (or an IP address) of where mysql database lives. -- *Database Name* - This is the database name that the Slurm Cluster will use. If this is not defined, then this is "clustername-acct-db". +- *Database Name* - This is the database name that the Slurm Cluster will use. If this is not defined, then this is "clustername-acct-db". Each cluster typically (when not defined) has its own database. This helps to not cause roll ups between starting clusters of different slurm versions. - *Database User* - This refers to the username slurmdbd will use to connect to MySQL Server. - *Database Password* - This refers to the password slurmdbd will use to connect to MySQL Server. @@ -358,13 +360,13 @@ Cyclecloud Slurm clusters now include prolog and epilog scripts to enable and cl slurm.imex.enabled=True or slurm.imex.enabled=False -``` +``` ### Setting KeepAlive Added in 4.0.5: If the KeepAlive attribute is set in the CycleCloud UI, then the azslurmd will add that node's name to the `SuspendExcNodes` attribute via scontrol. Note that it is required that `ReconfigFlags=KeepPowerSaveSettings` is set in the slurm.conf, as is the default as of 4.0.5. Once KeepALive is set back to false, `azslurmd` will then remove this node from `SuspendExcNodes`. -If a node is added to `SuspendExcNodes` either via `azslurm keep_alive` or via the scontrol command, then `azslurmd` will not remove this node from the `SuspendExcNodes` if KeepAlive is false in CycleCloud. However, if the node is later set to KeepAlive as true in the UI then `azslurmd` will then remove it from `SuspendExcNodes` when the node is set back to KeepAlive is false. 
+If a node is added to `SuspendExcNodes` either via `azslurm keep_alive` or via the scontrol command, then `azslurmd` will not remove this node from the `SuspendExcNodes` if KeepAlive is false in CycleCloud. However, if the node is later set to KeepAlive as true in the UI then `azslurmd` will then remove it from `SuspendExcNodes` when the node is set back to KeepAlive is false. ### Slurmrestd As of version 4.0.5, `slurmrestd` is automatically configured and started on the scheduler node and scheduler-ha node for all Slurm clusters. This REST API service provides programmatic access to Slurm functionality, allowing external applications and tools to interact with the cluster. For more information on the Slurm REST API, see the [official Slurm REST API documentation](https://slurm.schedmd.com/rest_api.html). @@ -432,8 +434,58 @@ To check if the configured exporters are exposing metrics, connect to a node and - For the DCGM Exporter : `curl -s http://localhost:9400/metrics` - only available on VM type with NVidia GPU - For the Slurm Exporter : `curl -s http://localhost:9200/metrics` - only available on the Slurm scheduler VM +#### AzSlurm Exporter + +The AzSlurm Exporter is a lightweight, asynchronous Prometheus exporter that runs on the Slurm scheduler node as a systemd service and exposes Slurm cluster metrics on port `9101` at the `/metrics` endpoint. It periodically queries the CLI tools available on the cluster (`squeue`, `sacct`, `sinfo`, `azslurm`, `jetpack`), parses their output, and publishes metrics in Prometheus format for ingestion by Azure Monitor or any Prometheus-compatible monitoring system. + +If a collector binary is unavailable, that collector is skipped with a warning. The exporter exits only if **no** collectors initialize successfully.
+ +##### Exported Metrics + +**squeue metrics** + +| Metric | Type | Labels | Description | +|---|---|---|---| +| `squeue_partition_jobs_state` | Gauge | `partition`, `state` | Number of jobs in each state per partition | +| `squeue_job_nodes_allocated` | Gauge | `job_id`, `job_name`, `partition`, `state` | Nodes allocated to each running job | + +**sacct metrics** + +| Metric | Type | Labels | Description | +|---|---|---|---| +| `sacct_terminal_jobs` | Counter | `partition`, `exit_code`, `reason`, `state` | Cumulative count of completed/failed/cancelled jobs | + +Terminal states tracked: `completed`, `failed`, `cancelled`, `timeout`, `node_fail`, `preempted`, `out_of_memory`, `deadline`, `boot_fail`. Exit codes are mapped to human-readable reasons (e.g. `137:0` → `SIGKILL - Force killed`). + +**sinfo metrics** + +| Metric | Type | Labels | Description | +|---|---|---|---| +| `sinfo_partition_nodes_state` | Gauge | `node_list`, `partition`, `state`, `reason` | Number of nodes in each state per partition | + +Node state suffixes (e.g. `*` = not responding, `~` = powered off, `#` = powering up) are normalized to descriptive state names. + +**azslurm metrics** + +| Metric | Type | Labels | Description | +|---|---|---|---| +| `azslurm_partition_info` | Gauge | `partition`, `nodelist`, `vm_size`, `azure_count` | Available node count per partition. The gauge value is the `available_count` for the partition, and the `azure_count` label reflects the minimum of family and regional quota availability. | + +The `azslurm` collector queries `azslurm partitions` and `azslurm limits` to combine partition-to-nodelist mappings with Azure quota and VM availability information, providing visibility into how many nodes can actually be provisioned for each partition. + +**jetpack metrics** + +| Metric | Type | Labels | Description | +|---|---|---|---| +| `jetpack_cluster_info` | Gauge | `region` | Cluster metadata exposing the Azure region where the cluster is deployed. 
Always set to `1` as an info-style metric. | + +The `jetpack` collector queries `jetpack config` to retrieve the Azure region from the VM's compute metadata. It runs infrequently (default: every 24 hours) since this value does not change during the lifetime of a cluster. + #### Example Dashboards +**AzSlurm Dashboard** +![Alt](/images/azslurmexporterdash.png "Example AzSlurm Exporter Grafana Dashboard") + **Slurm Dashboard** ![Alt](/images/slurmexporterdash.png "Example Slurm Exporter Grafana Dashboard") *Note: this dashboard is not published with cyclecloud-monitoring project and is used here as an example* @@ -538,7 +590,7 @@ Nov 18 17:51:58 rc403-hpc-1 slurmd[8046]: [2025-11-18T17:51:58.002] error: Secur For some regions and VM sizes, some subscriptions may report an incorrect number of GPUs. This value is controlled in `/opt/azurehpc/slurm/autoscale.json` -The default definition looks like the following: +The default definition looks like the following: ```json "default_resources": [ { @@ -575,7 +627,7 @@ Slurm requires that you define the amount of free memory, after OS/Applications To change this dampening, there are two options. 1) You can define `slurm.dampen_memory=X` where X is an integer percentage (5 == 5%) -2) Create a default_resource definition in the /opt/azurehpc/slurm/autoscale.json file. +2) Create a default_resource definition in the /opt/azurehpc/slurm/autoscale.json file. ```json "default_resources": [ { @@ -618,8 +670,8 @@ This will change the behavior of the `azslurm return_to_idle` command that is, b 3. `cyclecloud_slurm.sh` no longer exists. Instead there is the azslurm cli, which can be run as root. `azslurm` uses autocomplete. 
```bash [root@scheduler ~]# azslurm - usage: - accounting_info - + usage: + accounting_info - buckets - Prints out autoscale bucket information, like limits etc config - Writes the effective autoscale config, after any preprocessing, to stdout connect - Tests connection to CycleCloud @@ -627,7 +679,7 @@ This will change the behavior of the `azslurm return_to_idle` command that is, b default_output_columns - Output what are the default output columns for an optional command. initconfig - Creates an initial autoscale config. Writes to stdout keep_alive - Add, remeove or set which nodes should be prevented from being shutdown. - limits - + limits - nodes - Query nodes partitions - Generates partition configuration refresh_autocomplete - Refreshes local autocomplete information for cluster specific resources and nodes. @@ -635,7 +687,7 @@ This will change the behavior of the `azslurm return_to_idle` command that is, b resume - Equivalent to ResumeProgram, starts and waits for a set of nodes. resume_fail - Equivalent to SuspendFailProgram, shutsdown nodes retry_failed_nodes - Retries all nodes in a failed state. - scale - + scale - shell - Interactive python shell with relevant objects in local scope. 
Use --script to run python scripts suspend - Equivalent to SuspendProgram, shutsdown nodes topology - Generates topology plugin configuration diff --git a/azure-slurm-exporter/add_dashboards.sh b/azure-slurm-exporter/add_dashboards.sh new file mode 100755 index 00000000..39b659d0 --- /dev/null +++ b/azure-slurm-exporter/add_dashboards.sh @@ -0,0 +1,27 @@ +#!/bin/bash +EXPORTER_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )" +echo "Exporter directory: $EXPORTER_DIR" +RESOURCE_GROUP_NAME=$1 +GRAFANA_NAME=$2 + +if [ -z "$GRAFANA_NAME" ]; then + echo "Usage: $0 " + exit 1 +fi +if [ -z "$RESOURCE_GROUP_NAME" ]; then + echo "Usage: $0 " + exit 1 +fi + +FOLDER_NAME="Azure CycleCloud" +DASHBOARD_FOLDER=$EXPORTER_DIR/dashboards +# Create Grafana dashboards folders +az grafana folder show -n $GRAFANA_NAME -g $RESOURCE_GROUP_NAME --folder "$FOLDER_NAME" > /dev/null 2>&1 +if [ $? -ne 0 ]; then + echo "$FOLDER_NAME folder does not exist. Creating it." + az grafana folder create --name $GRAFANA_NAME --resource-group $RESOURCE_GROUP_NAME --title "$FOLDER_NAME" +fi + +# Slurm Dashboard +az grafana dashboard import --name $GRAFANA_NAME --resource-group $RESOURCE_GROUP_NAME --folder "$FOLDER_NAME" --overwrite true --definition $DASHBOARD_FOLDER/azslurm.json +az grafana dashboard import --name $GRAFANA_NAME --resource-group $RESOURCE_GROUP_NAME --folder "$FOLDER_NAME" --overwrite true --definition $DASHBOARD_FOLDER/failed-jobs.json \ No newline at end of file diff --git a/azure-slurm-exporter/dashboards/azslurm.json b/azure-slurm-exporter/dashboards/azslurm.json new file mode 100644 index 00000000..acc4dee4 --- /dev/null +++ b/azure-slurm-exporter/dashboards/azslurm.json @@ -0,0 +1,2741 @@ +{ + "annotations": { + "list": [ + { + "$$hashKey": "object:1345", + "builtIn": 1, + "datasource": { + "type": "datasource", + "uid": "grafana" + }, + "enable": true, + "hide": true, + "iconColor": "rgba(0, 211, 255, 1)", + "name": "Annotations & Alerts", + "type": "dashboard" + } + ] + }, 
+ "description": "Slurm Cluster Mission Control Dashboard", + "editable": true, + "fiscalYearStartMonth": 0, + "graphTooltip": 2, + "id": null, + "links": [], + "panels": [ + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 0 + }, + "id": 1040, + "panels": [], + "title": "Cluster Specs", + "type": "row" + }, + { + "datasource": { + "uid": "${promDatasource}" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "custom": { + "align": "auto", + "cellOptions": { + "type": "auto" + }, + "inspect": false + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 10, + "x": 0, + "y": 1 + }, + "id": 1041, + "options": { + "cellHeight": "sm", + "footer": { + "countRows": false, + "fields": "", + "reducer": [ + "sum" + ], + "show": false + }, + "showHeader": true + }, + "pluginVersion": "11.6.9", + "targets": [ + { + "editorMode": "code", + "exemplar": false, + "expr": "jetpack_cluster_info{cluster=\"$cluster\"}", + "instant": true, + "legendFormat": "__auto", + "range": false, + "refId": "A" + } + ], + "title": "Cluster Specs", + "transformations": [ + { + "id": "labelsToFields", + "options": { + "keepLabels": [ + "subscription", + "cluster", + "region" + ], + "mode": "rows" + } + }, + { + "id": "sortBy", + "options": { + "fields": {}, + "sort": [ + { + "field": "label" + } + ] + } + }, + { + "id": "organize", + "options": { + "excludeByName": {}, + "includeByName": {}, + "indexByName": {}, + "renameByName": { + "label": "Spec", + "value": "Value" + } + } + } + ], + "type": "table" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + } + ] + } + }, + "overrides": [] 
+ }, + "gridPos": { + "h": 8, + "w": 8, + "x": 10, + "y": 1 + }, + "id": 1042, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "orientation": "auto", + "percentChangeColorMode": "standard", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "value_and_name", + "wideLayout": false + }, + "pluginVersion": "11.6.9", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "editorMode": "code", + "expr": "sum by (partition)(scontrol_partition_nodes{cluster=\"$cluster\",partition=~\"$partition\"})", + "hide": true, + "instant": false, + "legendFormat": "{{partition}}", + "range": true, + "refId": "A" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "editorMode": "code", + "expr": "sum by (cluster)(sinfo_partition_nodes_state{cluster=\"$cluster\"})", + "hide": false, + "instant": false, + "interval": "", + "legendFormat": "Total Slurm Nodes", + "range": true, + "refId": "B" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "editorMode": "code", + "expr": "sum by (partition)(sinfo_partition_nodes_state{cluster=\"$cluster\",partition=~\"$partition\"})", + "hide": false, + "instant": false, + "legendFormat": "{{partition}}", + "range": true, + "refId": "C" + } + ], + "title": "Total Nodes by Partition $partition", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "custom": { + "align": "auto", + "cellOptions": { + "type": "auto" + }, + "filterable": false, + "inspect": false, + "minWidth": 150 + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 9, + "w": 18, 
+ "x": 0, + "y": 9 + }, + "id": 1043, + "options": { + "cellHeight": "lg", + "footer": { + "countRows": false, + "enablePagination": false, + "fields": "", + "reducer": [ + "sum" + ], + "show": false + }, + "frameIndex": 1, + "showHeader": true, + "sortBy": [ + { + "desc": true, + "displayName": "nodelist" + } + ] + }, + "pluginVersion": "11.6.9", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "editorMode": "code", + "exemplar": false, + "expr": "azslurm_partition_info{cluster=\"$cluster\",partition=~\"$partition\"}", + "instant": true, + "legendFormat": "__auto", + "range": false, + "refId": "A" + } + ], + "title": "Partition Specs", + "transformations": [ + { + "id": "timeSeriesTable", + "options": { + "A": { + "stat": "lastNotNull", + "timeField": "Time" + } + } + }, + { + "id": "labelsToFields", + "options": { + "mode": "columns" + } + }, + { + "id": "organize", + "options": { + "excludeByName": { + "Time": true, + "__name__": true, + "cluster": true, + "instance": true, + "job": true, + "physical_host": true, + "subscription": true + }, + "includeByName": {}, + "indexByName": { + "Trend #A": 9, + "__name__": 0, + "azure_count": 10, + "cluster": 1, + "instance": 2, + "job": 3, + "nodelist": 8, + "partition": 4, + "physical_host": 5, + "subscription": 6, + "vm_size": 7 + }, + "renameByName": { + "Metric": "", + "Trend #A": "Available Count", + "available_azure_quota": "Available Azure Quota", + "azure_count": "Available Azure Count", + "node_list": "Node List", + "partition": "Partition", + "vm_size": "VM Size" + } + } + } + ], + "type": "table" + }, + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 18 + }, + "id": 1027, + "panels": [], + "title": "Cluster Overview for partition $partition", + "type": "row" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], +
"thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "Running Jobs" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "green", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Configuring Jobs" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "semi-dark-yellow", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Completed Jobs" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "blue", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "" + }, + "properties": [] + }, + { + "matcher": { + "id": "byName", + "options": "Cancelled Jobs" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "semi-dark-orange", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Node Failed Jobs" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "red", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Failed Jobs" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "red", + "mode": "fixed" + } + } + ] + } + ] + }, + "gridPos": { + "h": 8, + "w": 9, + "x": 0, + "y": 19 + }, + "id": 2, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "orientation": "auto", + "percentChangeColorMode": "standard", + "reduceOptions": { + "calcs": [ + "last" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "text": {}, + "textMode": "auto", + "wideLayout": false + }, + "pluginVersion": "11.6.9", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum by 
(state)(squeue_partition_jobs_state{cluster=\"$cluster\",partition=~\"$partition\"})", + "hide": false, + "instant": true, + "legendFormat": "{{state}}", + "range": false, + "refId": "B" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum by (cluster)(label_replace(squeue_partition_jobs_state{cluster=\"$cluster\",partition=~\"$partition\"}, \"total\", \"Submitted Jobs\", \"\", \"\"))", + "hide": false, + "instant": true, + "legendFormat": "Submitted Jobs", + "range": false, + "refId": "C" + } + ], + "title": "squeue snapshot", + "transformations": [ + { + "disabled": true, + "filter": { + "id": "byRefId", + "options": "/^(?:F)$/" + }, + "id": "filterFieldsByName", + "options": { + "byVariable": true, + "include": { + "variable": "$partition" + } + }, + "topic": "annotations" + } + ], + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "barWidthFactor": 0.6, + "drawStyle": "line", + "fillOpacity": 1, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "never", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "Running Jobs" + }, + "properties": [ + { + "id": "color", + "value": { + 
"fixedColor": "green", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Configuring Jobs" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "semi-dark-yellow", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Completed Jobs" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "blue", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "" + }, + "properties": [] + }, + { + "matcher": { + "id": "byName", + "options": "Cancelled Jobs" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "semi-dark-orange", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Node Failed Jobs" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "red", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Submitted Jobs" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "dark-blue", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Jobs in Queue" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "blue", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Failed Jobs" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "red", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "running" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "green", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "configuring" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "yellow", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "completing" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "dark-purple", + "mode": "fixed" + } + } + ] + }, + { + 
"matcher": { + "id": "byName", + "options": "pending" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "orange", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "submitted" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "text", + "mode": "fixed" + } + } + ] + } + ] + }, + "gridPos": { + "h": 16, + "w": 10, + "x": 9, + "y": 19 + }, + "id": 1037, + "options": { + "legend": { + "calcs": [ + "lastNotNull" + ], + "displayMode": "table", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "hideZeros": false, + "mode": "multi", + "sort": "none" + } + }, + "pluginVersion": "11.6.9", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "editorMode": "code", + "expr": "sum by (state)(squeue_partition_jobs_state{cluster=\"$cluster\",partition=~\"$partition\"})", + "hide": false, + "instant": false, + "legendFormat": "{{state}}", + "range": true, + "refId": "B" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "editorMode": "code", + "expr": "sum by (cluster)(label_replace(squeue_partition_jobs_state{cluster=\"$cluster\",partition=~\"$partition\"}, \"total\", \"Submitted Jobs\", \"\", \"\"))", + "hide": false, + "instant": false, + "legendFormat": "submitted", + "range": true, + "refId": "A" + } + ], + "title": "Job Queue Overview", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "description": "#of of allocated nodes over all powered up nodes", + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "red" + }, + { + "color": "yellow", + "value": 50 + }, + { + "color": "green", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 7, + "w": 5, + "x": 19, + "y": 19 + }, + "id": 1049, + "options": { + 
"colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "percentChangeColorMode": "standard", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "11.6.9", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "editorMode": "code", + "expr": "(\r\n sum(sinfo_partition_nodes_state{cluster=\"$cluster\", partition=~\"$partition\", state=~\"allocated|mixed|draining\"})\r\n /\r\n (\r\n sum(sinfo_partition_nodes_state{cluster=\"$cluster\", partition=~\"$partition\"})\r\n -\r\n sum(sinfo_partition_nodes_state{cluster=\"$cluster\", partition=~\"$partition\", state=~\"powered_off\"})\r\n )\r\n) * 100", + "hide": false, + "instant": false, + "legendFormat": "__auto", + "range": true, + "refId": "C" + } + ], + "title": "Active Cluster Utilization %", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "description": "#of of allocated nodes over all available nodes", + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "red" + }, + { + "color": "yellow", + "value": 50 + }, + { + "color": "green", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 7, + "w": 5, + "x": 19, + "y": 26 + }, + "id": 1050, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "percentChangeColorMode": "standard", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "11.6.9", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "editorMode": "code", + "expr": 
"(\r\n sum(sinfo_partition_nodes_state{cluster=\"$cluster\", partition=~\"$partition\", state=~\"allocated|mixed|draining\"})\r\n /\r\n (\r\n sum(sinfo_partition_nodes_state{cluster=\"$cluster\", partition=~\"$partition\"})\r\n\r\n )\r\n) * 100", + "hide": false, + "instant": false, + "legendFormat": "__auto", + "range": true, + "refId": "A" + } + ], + "title": "Overall Cluster Utilization %", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "custom": { + "align": "auto", + "cellOptions": { + "type": "auto" + }, + "filterable": true, + "inspect": false + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 9, + "x": 0, + "y": 27 + }, + "id": 1047, + "options": { + "cellHeight": "sm", + "footer": { + "countRows": false, + "fields": "", + "reducer": [ + "sum" + ], + "show": false + }, + "frameIndex": 3, + "showHeader": true, + "sortBy": [ + { + "desc": true, + "displayName": "Number of Allocated Nodes" + } + ] + }, + "pluginVersion": "11.6.9", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "editorMode": "code", + "exemplar": false, + "expr": "topk(10,squeue_job_nodes_allocated{cluster=\"$cluster\",partition=~\"$partition\"})", + "instant": true, + "legendFormat": "__auto", + "range": false, + "refId": "A" + } + ], + "title": "Current Running Jobs by Node Allocation", + "transformations": [ + { + "id": "timeSeriesTable", + "options": {} + }, + { + "id": "organize", + "options": { + "excludeByName": { + "__name__": true, + "cluster": true, + "instance": true, + "job": true, + "physical_host": true, + "state": true, + "subscription": true + }, + "includeByName": {}, + "indexByName": {}, + "renameByName": { + "Trend #A": "Number of 
Allocated Nodes" + } + } + }, + { + "id": "sortBy", + "options": { + "fields": {}, + "sort": [ + { + "field": "Number of Allocated Nodes" + } + ] + } + } + ], + "type": "table" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + } + }, + "fieldMinMax": false, + "mappings": [] + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "IDLE" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "purple", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "ALLOCATED" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "green", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "DRAINING" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "yellow", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "DOWN" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "dark-red", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "MAINTENANCE" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "super-light-red", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "UNKNOWN" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "#ff00f6", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Total" + }, + "properties": [ + { + "id": "custom.hideFrom", + "value": { + "legend": false, + "tooltip": false, + "viz": true + } + }, + { + "id": "color", + "value": { + "fixedColor": "transparent", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "POWERING UP" + }, + "properties": [ + { + "id": "color", + "value": { + 
"fixedColor": "blue", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "POWERED DOWN" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "#767472", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "MIXED" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "dark-yellow", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "DRAIN" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "yellow", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "DRAINED" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "dark-orange", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "RESERVED" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "dark-blue", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "powered_off" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "#918f8fa6", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "planned_backfill" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "yellow", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "allocated" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "green", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "n/a" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "transparent", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "mixed" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "dark-orange", + "mode": "fixed" + } + } + ] + } + ] + }, + "gridPos": { + "h": 10, + "w": 9, + "x": 0, + "y": 35 + }, + "id": 1035, + "options": { + "displayLabels": [ + 
"percent", + "value", + "name" + ], + "legend": { + "displayMode": "table", + "placement": "right", + "showLegend": true, + "sortBy": "Value", + "sortDesc": true, + "values": [ + "value" + ] + }, + "pieType": "pie", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "tooltip": { + "hideZeros": true, + "mode": "multi", + "sort": "none" + } + }, + "pluginVersion": "11.6.9", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum by (state)(sinfo_partition_nodes_state{cluster=\"$cluster\",partition=~\"$partition\"})", + "hide": false, + "instant": true, + "legendFormat": "{{state}}", + "range": false, + "refId": "D" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum by (cluster)(sinfo_partition_nodes_state{cluster=\"$cluster\",partition=~\"$partition\"})", + "hide": false, + "instant": true, + "legendFormat": "Total", + "range": false, + "refId": "A" + } + ], + "title": "Live: Node status", + "type": "piechart" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "fixed" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "barWidthFactor": 0.6, + "drawStyle": "line", + "fillOpacity": 37, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 2, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "normal" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "fieldMinMax": false, + "mappings": [], + "thresholds": 
{ + "mode": "absolute", + "steps": [ + { + "color": "green" + } + ] + } + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "IDLE" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "purple", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "ALLOCATED" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "green", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "DRAINING" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "yellow", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "DOWN" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "dark-red", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "MAINTENANCE" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "super-light-red", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "UNKNOWN" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "#ff00f6", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "TOTAL" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "transparent", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "POWERED DOWN" + }, + "properties": [ + { + "id": "color", + "value": { + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "UNRESPONSIVE" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "yellow", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "POWERED UP" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "green", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "POWERING UP" + }, + "properties": [ + { + "id": "color", + "value": { + 
"fixedColor": "blue", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "TOTAL" + }, + "properties": [ + { + "id": "custom.hideFrom", + "value": { + "legend": false, + "tooltip": false, + "viz": true + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "MIXED" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "dark-yellow", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "DRAIN" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "dark-yellow", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "DRAINED" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "dark-orange", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "RESERVED" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "dark-blue", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "FAIL" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "red", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "allocated" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "semi-dark-green", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "idle" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "purple", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "powering_up" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "blue", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "completing" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "dark-purple", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "planned_backfill" + }, + "properties": [ + { + "id": "color", + 
"value": { + "fixedColor": "dark-yellow", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "mixed" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "dark-orange", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "n/a" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "transparent", + "mode": "fixed" + } + } + ] + } + ] + }, + "gridPos": { + "h": 18, + "w": 10, + "x": 9, + "y": 35 + }, + "id": 1034, + "options": { + "legend": { + "calcs": [ + "lastNotNull" + ], + "displayMode": "table", + "placement": "bottom", + "showLegend": true, + "sortBy": "Last *", + "sortDesc": true + }, + "tooltip": { + "hideZeros": true, + "mode": "multi", + "sort": "asc" + } + }, + "pluginVersion": "11.6.9", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "editorMode": "code", + "expr": "sum by (state)(sinfo_partition_nodes_state{cluster=\"$cluster\",partition=~\"$partition\"})", + "hide": false, + "instant": false, + "legendFormat": "__auto", + "range": true, + "refId": "A" + } + ], + "title": "Live: Node status", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "custom": { + "align": "auto", + "cellOptions": { + "type": "auto" + }, + "filterable": true, + "inspect": false + }, + "fieldMinMax": false, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "IDLE" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "purple", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "ALLOCATED" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "green", + "mode": 
"fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "DRAINING" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "yellow", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "DOWN" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "dark-red", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "MAINTENANCE" + }, + "properties": [ + { + "id": "color", + "value": { + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "UNKNOWN" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "#ff00f6", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "TOTAL" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "transparent", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "POWERING UP" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "blue", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "POWERED DOWN" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "#767472", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "MIXED" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "yellow", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "DRAIN" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "yellow", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "partition" + }, + "properties": [ + { + "id": "custom.width", + "value": 94 + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "nodelist" + }, + "properties": [ + { + "id": "custom.width", + "value": 229 + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "reason" + }, + "properties": [ + { + "id": 
"custom.width", + "value": 96 + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "node_list" + }, + "properties": [ + { + "id": "custom.width", + "value": 169 + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "state" + }, + "properties": [ + { + "id": "custom.width", + "value": 122 + } + ] + } + ] + }, + "gridPos": { + "h": 8, + "w": 9, + "x": 0, + "y": 45 + }, + "id": 1051, + "options": { + "cellHeight": "sm", + "footer": { + "countRows": false, + "fields": "", + "reducer": [ + "sum" + ], + "show": false + }, + "frameIndex": 2, + "showHeader": true, + "sortBy": [ + { + "desc": true, + "displayName": "node_list" + } + ] + }, + "pluginVersion": "11.6.9", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sinfo_partition_nodes_state{cluster=\"$cluster\",partition=~\"$partition\"}", + "hide": false, + "instant": true, + "legendFormat": "__auto", + "range": false, + "refId": "B" + } + ], + "title": "Nodelist by State", + "transformations": [ + { + "id": "timeSeriesTable", + "options": {} + }, + { + "id": "labelsToFields", + "options": { + "mode": "columns" + } + }, + { + "id": "organize", + "options": { + "excludeByName": { + "__name__": true, + "cluster": true, + "instance": true, + "job": true, + "physical_host": true, + "subscription": true + }, + "includeByName": {}, + "indexByName": { + "Trend #A": 10, + "__name__": 1, + "cluster": 2, + "instance": 3, + "job": 4, + "node_list": 6, + "partition": 5, + "physical_host": 7, + "reason": 8, + "state": 0, + "subscription": 9 + }, + "renameByName": { + "Trend #A": "count", + "Trend #B": "count" + } + } + } + ], + "type": "table" + }, + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 53 + }, + "id": 1044, + "panels": [], + "title": "Total Finished Jobs for Partition $partition", + "type": "row" + }, + { + "datasource": { + "type": "prometheus", + "uid": 
"${promDatasource}" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + } + }, + "mappings": [] + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "Running Jobs" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "green", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Configuring Jobs" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "semi-dark-yellow", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Completed Jobs" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "green", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "" + }, + "properties": [] + }, + { + "matcher": { + "id": "byName", + "options": "Cancelled Jobs" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "semi-dark-orange", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Node Failed Jobs" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "red", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Total Finished Jobs" + }, + "properties": [ + { + "id": "custom.hideFrom", + "value": { + "legend": false, + "tooltip": false, + "viz": true + } + }, + { + "id": "color", + "value": { + "fixedColor": "transparent", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Failed Jobs" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "red", + "mode": "fixed" + } + }, + { + "id": "links", + "value": [ + { + "targetBlank": true, + "title": "Failed Jobs in last 6 months", + "url": 
"/d/cff0f6w0qqwaoa/failed-jobs-dashboard?orgId=1&from=${__from}&to=${__to}&timezone=browser&${promDatasource:queryparam}&${partition:queryparam}&${cluster:queryparam}&viewPanel=panel-1050" + } + ] + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Timed Out Jobs" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "yellow", + "mode": "fixed" + } + } + ] + } + ] + }, + "gridPos": { + "h": 10, + "w": 10, + "x": 0, + "y": 54 + }, + "id": 1039, + "options": { + "displayLabels": [ + "name", + "value" + ], + "legend": { + "displayMode": "table", + "placement": "bottom", + "showLegend": true, + "values": [ + "value", + "percent" + ] + }, + "pieType": "pie", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "tooltip": { + "hideZeros": false, + "mode": "multi", + "sort": "none" + } + }, + "pluginVersion": "11.6.9", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum by (state)(sacct_terminal_jobs_total{cluster=\"$cluster\",partition=~\"$partition\",state=\"completed\"})", + "hide": false, + "instant": false, + "legendFormat": "Completed Jobs", + "range": true, + "refId": "C" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "editorMode": "code", + "expr": "sum by (failed)(label_replace(sacct_terminal_jobs_total{cluster=\"$cluster\",partition=~\"$partition\",state!=\"completed\"},\"failed\",\"Failed Jobs\",\"\",\"\"))", + "hide": false, + "instant": false, + "legendFormat": "Failed Jobs", + "range": true, + "refId": "D" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "editorMode": "code", + "expr": "sum by (cluster)(sacct_terminal_jobs_total{cluster=\"$cluster\",partition=~\"$partition\"})", + "hide": false, + "instant": false, + "legendFormat": "Total Finished Jobs", + "range": true, + "refId": "A" + }, + { + 
"datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "editorMode": "code", + "expr": "sum by (state)(increase(sacct_terminal_jobs_total{cluster=\"$cluster\",partition=~\"$partition\",state=\"timeout\"}[$__range]))", + "hide": true, + "instant": false, + "legendFormat": "__auto", + "range": true, + "refId": "B" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum(increase(sacct_terminal_jobs_total{cluster=\"$cluster\",partition=~\"$partition\", state=\"completed\"}[$__range]))", + "hide": true, + "instant": false, + "legendFormat": "Total Finished Jobs", + "range": true, + "refId": "E" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "editorMode": "code", + "expr": "sacct_terminal_jobs_total{cluster=\"$cluster\",partition=~\"$partition\",state=\"timeout\"}", + "hide": true, + "instant": false, + "legendFormat": "__auto", + "range": true, + "refId": "F" + } + ], + "title": "Total Finished Jobs", + "type": "piechart" + } + ], + "preload": false, + "refresh": "5s", + "schemaVersion": 41, + "tags": [], + "templating": { + "list": [ + { + "current": { + "text": "cbbe2034-c78b-4e9b-89b4-8b78530247e5", + "value": "cbbe2034-c78b-4e9b-89b4-8b78530247e5" + }, + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "definition": "label_values(jetpack_cluster_info,subscription)", + "includeAll": false, + "label": "Subscription", + "name": "Subscription", + "options": [], + "query": { + "qryType": 1, + "query": "label_values(jetpack_cluster_info,subscription)", + "refId": "PrometheusVariableQueryEditor-VariableQuery" + }, + "refresh": 1, + "regex": "", + "type": "query" + }, + { + "current": { + "text": "Managed_Prometheus_ccw-mon-xfnnjso7smaw6", + "value": "ccw-mon-xfnnjso7smaw6" + }, + "includeAll": false, + "label": "Prometheus Data Source", + "name": "promDatasource", + "options": [], + "query": 
"prometheus", + "refresh": 1, + "regex": "", + "type": "datasource" + }, + { + "current": { + "text": "All", + "value": "$__all" + }, + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "definition": "label_values(sinfo_partition_nodes_state, partition)", + "includeAll": true, + "multi": true, + "name": "partition", + "options": [], + "query": { + "qryType": 5, + "query": "label_values(sinfo_partition_nodes_state, partition)", + "refId": "PrometheusVariableQueryEditor-VariableQuery" + }, + "refresh": 1, + "regex": "", + "type": "query" + }, + { + "current": { + "text": "azcyclecloudwesteu-rg/azslurm-exporter-34-copy", + "value": "azcyclecloudwesteu-rg/azslurm-exporter-34-copy" + }, + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "definition": "label_values(node_cpu_seconds_total, cluster)", + "name": "cluster", + "options": [], + "query": { + "qryType": 5, + "query": "label_values(node_cpu_seconds_total, cluster)", + "refId": "PrometheusVariableQueryEditor-VariableQuery" + }, + "refresh": 1, + "regex": "", + "type": "query" + } + ] + }, + "time": { + "from": "now-5m", + "to": "now" + }, + "timepicker": {}, + "timezone": "", + "title": "AzSlurm Dashboard", + "uid": "aff0elgh9i0hsa", + "version": 1 +} \ No newline at end of file diff --git a/azure-slurm-exporter/dashboards/failed-jobs.json b/azure-slurm-exporter/dashboards/failed-jobs.json new file mode 100644 index 00000000..a917aa92 --- /dev/null +++ b/azure-slurm-exporter/dashboards/failed-jobs.json @@ -0,0 +1,348 @@ +{ + "annotations": { + "list": [ + { + "$$hashKey": "object:1345", + "builtIn": 1, + "datasource": { + "type": "datasource", + "uid": "grafana" + }, + "enable": true, + "hide": true, + "iconColor": "rgba(0, 211, 255, 1)", + "name": "Annotations & Alerts", + "type": "dashboard" + } + ] + }, + "description": "Failed Jobs Overview", + "editable": true, + "fiscalYearStartMonth": 0, + "graphTooltip": 0, + "id": null, + "links": [], + "panels": [ + { 
+ "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 0 + }, + "id": 1047, + "panels": [], + "title": "Failed Jobs", + "type": "row" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "custom": { + "align": "auto", + "cellOptions": { + "type": "auto" + }, + "filterable": true, + "inspect": false, + "minWidth": 150 + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 11, + "w": 16, + "x": 0, + "y": 1 + }, + "id": 1050, + "options": { + "cellHeight": "lg", + "footer": { + "countRows": false, + "enablePagination": false, + "fields": "", + "reducer": [ + "sum" + ], + "show": false + }, + "frameIndex": 1, + "showHeader": true, + "sortBy": [] + }, + "pluginVersion": "11.6.9", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sacct_terminal_jobs_total{cluster=\"$cluster\",partition=~\"$partition\", state!=\"completed\"}", + "instant": true, + "legendFormat": "__auto", + "range": false, + "refId": "A" + } + ], + "title": "Failed Jobs by Exit Code", + "transformations": [ + { + "id": "timeSeriesTable", + "options": {} + }, + { + "disabled": true, + "id": "labelsToFields", + "options": { + "keepLabels": [ + "exit_code", + "partition", + "state" + ], + "mode": "columns" + } + }, + { + "id": "organize", + "options": { + "excludeByName": { + "__name__": true, + "cluster": true, + "instance": true, + "job": true, + "physical_host": true, + "start_datetime hpc": true, + "subscription": true + }, + "includeByName": {}, + "indexByName": { + "Trend #A": 3, + "__name__": 0, + "cluster": 1, + "exit_code": 5, + "instance": 6, + "job": 7, + "partition": 2, + "physical_host": 8, + 
"start_datetime": 9, + "state": 4, + "subscription": 10 + }, + "renameByName": { + "Trend #A": "# of Failed Jobs" + } + } + } + ], + "type": "table" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 5, + "w": 16, + "x": 0, + "y": 12 + }, + "id": 1051, + "options": { + "colorMode": "value", + "graphMode": "none", + "justifyMode": "center", + "orientation": "auto", + "percentChangeColorMode": "standard", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "11.6.9", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "editorMode": "code", + "expr": "sum(floor(increase(sacct_terminal_jobs_total{cluster=\"$cluster\",partition=~\"$partition\"}[$__range])))", + "hide": true, + "instant": false, + "legendFormat": "__auto", + "range": true, + "refId": "A" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "editorMode": "code", + "expr": "sum(sacct_terminal_jobs_total{cluster=\"$cluster\",partition=~\"$partition\", state!=\"completed\"})", + "hide": false, + "instant": false, + "legendFormat": "__auto", + "range": true, + "refId": "B" + } + ], + "title": "Failed Jobs by Exit Code", + "type": "stat" + } + ], + "preload": false, + "refresh": "", + "schemaVersion": 41, + "tags": [], + "templating": { + "list": [ + { + "current": { + "text": "cbbe2034-c78b-4e9b-89b4-8b78530247e5", + "value": "cbbe2034-c78b-4e9b-89b4-8b78530247e5" + }, + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "definition": 
"label_values(jetpack_cluster_info,subscription)", + "includeAll": false, + "label": "Subscription", + "name": "Subscription", + "options": [], + "query": { + "qryType": 1, + "query": "label_values(jetpack_cluster_info,subscription)", + "refId": "PrometheusVariableQueryEditor-VariableQuery" + }, + "refresh": 1, + "regex": "", + "type": "query" + }, + { + "current": { + "text": "Managed_Prometheus_ccw-mon-xfnnjso7smaw6", + "value": "ccw-mon-xfnnjso7smaw6" + }, + "includeAll": false, + "label": "Prometheus Data Source", + "name": "promDatasource", + "options": [], + "query": "prometheus", + "refresh": 1, + "regex": "", + "type": "datasource" + }, + { + "current": { + "text": "All", + "value": [ + "$__all" + ] + }, + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "definition": "label_values(sinfo_partition_nodes_state, partition)", + "includeAll": true, + "multi": true, + "name": "partition", + "options": [], + "query": { + "qryType": 5, + "query": "label_values(sinfo_partition_nodes_state, partition)", + "refId": "PrometheusVariableQueryEditor-VariableQuery" + }, + "refresh": 1, + "regex": "", + "type": "query" + }, + { + "current": { + "text": "azcyclecloudwesteu-rg/azslurm-exporter-34-copy", + "value": "azcyclecloudwesteu-rg/azslurm-exporter-34-copy" + }, + "datasource": { + "type": "prometheus", + "uid": "${promDatasource}" + }, + "definition": "label_values(node_cpu_seconds_total, cluster)", + "name": "cluster", + "options": [], + "query": { + "qryType": 5, + "query": "label_values(node_cpu_seconds_total, cluster)", + "refId": "PrometheusVariableQueryEditor-VariableQuery" + }, + "refresh": 1, + "regex": "", + "type": "query" + } + ] + }, + "time": { + "from": "now-5m", + "to": "now" + }, + "timepicker": {}, + "timezone": "", + "title": "Failed Jobs Dashboard", + "uid": "cff0f6w0qqwaoa", + "version": 1 +} \ No newline at end of file diff --git a/azure-slurm-exporter/exporter/__init__.py b/azure-slurm-exporter/exporter/__init__.py new 
file mode 100644
index 00000000..1f697c83
--- /dev/null
+++ b/azure-slurm-exporter/exporter/__init__.py
@@ -0,0 +1,3 @@
+from .exporter import main
+
+__all__ = ["main"]
\ No newline at end of file
diff --git a/azure-slurm-exporter/exporter/azslurm.py b/azure-slurm-exporter/exporter/azslurm.py
new file mode 100644
index 00000000..dc2f0e57
--- /dev/null
+++ b/azure-slurm-exporter/exporter/azslurm.py
@@ -0,0 +1,108 @@
+from .exporter import BaseCollector
+from collections import namedtuple
+from prometheus_client import Gauge
+import json
+import logging
+import exporter.util as util
+import re
+from typing import List
+log = logging.getLogger(__name__)
+
+class AzslurmNotAvailException(Exception):
+    pass
+
+class Azslurm(BaseCollector):
+
+    def __init__(self, binary_path="/root/bin/azslurm", interval=300, timeout=120):
+        self.binary_path = binary_path
+        self.interval = interval
+        self.timeout = timeout
+        self.partition_output = namedtuple("partition_output", ["partition", "node_list"])
+        self.limits_default_fmt = ["nodearray","vm_size","available_count","family_available_count","regional_available_count"]
+        self.limits_output = namedtuple("limits_output",self.limits_default_fmt)
+        self.cached_output = {"azslurm_metrics":[]}
+
+    def initialize(self) -> None:
+        """
+        Initialize the Azslurm instance by validating the binary.
+        """
+        if not util.is_file_binary(self.binary_path):
+            log.error(f"{self.binary_path} is not a file or not executable")
+            raise AzslurmNotAvailException
+
+    def start(self) -> None:
+        """
+        Begin collecting metrics asynchronously and repeat the collection at
+        regular intervals, as defined by the configured interval.
+        """
+        self.launch_task(func=self.azslurm_query, interval=self.interval)
+
+    def export_metrics(self) -> List[Gauge]:
+        """
+        Return metrics in Prometheus-compatible format from cache.
+        """
+        # TODO: Do we need to lock this?
+        return self.cached_output["azslurm_metrics"]
+
+    def parse_output(self, partitions_stdout, limits_stdout) -> List[Gauge]:
+        """
+        Parse azslurm command stdout and return a list holding the Prometheus gauge for partition specs.
+        """
+        azslurm_partition_info = Gauge("azslurm_partition_info", "Partition specs for the cluster and available node count for each partition",
+                                       labelnames=["partition", "nodelist", "vm_size", "azure_count"],
+                                       registry=None)
+
+        nodelist_map = self._parse_partitions(partitions_stdout)
+        for row in self._parse_limits(limits_stdout):
+            nodelist = nodelist_map.get(row.nodearray, "")
+            azslurm_partition_info.labels(partition=row.nodearray,
+                                          nodelist=nodelist,
+                                          vm_size=row.vm_size,
+                                          azure_count=min(int(row.family_available_count), int(row.regional_available_count))
+                                          ).set(int(row.available_count))
+        return [azslurm_partition_info]
+
+    def _parse_partitions(self, stdout) -> dict:
+        """
+        Parse azslurm partitions output into a {nodearray: nodelist} map.
+        """
+        node_list_map = {}
+        partition_name = None
+        for line in stdout.decode().strip().splitlines():
+            if line.startswith("#"):
+                continue
+            kv = dict(re.findall(r'(\w+)=(\S+)', line))
+            if "PartitionName" in kv:
+                partition_name = kv["PartitionName"]
+            if "Nodename" in kv and partition_name:
+                node_list_map[partition_name] = kv["Nodename"]
+                partition_name = None
+        return node_list_map
+
+    def _parse_limits(self, stdout):
+        """
+        Parse azslurm limits JSON output into namedtuples.
+ """ + for entry in json.loads(stdout.decode()): + yield self.limits_output._make(entry[f] for f in self.limits_default_fmt) + + async def azslurm_query(self) -> None: + """ + Query azslurm partitions and limits command and save parsed output in prometheus metrics + format to cache + """ + args_partitions = [self.binary_path] + args_limits = [self.binary_path] + args_partitions.extend(["partitions"]) + args_limits.extend(["limits"]) + + try: + proc_partitions = await self.run_command(timeout=self.timeout,*args_partitions) + proc_limits = await self.run_command(timeout=self.timeout,*args_limits) + except Exception as e: + log.error(e) + return + + self.cached_output["azslurm_metrics"] = self.parse_output(proc_partitions.stdout, proc_limits.stdout) + + diff --git a/azure-slurm-exporter/exporter/exporter.py b/azure-slurm-exporter/exporter/exporter.py new file mode 100644 index 00000000..6763ab56 --- /dev/null +++ b/azure-slurm-exporter/exporter/exporter.py @@ -0,0 +1,218 @@ +import asyncio +import logging +import logging.config +import signal +import sys +import time +import os +from importlib import resources +from prometheus_client import CollectorRegistry, Metric, Counter, Gauge, Summary +from abc import ABC, abstractmethod +from functools import partial +from collections import namedtuple +from aiohttp import web +from prometheus_client.aiohttp import make_aiohttp_handler +from typing import Iterator, List, Union + +log = logging.getLogger("root") +CommandResult = namedtuple("CommandResult", ["returncode", "stdout", "stderr"]) + +class NoCollectorsFoundException(Exception): + pass + +class HTTPServerFailedException(Exception): + pass + +class BaseCollector(ABC): + @abstractmethod + async def start(self) -> None: + """ + Begin collecting metrics asynchronously and runs it at regular + intervals as defined by the configured downstream interval. + """ + ... 
+ + @abstractmethod + def export_metrics(self) -> List[Union[Gauge,Counter,Summary]]: + """ + Return metrics in Prometheus-compatible format. + """ + ... + + async def run_command(self, *args, timeout=120,) -> CommandResult: + """ + Executes a command asynchronously in a subprocess, capturing both stdout + and stderr, with support for timeout handling. + """ + start = time.monotonic() + cmd_str = " ".join(str(a) for a in args) + proc = await asyncio.create_subprocess_exec(*args, stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE) + try: + async with asyncio.timeout(timeout): + stdout, stderr = await proc.communicate() + except TimeoutError: + proc.kill() + await proc.wait() + log.error("Command: %s timed out after %d seconds and was killed", cmd_str, timeout) + raise RuntimeError("Process timed out and was killed") + elapsed = time.monotonic() - start + log.debug("Command: %s, Exit code: %d, Time Elapsed: %f", cmd_str, proc.returncode, elapsed) + if stderr: + log.warning("stderr:\n%s", stderr.decode()) + return CommandResult(returncode=proc.returncode, stdout=stdout, stderr=stderr) + + def launch_task(self, func, interval) -> None: + """ + Launch an asynchronous task that executes a callable function at regular intervals. + """ + + asyncio.create_task(self.__schedule(func, interval)) + + async def __schedule(self, func, interval: int) -> None: + """ + Schedule a callable function to be executed periodically at specified intervals. + """ + if callable(func): + await func() + loop = asyncio.get_running_loop() + loop.call_later(interval, partial(self.launch_task, func, interval)) + else: + log.error("func %r is not callable", func) + +class AzslurmCollector: + def __init__(self): + self.collectors = [] + + def initialize_collectors(self) -> None: + """ + Initialize and start all collectors concurrently.
+ - Squeue: Collects job queue information + - Sacct: Collects job accounting data + - Sinfo: Collects node/partition state + - Azslurm: Collects partition specs + - Jetpack: Collects cluster specs + """ + try: + from exporter.squeue import Squeue, SqueueNotAvailException + squeue = Squeue() + squeue.initialize() + except SqueueNotAvailException: + log.warning("squeue is not available, disabling squeue metrics") + else: + self.collectors.append(squeue) + + try: + from exporter.sacct import Sacct, SacctNotAvailException + sacct = Sacct() + sacct.initialize() + except SacctNotAvailException: + log.warning("Accounting is disabled, disabling sacct metrics") + else: + self.collectors.append(sacct) + + try: + from exporter.sinfo import Sinfo, SinfoNotAvailException + sinfo = Sinfo() + sinfo.initialize() + except SinfoNotAvailException: + log.warning("sinfo is not available, disabling sinfo metrics") + else: + self.collectors.append(sinfo) + + try: + from exporter.azslurm import Azslurm, AzslurmNotAvailException + azslurm = Azslurm() + azslurm.initialize() + except AzslurmNotAvailException: + log.warning("azslurm is not available, disabling azslurm metrics") + else: + self.collectors.append(azslurm) + + try: + from exporter.jetpack import Jetpack, JetpackNotAvailException + jetpack = Jetpack() + jetpack.initialize() + except JetpackNotAvailException: + log.warning("jetpack is not available, disabling jetpack metrics") + else: + self.collectors.append(jetpack) + + if not self.collectors: + log.error("No collectors initialized") + raise NoCollectorsFoundException + for collector in self.collectors: + collector.start() + + def export_metrics(self) -> List[Union[Gauge,Counter,Summary]]: + """ + Collect and aggregate metrics from all configured collectors.
+ """ + + metrics = [] + for collector in self.collectors: + metrics.extend(collector.export_metrics()) + return metrics + + def collect(self) -> Iterator[Metric]: + """ + Collect and yield Prometheus metrics every scrape interval + """ + + metrics = self.export_metrics() + for metric in metrics: + yield from metric.collect() + + async def start_http_server(self, host:str, port:int) -> web.AppRunner: + """ + Initializes and starts an aiohttp web server that + exposes Prometheus metrics on the /metrics endpoint. The server listens on + the specified host and port. + """ + try: + registry = CollectorRegistry() + registry.register(self) + app = web.Application() + app.router.add_get("/metrics", make_aiohttp_handler(registry)) + runner = web.AppRunner(app) + await runner.setup() + site = web.TCPSite(runner, host, port) + await site.start() + log.info("Prometheus exporter serving on http://%s:%d/metrics", host, port) + return runner + except OSError as e: + log.error("Failed to bind to %s:%d - %s", host, port, e) + raise HTTPServerFailedException + except Exception as e: + log.error("Unexpected error starting HTTP server: %s", e) + raise HTTPServerFailedException + + +async def main(): + conf_file = resources.files("exporter").joinpath("exporter_logging.conf") + logging.config.fileConfig(str(conf_file)) + + loop = asyncio.get_running_loop() + stop_event = asyncio.Event() + + for sig in (signal.SIGINT, signal.SIGTERM): + loop.add_signal_handler(sig, stop_event.set) + + collector = AzslurmCollector() + + try: + collector.initialize_collectors() + except NoCollectorsFoundException: + sys.exit(1) + + try: + runner = await collector.start_http_server(host="0.0.0.0", port=9101) + except HTTPServerFailedException: + sys.exit(1) + + # Keep running until interrupted + try: + await stop_event.wait() + except asyncio.CancelledError: + pass + finally: + await runner.cleanup() \ No newline at end of file diff --git a/azure-slurm-exporter/exporter/exporter_logging.conf 
b/azure-slurm-exporter/exporter/exporter_logging.conf new file mode 100644 index 00000000..944f8a3d --- /dev/null +++ b/azure-slurm-exporter/exporter/exporter_logging.conf @@ -0,0 +1,31 @@ +[loggers] +keys=root + +[handlers] +keys=consoleHandler, fileHandler + +[formatters] +keys=simpleFormatter + +[logger_root] +level=DEBUG +handlers=consoleHandler, fileHandler + +[handler_fileHandler] +class=logging.handlers.RotatingFileHandler +level=DEBUG +formatter=simpleFormatter +args=("/var/log/azslurm-exporter.log", "a", 1024 * 1024 * 5, 5) + +[handler_consoleHandler] +class=StreamHandler +level=ERROR +formatter=simpleFormatter +args=(sys.stderr,) + +[formatter_simpleFormatter] +format=%(asctime)s %(levelname)s: %(message)s +datefmt=%Y-%m-%d %H:%M:%S + +[formatter_reproFormatter] +format=%(message)s \ No newline at end of file diff --git a/azure-slurm-exporter/exporter/jetpack.py b/azure-slurm-exporter/exporter/jetpack.py new file mode 100644 index 00000000..bad74528 --- /dev/null +++ b/azure-slurm-exporter/exporter/jetpack.py @@ -0,0 +1,70 @@ +from .exporter import BaseCollector +from collections import namedtuple +from prometheus_client import Gauge +import logging +import exporter.util as util +from typing import List +log = logging.getLogger(__name__) + +class JetpackNotAvailException(Exception): + pass + +class Jetpack(BaseCollector): + + def __init__(self, binary_path="/opt/cycle/jetpack/bin/jetpack", interval=86400, timeout=120): + self.binary_path = binary_path + self.interval = interval + self.timeout = timeout + self.cached_output = {"jetpack_metrics":[]} + self.default_options = ["config", "azure.metadata.compute.location", "None"] + + def initialize(self) -> None: + """ + Initialize the Jetpack instance by validating the binary + """ + if not util.is_file_binary(self.binary_path): + log.error(f"{self.binary_path} is not a file or not executable") + raise JetpackNotAvailException + + def start(self) -> None: + """ + Begin collecting metrics asynchronously and 
runs it at regular + intervals as defined by the configured downstream interval. + """ + self.launch_task(func=self.jetpack_query, interval=self.interval) + + def export_metrics(self) -> List[Gauge]: + """ + Return metrics in Prometheus-compatible format from cache. + """ + #TODO: DO we need to lock this? + return self.cached_output["jetpack_metrics"] + + def parse_output(self, stdout) -> None: + """ + Parse jetpack command stdout and return prometheus gauge for cluster specs + """ + jetpack_cluster_info = Gauge("jetpack_cluster_info", "Cluster Metadata", + labelnames=["region"], + registry=None) + region = stdout.decode().strip() + jetpack_cluster_info.labels(region=region).set(1) + return [jetpack_cluster_info] + + async def jetpack_query(self) -> None: + """ + Run jetpack query with default options and save parsed result in prometheus + metrics format to cache + """ + args = [self.binary_path] + args.extend(self.default_options) + + try: + proc = await self.run_command(timeout=self.timeout,*args) + except Exception as e: + log.error(e) + return + + self.cached_output["jetpack_metrics"] = self.parse_output(proc.stdout) + + diff --git a/azure-slurm-exporter/exporter/main.py b/azure-slurm-exporter/exporter/main.py new file mode 100644 index 00000000..58dcd5de --- /dev/null +++ b/azure-slurm-exporter/exporter/main.py @@ -0,0 +1,8 @@ +import asyncio +from exporter.exporter import main as async_main + +def main(): + asyncio.run(async_main()) + +if __name__=="__main__": + main() \ No newline at end of file diff --git a/azure-slurm-exporter/exporter/sacct.py b/azure-slurm-exporter/exporter/sacct.py new file mode 100644 index 00000000..65c7c25e --- /dev/null +++ b/azure-slurm-exporter/exporter/sacct.py @@ -0,0 +1,111 @@ +from .exporter import BaseCollector +from collections import namedtuple +from prometheus_client import Counter, disable_created_metrics +from datetime import datetime, timedelta +import logging +import exporter.util as util +from typing import List +log 
= logging.getLogger(__name__) + +class SacctNotAvailException(Exception): + pass + +class Sacct(BaseCollector): + + SLURM_EXIT_CODE_MAPPING = { + "0:0": "", + "1:0": "General failure", + "2:0": "Misuse of shell built-in", + "125:0": "Slurm Out of Memory Error", + "126:0": "Command invoked cannot execute", + "127:0": "Command not found", + "128:0": "Invalid argument to exit", + "129:0": "SIGHUP", + "130:0": "SIGINT - Ctrl+C", + "131:0": "SIGQUIT", + "134:0": "SIGABRT", + "137:0": "SIGKILL - Force killed", + "139:0": "SIGSEGV - Segfault", + "141:0": "SIGPIPE", + "143:0": "SIGTERM - Terminated", + "152:0": "SIGXCPU - CPU limit", + "153:0": "SIGXFSZ - File size limit", + } + + def __init__(self, binary_path="/usr/bin/sacct", interval=300, timeout=120): + self.binary_path = binary_path + self.interval = interval + self.timeout = timeout + self.sacct_terminal_jobs= Counter("sacct_terminal_jobs","Total Number of completed slurm jobs", + ["partition", "exit_code","reason","state", "nodelist"], registry=None) + self.default_output_fmt = "jobid,jobname,nodelist,nnodes,partition,exitcode,derivedexitcode,state,user,start,submit,end,reason" + self.sacct_output = namedtuple("sacct_output", self.default_output_fmt) + self.terminal_states = "completed,failed,cancelled,timeout,node_fail,preempted,out_of_memory,deadline,boot_fail" + self.starttime = (datetime.now() - timedelta(hours=1)).strftime("%Y-%m-%dT%H:%M:%S") + self.endtime = datetime.now().strftime("%Y-%m-%dT%H:%M:%S") + self.default_options = ["-P" ,"-n", "-o", self.default_output_fmt, "-s", self.terminal_states, "--allocations" ] + + def initialize(self) -> None: + """ + Initialize the Sacct instance by validating the binary and disabling created metrics. 
+ """ + if not util.is_file_binary(self.binary_path): + log.error(f"{self.binary_path} is not a file or not executable") + raise SacctNotAvailException + disable_created_metrics() + + def start(self) -> None: + """ + Begin collecting metrics asynchronously and runs it at regular + intervals as defined by the configured downstream interval. + """ + self.launch_task(func=self.sacct_query, interval=self.interval) + + def export_metrics(self) -> List[Counter]: + """ + Return metrics in Prometheus-compatible format. + """ + #TODO: DO we need to lock this? + return [self.sacct_terminal_jobs] + + def parse_output(self, stdout) -> None: + """ + Parse sacct command output and increment terminal job metrics. + """ + lines = stdout.decode().strip().splitlines() + log.debug(f"Number of jobs:{len(lines)}") + lines_iter = (line.split("|") for line in lines) + for row in map(self.sacct_output._make, lines_iter): + reason = self.SLURM_EXIT_CODE_MAPPING.get(row.exitcode, "") + self.sacct_terminal_jobs.labels( + partition=row.partition, + exit_code=row.exitcode, + reason=reason, + state=row.state, + nodelist=row.nodelist).inc() + + async def sacct_query(self) -> None: + """ + Queries sacct filtering by a time window of size `interval`, which represents + the frequency at which queries are executed. 
+ + The time window progresses as follows: + - The start time of the current query is set to the end time of the previous query + - The end time is set to the current moment when the query is executed + - After execution, starttime is updated to the current endtime for the next query iteration + """ + args = [] + self.endtime = datetime.now().strftime("%Y-%m-%dT%H:%M:%S") + args.append(self.binary_path) + args.extend(self.default_options) + args.extend(["--starttime", self.starttime, "--endtime", self.endtime]) + log.debug(f"running sacct query between {self.starttime} and {self.endtime}") + try: + proc = await self.run_command(timeout=self.timeout,*args) + except Exception as e: + log.error(e) + return + self.starttime=self.endtime + self.parse_output(proc.stdout) + + diff --git a/azure-slurm-exporter/exporter/sinfo.py b/azure-slurm-exporter/exporter/sinfo.py new file mode 100644 index 00000000..aa44809e --- /dev/null +++ b/azure-slurm-exporter/exporter/sinfo.py @@ -0,0 +1,106 @@ +from .exporter import BaseCollector +from collections import namedtuple +from prometheus_client import Gauge +import logging +import exporter.util as util +from typing import List + +log = logging.getLogger(__name__) + +class SinfoNotAvailException(Exception): + pass + +class Sinfo(BaseCollector): + + # Suffix meanings from sinfo man page + STATE_SUFFIXES = { + "*": "not_responding", + "~": "powered_off", + "#": "powering_up", + "!": "pending_power_down", + "%": "powering_down", + "$": "maintenance_reservation", + "@": "pending_reboot", + "^": "reboot_issued", + "-": "planned_backfill", + } + + def __init__(self, binary_path="/usr/bin/sinfo", interval=30, timeout=15): + self.binary_path = binary_path + self.interval = interval + self.timeout = timeout + self.cached_output = {"sinfo_query":[]} + self.default_output_fmt = f"%N|%D|%R|%E|%T" + self.default_output_headers = "nodelist,nodes,partition,reason,state" + self.sinfo_output = namedtuple("sinfo_output", self.default_output_headers) + 
self.default_options = ["-h", "-o", self.default_output_fmt] + + def initialize(self) -> None: + """ + Initialize the Sinfo object by validating the binary path. + """ + if not util.is_file_binary(self.binary_path): + log.error(f"{self.binary_path} is not a file or not executable") + raise SinfoNotAvailException + + def start(self) -> None: + """ + Begin collecting metrics asynchronously and repeat the collection at regular + intervals as defined by the configured interval. + """ + self.launch_task(func=self.sinfo_query, interval=self.interval) + + def export_metrics(self) -> List[Gauge]: + """ + Return metrics in Prometheus-compatible format from cache. + """ + return self.cached_output["sinfo_query"] + + def normalize_state(self, state) -> str: + """ + Normalize a Slurm node state. If the state ends with one of the sinfo suffix + characters in STATE_SUFFIXES, return that suffix's meaning (e.g. "idle~" becomes + "powered_off"); otherwise return the state unchanged. + """ + suffix = state[-1:] + if suffix in self.STATE_SUFFIXES: + return self.STATE_SUFFIXES[suffix] + else: + return state + + def parse_output(self, stdout) -> List[Gauge]: + """ + Parse the output from the sinfo command and create a Gauge metric that tracks each nodelist's + state per partition. + """ + sinfo_partitions_nodes_state = Gauge( + "sinfo_partition_nodes_state", + "Number of nodes in a state per partition", + labelnames=['node_list','partition','state', 'reason'], registry=None + ) + + lines = stdout.decode().strip().splitlines() + lines_iter = (line.split("|") for line in lines) + for row in map(self.sinfo_output._make, lines_iter): + state = self.normalize_state(row.state) + sinfo_partitions_nodes_state.labels(node_list=row.nodelist, + partition=row.partition, + state=state, + reason=row.reason).set(float(row.nodes)) + return [sinfo_partitions_nodes_state] + + + async def sinfo_query(self) -> None: + """ + Execute sinfo command asynchronously and cache the parsed output in prometheus metric format.
+ """ + args = [self.binary_path] + args.extend(self.default_options) + try: + proc = await self.run_command(timeout=self.timeout,*args) + except Exception as e: + log.error(e) + return + output = self.parse_output(proc.stdout) + #TODO: DO we need to lock this? + self.cached_output["sinfo_query"] = output + diff --git a/azure-slurm-exporter/exporter/squeue.py b/azure-slurm-exporter/exporter/squeue.py new file mode 100644 index 00000000..46e03824 --- /dev/null +++ b/azure-slurm-exporter/exporter/squeue.py @@ -0,0 +1,122 @@ +from .exporter import BaseCollector +from collections import namedtuple +from prometheus_client import Gauge +from dataclasses import dataclass, field +import logging +import exporter.util as util +from typing import List +# @dataclass +# class SqueueMetrics: +# squeue_partition_jobs_state: GaugeMetricFamily = GaugeMetricFamily( +# f"squeue_partition_jobs_state", +# f"Number of jobs in a state per partition", +# labels=['partition','state'], +# ) + +# squeue_job_nodes_allocated: GaugeMetricFamily = GaugeMetricFamily( +# "squeue_job_nodes_allocated", +# "Number of nodes allocated to a running job", +# labels=["job_id", "job_name", "partition", "state"], +# ) + +# def add(self, label: list, value): + + + +log = logging.getLogger(__name__) + +class SqueueNotAvailException(Exception): + pass + +class Squeue(BaseCollector): + + def __init__(self, binary_path="/usr/bin/squeue", interval=60, timeout=30): + self.binary_path = binary_path + self.interval = interval + self.timeout = timeout + self.cached_output = {"squeue_metrics":[]} + self.default_output_fmt = f"%i|%j|%D|%N|%P|%T|%V|%u" + self.default_output_headers = "jobid,name,nodes,nodelist,partition,state,submit_time,user" + self.squeue_output = namedtuple("squeue_output", self.default_output_headers) + self.default_options = ["-h", "-o", self.default_output_fmt] + + def initialize(self) -> None: + """ + Initialize the Squeue instance and validate the binary executable. 
+ """ + if not util.is_file_binary(self.binary_path): + log.error(f"{self.binary_path} is not a file or not executable") + raise SqueueNotAvailException + + def start(self) -> None: + """ + Begin collecting metrics asynchronously and runs it at regular + intervals as defined by the configured downstream interval. + """ + self.launch_task(func=self.squeue_query, interval=self.interval) + + def export_metrics(self) -> List[Gauge]: + """ + Return metrics in Prometheus-compatible format from cache. + """ + return self.cached_output["squeue_metrics"] + + def parse_output(self,stdout) -> List[Gauge]: + """ + Parse the stdout from an squeue command and generates two Prometheus + Gauge metrics: + + 1. squeue_partition_jobs_state: Tracks the number of jobs in each state per partition + - Labels: partition, state + + 2. squeue_job_nodes_allocated: Tracks the number of nodes allocated to running jobs + - Labels: job_id, job_name, partition, state + - Only populated for jobs in "running" state + """ + squeue_partition_jobs_state = Gauge( + f"squeue_partition_jobs_state", + f"Number of jobs in a state per partition", + labelnames=['partition','state'], registry=None + ) + + squeue_job_nodes_allocated = Gauge( + "squeue_job_nodes_allocated", + "Number of nodes allocated to a running job", + labelnames=["job_id", "job_name", "partition", "state", "nodelist"], registry=None + ) + # number of jobs per state,partition key + counts = {} + lines = stdout.decode().strip().splitlines() + lines_iter = (line.split("|") for line in lines) + + for row in map(self.squeue_output._make, lines_iter): + if row.state.lower() == "running": + squeue_job_nodes_allocated.labels(job_id=row.jobid, + job_name=row.name, + partition=row.partition, + state=row.state, + nodelist=row.nodelist).set(float(row.nodes)) + key = (row.partition, row.state.lower()) + counts[key] = counts.get(key, 0) + 1 + + for (partition, state), count in counts.items(): + squeue_partition_jobs_state.labels(partition=partition, + 
state=state).set(count) + + return [squeue_partition_jobs_state, squeue_job_nodes_allocated] + + async def squeue_query(self) -> None: + """ + Execute an squeue query asynchronously and cache the parsed result in prometheus + metrics format. + """ + args = [self.binary_path] + args.extend(self.default_options) + try: + proc = await self.run_command(timeout=self.timeout,*args) + except Exception as e: + log.error(e) + return + output = self.parse_output(proc.stdout) + #TODO: DO we need to lock this? + self.cached_output["squeue_metrics"] = output diff --git a/azure-slurm-exporter/exporter/util.py b/azure-slurm-exporter/exporter/util.py new file mode 100644 index 00000000..274fc51d --- /dev/null +++ b/azure-slurm-exporter/exporter/util.py @@ -0,0 +1,7 @@ +import os + +def is_file_binary(binary) -> bool: + """ + Check if a file exists and is executable. + """ + return os.path.isfile(binary) and os.access(binary, os.X_OK) \ No newline at end of file diff --git a/azure-slurm-exporter/install.sh b/azure-slurm-exporter/install.sh new file mode 100755 index 00000000..3ca3da41 --- /dev/null +++ b/azure-slurm-exporter/install.sh @@ -0,0 +1,117 @@ +#!/usr/bin/env bash +# Copyright (c) Microsoft Corporation. All rights reserved. +# Licensed under the MIT License. +# +set -e + +find_python3() { + export PATH=$(echo $PATH | sed -e 's/\/opt\/cycle\/jetpack\/system\/embedded\/bin://g' | sed -e 's/:\/opt\/cycle\/jetpack\/system\/embedded\/bin//g') + if [ ! -z $AZSLURM_PYTHON_PATH ]; then + echo $AZSLURM_PYTHON_PATH + return 0 + fi + for version in $( seq 11 20 ); do + which python3.$version + if [ $? == 0 ]; then + return 0 + fi + done + echo Could not find python3 version 3.11 >&2 + return 1 +} + +setup_venv() { + + set -e + + $PYTHON_PATH -c "import sys; sys.exit(0)" || (echo "$PYTHON_PATH is not a valid python3 executable. Please install python3.11 or higher." 
&& exit 1) + $PYTHON_PATH -m pip --version > /dev/null || $PYTHON_PATH -m ensurepip + $PYTHON_PATH -m venv $VENV + + set +e + source $VENV/bin/activate + set -e + + if ! pip install --force-reinstall $PACKAGE; then + echo "ERROR: Failed to install $PACKAGE" + deactivate || true + exit 1 + fi + +} +add_scraper() { + # If az_exporter is already configured, do not add it again + if grep -q "azslurm_exporter" $PROM_CONFIG; then + echo "AzSlurm Exporter is already configured in Prometheus" + return 0 + fi + INSTANCE_NAME=$(hostname) + + cat > azslurm-exporter.yml <<-EOF + scrape_configs: + - job_name: azslurm_exporter + static_configs: + - targets: ["instance_name:9101"] + relabel_configs: + - source_labels: [__address__] + target_label: instance + regex: '([^:]+)(:[0-9]+)?' + replacement: '\${1}' +EOF + + yq eval-all '. as $item ireduce ({}; . *+ $item)' $PROM_CONFIG azslurm-exporter.yml > tmp.yml + mv -vf tmp.yml $PROM_CONFIG + + # update the configuration file + sed -i "s/instance_name/$INSTANCE_NAME/g" $PROM_CONFIG +} + + +setup_azslurm_exporter() { + cat > /etc/systemd/system/azslurm-exporter.service <=68.0", "wheel"] +build-backend = "setuptools.build_meta" + +[project] +name = "azure-slurm-exporter" +version = "0.1.0" +description = "Prometheus exporter for Azure CycleCloud Slurm metrics" +requires-python = ">=3.11" +license = "MIT" +authors = [ + { name = "Azure CycleCloud Team" }, +] +dependencies = [ + "prometheus_client", + "aiohttp" +] + +[project.optional-dependencies] +dev = [ + "pytest", +] + +[project.scripts] +azslurm-exporter = "exporter.main:main" + +[tool.setuptools.packages.find] +include = ["exporter*"] + +[tool.setuptools.package-data] +exporter = ["*.conf"] \ No newline at end of file diff --git a/images/azslurmexporterdash.png b/images/azslurmexporterdash.png new file mode 100755 index 00000000..0c1217e7 Binary files /dev/null and b/images/azslurmexporterdash.png differ
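Note on the sliding time window described in the `Sacct.sacct_query` docstring in this change: the behavior can be sketched in isolation as below. This is a minimal illustration only, not code from the diff. `QueryWindow`, `next_args`, and `commit` are hypothetical names. The sketch assumes, as the diff appears to do, that the first query looks back one hour and that the start time advances only after a query succeeds, so a failed query is retried over a wider window.

```python
from datetime import datetime, timedelta

# Timestamp format used for --starttime/--endtime in the diff.
FMT = "%Y-%m-%dT%H:%M:%S"


class QueryWindow:
    """Hypothetical helper (not part of the diff) tracking the sacct time window."""

    def __init__(self, now: datetime, lookback: timedelta = timedelta(hours=1)):
        # The first query covers the hour before startup.
        self.start = now - lookback

    def next_args(self, now: datetime) -> list:
        """Return the --starttime/--endtime arguments for a query issued at `now`."""
        return ["--starttime", self.start.strftime(FMT),
                "--endtime", now.strftime(FMT)]

    def commit(self, now: datetime) -> None:
        """Advance the window; call this only after the query succeeded."""
        self.start = now


t0 = datetime(2024, 1, 1, 12, 0, 0)
window = QueryWindow(now=t0)
first = window.next_args(t0)        # covers 11:00:00 to 12:00:00
window.commit(t0)                   # query succeeded, so advance the start
second = window.next_args(t0 + timedelta(minutes=5))
```

Because the start of each window equals the end of the previous successful one, consecutive windows neither overlap nor leave gaps, which is what lets the `sacct_terminal_jobs` counter be incremented once per job.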