charmed-hpc · NucciTheBoss · Feb 26, 2026 · Feb 25, 2026 · Feb 25, 2026 · Feb 25, 2026
@@ -52,6 +52,12 @@ Graphics Processing Unit (GPU)
     A specialized processor that is designed to accelerate image processing and graphics rendering
     for output to a display device.
 
+Grafana
+    An open-source platform for data visualization, monitoring, and observability. Used to create
+    dashboards and graphs from time series data stored in various data sources.
+
+    Resources: [Grafana website {octicon}`link-external`](https://grafana.com/), [Grafana documentation {octicon}`link-external`](https://grafana.com/docs/), [Grafana charm {octicon}`link-external`](https://charmhub.io/grafana-k8s)
+
 High-Performance Computing (HPC)
     The practice of aggregating computing power using clusters and parallel processing to complete
     tasks faster than standard computing.
@@ -78,6 +84,12 @@ Juju
 
     Resources: [Juju documentation {octicon}`link-external`](https://documentation.ubuntu.com/juju/latest/)
 
+Loki
+    An open-source log aggregation system designed to store and query logs from applications and
+    infrastructure. Designed to be cost-effective and easy to operate.
+
+    Resources: [Grafana Loki website {octicon}`link-external`](https://grafana.com/oss/loki/), [Loki documentation {octicon}`link-external`](https://grafana.com/docs/loki/latest/), [Loki charm {octicon}`link-external`](https://charmhub.io/loki-k8s)
+
 MicroCeph
     A tool that simplifies deployment and management of Ceph storage both standalone and in a
     charmed environment using Juju.
@@ -106,6 +118,12 @@ Proxy charm
     An intermediary charm that enables charms to integrate with non-charmed applications. Also known
     as an integrator charm.
 
+Prometheus
+    An open-source monitoring and alerting system that collects and stores metrics as time series
+    data. Features a flexible query language (PromQL) and built-in alerting capabilities.
+
+    Resources: [Prometheus website {octicon}`link-external`](https://prometheus.io/), [Prometheus documentation {octicon}`link-external`](https://prometheus.io/docs/), [Prometheus charm {octicon}`link-external`](https://charmhub.io/prometheus-k8s)
+
 `sackd`
     Slurm Auth and Credential Kiosk daemon. Typically used to provide cluster login nodes.
 

@@ -1,11 +1,15 @@
+---
+relatedlinks: "[Grafana&#32;dashboards&#32;documentation](https://grafana.com/docs/grafana/latest/dashboards/)"
+---
+
 (reference-monitoring-grafana)=
 # Grafana dashboards
 
 This is an overview of all the charms used in Charmed HPC that provide dashboards for
-[Grafana](https://grafana.com/), which acts as a web interface to visualize data from aggregators such
-as [Prometheus](https://prometheus.io/) or [Loki](https://grafana.com/oss/loki/). See
-[Grafana dashboards](https://grafana.com/docs/grafana/latest/dashboards/) for more general information on dashboards,
-and {ref}`reference-monitoring-prometheus` for more information about the metrics displayed on the dashboards.
+{term}`Grafana`, which acts as a web interface to visualize data from aggregators such
+as {term}`Prometheus` or {term}`Loki`.
+
+See {ref}`howto-manage-integrate-with-cos` for more information on integrating Charmed HPC with COS.
 
 :::{admonition} Panel query
 :class: note
@@ -16,39 +20,62 @@ to see the exact query used to provide the panel with data.
 
 ## Slurmctld
 
-The dashboard from the `slurmctld` charm displays an overall view of the cluster, including the following information:
+The dashboards from the {term}`slurmctld` charm display information about the
+entire cluster, each partition, and each node.
+
+### Cluster Overview
+
+The "Cluster Overview" dashboard provides a display of cluster-level metrics such as:
+
+- Total resource utilization
+- Job status distribution
+- Node state distribution
+- Scheduler metrics
+
+![Grafana Cluster Overview dashboard showing total resource utilization, job state distribution, node state distribution, and scheduler metrics for the Charmed HPC cluster](/reuse/reference/monitoring/cluster-overview.png)
+
+### Partition Overview
+
+The "Partition Overview" dashboard provides a display of partition-level metrics such as:
+
+- Total nodes and jobs in the partition
+- Total resource utilization for the partition
+- Job status distribution for jobs in the partition
+- Node state distribution for all nodes in the partition
+
+![Grafana Partition Overview dashboard showing total nodes and jobs, resource utilization, job status distribution, and node state distribution for a specific partition](/reuse/reference/monitoring/partition-overview.png)
+
+### Node Overview
+
+The "Node Overview" dashboard provides a display of node-level metrics such as:
 
-- CPU and memory usage per partition.
-- Node state count.
-- CPU and memory usage per account.
-- Statistics on Slurmctld RPC messages.
+- Available resources that are allocatable for jobs
+- Total resource utilization on the node
 
-![Slurm partition dashboard](../../imgs/slurm_partition_grafana.png)
-![Slurm account dashboard](../../imgs/slurm_account_grafana.png)
-![Slurm rpc dashboard](../../imgs/slurm_rpc_grafana.png)
+![Grafana Node Overview dashboard showing node state, resource utilization, running jobs, and hardware configuration for a specific compute node](/reuse/reference/monitoring/node-overview.png)
 
 ## MySQL
 
 The dashboard from the `mysql` charm displays metrics for the storage database of Slurmdbd:
 
-- Uptime.
-- Queries per second.
-- Current cache size.
-- Maximum number of concurrent connections.
-- Thread resource usage.
-- Network traffic statistics.
+- Uptime
+- Queries per second
+- Current cache size
+- Maximum number of concurrent connections
+- Thread resource usage
+- Network traffic statistics
 
-![MySQL dashboard](../../imgs/mysql_grafana.png)
+![Grafana MySQL dashboard showing database metrics including uptime, queries per second, cache size, concurrent connections, and thread count](/reuse/reference/monitoring/mysql-exporter.png)
 
 ## Traefik K8s
 
 The dashboard from the `traefik-k8s` charm displays metrics about the reverse proxy used when communicating
 between the compute plane cluster and the monitoring/identity k8s clusters. This includes:
 
-- Uptime.
-- HTTP response code statistics.
-- Response times.
-- Open connections statistics.
-- Raw logs for every proxied endpoint.
+- Uptime
+- Response times
+- HTTP response code statistics
+- Open connection statistics
+- Raw logs for every proxied endpoint
 
-![Traefik dashboard](../../imgs/traefik_grafana.png)
+![Grafana Traefik dashboard showing reverse proxy metrics including uptime, response times, HTTP response code statistics, and open connection statistics](/reuse/reference/monitoring/traefik.png)
@@ -1,9 +1,9 @@
 (reference-monitoring)=
 # Monitoring
 
-Integrating COS with Charmed HPC enables monitoring of resources. This integration includes Prometheus, Grafana, and Loki.
-
-
+Integrating COS with Charmed HPC enables you to monitor your Charmed HPC cluster's resources
+with {term}`Prometheus`, {term}`Grafana`, and {term}`Loki`. The reference material in
+this section lists the dashboards and metrics from Charmed HPC that you can interact with.
 
 - {ref}`Prometheus: metrics aggregation and alerts <reference-monitoring-prometheus>`
 - {ref}`Grafana: Dashboards and resource visualizations <reference-monitoring-grafana>`
@@ -17,5 +17,4 @@ Integrating COS with Charmed HPC enables monitoring of resources. This integrati
 Prometheus <prometheus>
 Grafana <grafana>
 Loki <loki>
-
 ```
@@ -1,8 +1,8 @@
 (reference-monitoring-loki)=
 # Loki logs
 
-The following table lists all the charms used as part of Charmed HPC that expose logs to Loki, and the
-corresponding query to see the exported logs in the Grafana UI.
+The following table lists all the charms used as part of Charmed HPC that expose logs to {term}`Loki`, and the
+corresponding query to see the exported logs in {term}`Grafana`.
 Follow the [Visualize log data](https://grafana.com/docs/loki/latest/visualize/grafana/#grafana-explore)
 tutorial from the Grafana documentation for instructions on where and how to query for Loki logs.
 

@@ -2,20 +2,20 @@
 # Prometheus metrics and alerts
 
 This is an overview of all the charms used in Charmed HPC that provide monitoring metrics and alerts
-for [Prometheus](https://prometheus.io), a metrics aggregator and alerts manager for applications.
+for {term}`Prometheus`, a metrics aggregator and alerts manager for applications.
 
-All metrics and alerts can be viewed from Prometheus or from the [Grafana](https://grafana.com) web interface.
+All metrics and alerts can be viewed from Prometheus or from the {term}`Grafana` web interface.
 See {ref}`howto-manage-integrate-with-cos` for more information.
 
-The following table lists all the charms on Charmed HPC that expose metrics and alerts to Prometheus,
+The following table lists all the charms on Charmed HPC that expose metrics and alerts to Prometheus
 with their corresponding upstream documentation to know more about the metrics exported. The last
-column shows the corresponding query to list the exported metrics in Prometheus or the Grafana UI.
+column shows the corresponding query to list the exported metrics in Prometheus or Grafana.
 
 :::{csv-table}
 :header: >
 : charm, upstream docs, query
 
-slurmctld, [Documentation](https://github.com/rivosinc/prometheus-slurm-exporter), `{juju_charm="slurmctld"}`{l=javascript}
+slurmctld, [Documentation](https://slurm.schedmd.com/metrics.html), `{juju_charm="slurmctld"}`{l=javascript}
 mysql, [Documentation](https://charmhub.io/mysql), `{juju_charm="mysql"}`{l=javascript}
 postgresql-k8s, [Documentation](https://charmhub.io/postgresql-k8s), `{juju_charm="postgresql-k8s"}`{l=javascript}
 glauth-k8s, [Documentation](https://charmhub.io/glauth-k8s), `{juju_charm="glauth-k8s"}`{l=javascript}
@@ -26,7 +26,7 @@ traefik-k8s, [Documentation](https://charmhub.io/traefik-k8s), `{juju_charm="tra
 
 The `slurmctld` charm exposes metrics related to:
 
-- Resource usage per partition, account or user.
-- Jobs statuses.
-- RPC messages for `slurmctld`.
-- Prometheus Slurm Exporter statistics.
+- Job and node statuses.
+- Resource usage for each partition, node, Slurm account or user.
+- Cluster-wide information such as total CPU or memory utilization.
+- Scheduler information such as scheduling cycle times and queue lengths.
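
The label selectors in the query column of the table, such as `{juju_charm="slurmctld"}`, can also be sent to the Prometheus HTTP API instead of being typed into the Prometheus or Grafana UI. The following is a minimal sketch, not part of the documented change: the base URL `http://prometheus.example:9090` and the helper names are hypothetical, and the address of a real Prometheus instance depends on how COS is deployed. The sketch only builds the request URLs and does not contact a server.

```python
from urllib.parse import urlencode

# Hypothetical address (assumption); replace with your Prometheus endpoint.
PROMETHEUS_URL = "http://prometheus.example:9090"


def charm_selector(charm: str) -> str:
    """Return the label selector from the table, e.g. {juju_charm="slurmctld"}."""
    return f'{{juju_charm="{charm}"}}'


def instant_query_url(charm: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build an instant-query URL returning the current samples for a charm."""
    params = urlencode({"query": charm_selector(charm)})
    return f"{base_url}/api/v1/query?{params}"


def metric_names_url(charm: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build a URL listing the metric names of series matching the charm selector."""
    params = urlencode({"match[]": charm_selector(charm)})
    return f"{base_url}/api/v1/label/__name__/values?{params}"


if __name__ == "__main__":
    # Print the URLs; fetch them with any HTTP client to see the JSON response.
    print(instant_query_url("slurmctld"))
    print(metric_names_url("slurmctld"))
```

Listing the values of the `__name__` label is a quick way to see which metrics a charm exports before building a dashboard panel or an alert rule.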